data frame error: "arguments imply differing number of rows"

whyisitsocomplicated · March 6, 2023, 3:55pm

Hello!

I keep getting this error when I run my program. I have seen many people struggling with the same problem, but since I am a beginner with RStudio and struggle a lot with computers, I don't know how to solve it by myself.

The program I use comes from this tutorial (sorry, it's in French). It is supposed to properly format a corpus of newspaper articles (in HTML format) so that a lexicometric analysis can be performed with Iramuteq.

Here is the whole part that I'm struggling with (I'm sorry if it's long, I have no idea what I should include or leave out).

setwd(dir = "D:/testrstudio")

load.lib <- c("xml2", "XML", "stringr", "stringdist", "stringi","lubridate", "dplyr", "tidyr", "ggplot2")

install.lib <- load.lib[!load.lib %in% installed.packages()] 

for (lib in install.lib) install.packages(lib,dependencies=TRUE) 

sapply(load.lib,require,character=TRUE) 
#> Le chargement a nécessité le package : xml2
#> Le chargement a nécessité le package : XML
#> Le chargement a nécessité le package : stringr
#> Le chargement a nécessité le package : stringdist
#> Le chargement a nécessité le package : stringi
#> Le chargement a nécessité le package : lubridate
#> 
#> Attachement du package : 'lubridate'
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     date, intersect, setdiff, union
#> Le chargement a nécessité le package : dplyr
#> 
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#> 
#>     filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Le chargement a nécessité le package : tidyr
#> 
#> Attachement du package : 'tidyr'
#> L'objet suivant est masqué depuis 'package:stringdist':
#> 
#>     extract
#> Le chargement a nécessité le package : ggplot2
#>       xml2        XML    stringr stringdist    stringi  lubridate      dplyr 
#>       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
#>      tidyr    ggplot2 
#>       TRUE       TRUE

LIRE <- function(html) {
  
  doc <- htmlParse(html) 
  
  articles <- getNodeSet(doc, "//article") 
  
  journal <- sapply(articles, function(art) {
    journ <- xpathSApply(art, "./header/div[1]/span/text()", xmlValue)
    journ[[1]]
  })
  
  
  auteur <- sapply(articles, function(art) { 
    aut <- xpathSApply(art, "./header/div[@class='docAuthors']/text()", xmlValue)
    aut <- aut[[1]]
    if (is.null(aut)) aut <- NA
    aut
  })
  
  titre <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, "./header/div[@class='titreArticle']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    
    str_trim(tmp)
  })
  
  date <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='publiC-lblNodoc']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- substr(tmp, 6, 13)
    tmp
  })
  date <- as.Date(date, "%Y%m%d") 
  
  texte <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='DocText clearfix']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    str_trim(tmp)
  })
  
  
  txt <- data.frame(Journal = journal,
                    Titre = titre,
                    Date = date,
                    Auteur = auteur,
                    Texte = texte)  
  txt <- subset(txt, !is.na(Journal) & !is.na(Titre))
  
  txt
  
}

lire_dossier <- function(chemin) {
  
  list<-list.files(chemin, pattern= ".HTML", full.names=TRUE, recursive=TRUE)
  
  l <- lapply(list, function(file) {
    print(file)
    LIRE(html=file)
  })
  bind_rows(l)
  
}

test <- lire_dossier("D:/testrstudio/datas")
#> [1] "D:/testrstudio/datas/biblioeuropresse20230305152505.HTML"
#> Error in data.frame(Journal = journal, Titre = titre, Date = date, Auteur = auteur, : les arguments impliquent des nombres de lignes différents : 0, 50
write.csv2(test, file="test1.csv", row.names = FALSE)
#> Error in is.data.frame(x): objet 'test' introuvable

Created on 2023-03-06 with reprex v2.0.2

The error is in French, but it is the same as "arguments imply differing number of rows".

I can't figure out how to attach the HTML file that I use with this program. If you need it, let me know and I'll try to find another way to send it to you!

technocrat · March 6, 2023, 6:35pm

is where the error message arises. If the file

is the first file returned from

then it could possibly be something that is happening with

should be ok with list objects.

Could you take the first two of the html files and create two objects, l1 and l2 with your LIRE() function and try

bind_rows(l1,l2)

to help see what's happening?
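
A minimal sketch of that test, assuming a stand-in reader in place of LIRE() and a temporary folder in place of D:/testrstudio/datas (read_one, demo_dir and the three file names are hypothetical, used only so the snippet runs on its own). It picks the first two paths returned by list.files() instead of reading the whole folder:

```r
# Hypothetical stand-in for LIRE(): one row per file, no HTML parsing.
read_one <- function(path) {
  data.frame(file = basename(path), stringsAsFactors = FALSE)
}

# Temporary folder with three dummy files (assumption: replaces the real corpus).
demo_dir <- tempfile("datas_demo")
dir.create(demo_dir)
for (nm in c("a.HTML", "b.HTML", "c.HTML")) {
  writeLines("<html></html>", file.path(demo_dir, nm))
}

# List the files as lire_dossier() does, then keep only the first two.
files <- list.files(demo_dir, pattern = "\\.HTML$", full.names = TRUE)
first_two <- head(files, 2)

l1 <- read_one(first_two[1])
l2 <- read_one(first_two[2])

# Base-R equivalent of dplyr::bind_rows(l1, l2) for two data frames
# that share the same columns.
both <- rbind(l1, l2)
print(both)
```

With the real data, read_one() would be replaced by LIRE() and demo_dir by the corpus folder. If LIRE() already fails on the first file, the error comes from LIRE() and not from bind_rows().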

whyisitsocomplicated · March 7, 2023, 10:20am

Hello, thank you so much for your answer!

As I said, I'm not good with RStudio, since I downloaded it two days ago, so I'm not sure how to create two objects in the program.

I just copy-pasted the whole thing from the website into my original post, and the part you are talking about was designed to read an entire folder, so I don't really know how to change it to make it read only two files.

nirgrahamuk · March 7, 2023, 11:30am

I think the failure is probably in the data.frame construction inside LIRE itself.
Do you have hundreds of files, or a handful?

One way to run your code on just two files is to move all but two of them out of the folder into some other folder.

Because I suspect LIRE, I would write a helper function, plength, that prints the length of each piece that goes into the data.frame, so I can see where the numbers differ.

Something like:

plength <- function(x){
  xt <-substitute(x)
  print(paste(xt,length(x)))
}

LIRE <- function(html) {
  
  doc <- htmlParse(html) 
  
  articles <- getNodeSet(doc, "//article") 
  
  journal <- sapply(articles, function(art) {
    journ <- xpathSApply(art, "./header/div[1]/span/text()", xmlValue)
    journ[[1]]
  })
 
  
  
  auteur <- sapply(articles, function(art) { 
    aut <- xpathSApply(art, "./header/div[@class='docAuthors']/text()", xmlValue)
    aut <- aut[[1]]
    if (is.null(aut)) aut <- NA
    aut
  })

  
  titre <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, "./header/div[@class='titreArticle']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    
    str_trim(tmp)
  })
 
  
  date <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='publiC-lblNodoc']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- substr(tmp, 6, 13)
    tmp
  })
  date <- as.Date(date, "%Y%m%d") 
 
  
  texte <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='DocText clearfix']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    str_trim(tmp)
  })
  
  plength(journal)
  plength(titre)
  plength(date)
  plength(auteur)
  plength(texte)
  print("__end__")
  txt <- data.frame(Journal = journal,
                    Titre = titre,
                    Date = date,
                    Auteur = auteur,
                    Texte = texte)  
  txt <- subset(txt, !is.na(Journal) & !is.na(Titre))
  
  txt
  
}
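
To see what plength prints, here is a small self-contained sketch on toy vectors (the names a and b are made up for illustration). It shows a length mismatch of the kind data.frame complains about:

```r
# Same helper as above: print the name of the argument and its length.
plength <- function(x) {
  xt <- substitute(x)          # captures the expression passed in, e.g. `a`
  print(paste(xt, length(x)))
}

a <- c("x", "y", "z")  # toy vector of length 3
b <- c(1, 2)           # toy vector of length 2

plength(a)
plength(b)

# The same information for several objects at once, using base R's lengths().
sizes <- lengths(list(a = a, b = b))
print(sizes)
```

Passing a and b together to data.frame() would fail, because 3 is not a multiple of 2.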

whyisitsocomplicated · March 7, 2023, 8:24pm

Previously, I used only one HTML file containing 50 papers to test the program.

This time, I tried the new program you sent me with only two HTML files containing one paper each, but it seems the error is the same.

setwd(dir = "D:/testrstudio")

load.lib <- c("xml2", "XML", "stringr", "stringdist", "stringi","lubridate", "dplyr", "tidyr", "ggplot2")

install.lib <- load.lib[!load.lib %in% installed.packages()] 

for (lib in install.lib) install.packages(lib,dependencies=TRUE) 

sapply(load.lib,require,character=TRUE) 
#> Le chargement a nécessité le package : xml2
#> Le chargement a nécessité le package : XML
#> Le chargement a nécessité le package : stringr
#> Le chargement a nécessité le package : stringdist
#> Le chargement a nécessité le package : stringi
#> Le chargement a nécessité le package : lubridate
#> 
#> Attachement du package : 'lubridate'
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     date, intersect, setdiff, union
#> Le chargement a nécessité le package : dplyr
#> 
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#> 
#>     filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Le chargement a nécessité le package : tidyr
#> 
#> Attachement du package : 'tidyr'
#> L'objet suivant est masqué depuis 'package:stringdist':
#> 
#>     extract
#> Le chargement a nécessité le package : ggplot2
#>       xml2        XML    stringr stringdist    stringi  lubridate      dplyr 
#>       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
#>      tidyr    ggplot2 
#>       TRUE       TRUE

plength <- function(x){
  xt <-substitute(x)
  print(paste(xt,length(x)))
}

LIRE <- function(html) {
  
  doc <- htmlParse(html) 
  
  articles <- getNodeSet(doc, "//article") 
  
  journal <- sapply(articles, function(art) {
    journ <- xpathSApply(art, "./header/div[1]/span/text()", xmlValue)
    journ[[1]]
  })
  
  
  
  auteur <- sapply(articles, function(art) { 
    aut <- xpathSApply(art, "./header/div[@class='docAuthors']/text()", xmlValue)
    aut <- aut[[1]]
    if (is.null(aut)) aut <- NA
    aut
  })
  
  
  titre <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, "./header/div[@class='titreArticle']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    
    str_trim(tmp)
  })
  
  
  date <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='publiC-lblNodoc']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- substr(tmp, 6, 13)
    tmp
  })
  date <- as.Date(date, "%Y%m%d") 
  
  
  texte <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='DocText clearfix']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    str_trim(tmp)
  })
  
  plength(journal)
  plength(titre)
  plength(date)
  plength(auteur)
  plength(texte)
  print("__end__")
  txt <- data.frame(Journal = journal,
                    Titre = titre,
                    Date = date,
                    Auteur = auteur,
                    Texte = texte)  
  txt <- subset(txt, !is.na(Journal) & !is.na(Titre))
  
  txt
  
}

lire_dossier <- function(chemin) {
  
  list<-list.files(chemin, pattern= ".HTML", full.names=TRUE, recursive=TRUE)
  
  l <- lapply(list, function(file) {
    print(file)
    LIRE(html=file)
  })
  bind_rows(l)
  
}

test <- lire_dossier("D:/testrstudio/datas")
#> [1] "D:/testrstudio/datas/l1.HTML"
#> [1] "journal 1"
#> [1] "titre 1"
#> [1] "date 1"
#> [1] "auteur 1"
#> [1] "texte 1"
#> [1] "__end__"
#> Error in data.frame(Journal = journal, Titre = titre, Date = date, Auteur = auteur, : les arguments impliquent des nombres de lignes différents : 0, 1
write.csv2(test, file="test1.csv", row.names = FALSE)
#> Error in is.data.frame(x): objet 'test' introuvable

Created on 2023-03-07 with reprex v2.0.2

nirgrahamuk · March 7, 2023, 10:37pm

This is extremely surprising, because the output claims that each entry going into the data.frame has length one, yet data.frame fails because the arguments imply differing numbers of rows.

To try to understand this, let's keep your earlier setup and tweak it slightly: change data.frame to a list, and use dput to print a reproducible representation of that list as a further step.

Edit LIRE:

...
  plength(auteur)
  plength(texte)
  print("__end__")
  tlist <- list(Journal = journal,
                Titre = titre,
                Date = date,
                Auteur = auteur,
                Texte = texte)
  dput(tlist)
  txt <- as.data.frame(tlist)
  txt <- subset(txt, !is.na(Journal) & !is.na(Titre))
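
As a hedged illustration of why dput can reveal more than length, here is a toy list (values invented, not taken from the real files) in which one element is itself a list holding a single NULL. This is only one way such a mismatch can arise. Its length is 1, but data.frame treats it as contributing 0 rows:

```r
# Toy stand-in for tlist; the values are invented for illustration.
tlist <- list(Journal = list(NULL),
              Titre   = "Some title")

# length() counts list elements, so both entries report 1.
print(lengths(tlist))

# dput() prints the full structure, which makes the list(NULL) visible.
dput(tlist)

# as.data.frame() fails here: list(NULL) contributes 0 rows, the title 1.
# The error message is captured as a string instead of stopping the script.
result <- tryCatch(as.data.frame(tlist),
                   error = function(e) conditionMessage(e))
print(result)
```

The length check and the data.frame check therefore measure different things: one counts elements, the other counts rows.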

whyisitsocomplicated · March 8, 2023, 11:38pm

I feel like it's already working better! The error message is still there, but now the program is recognising what the paper's title is and who is the author (even if the information is not available for the one I downloaded).

setwd(dir = "D:/testrstudio")

load.lib <- c("xml2", "XML", "stringr", "stringdist", "stringi","lubridate", "dplyr", "tidyr", "ggplot2")

install.lib <- load.lib[!load.lib %in% installed.packages()] 

for (lib in install.lib) install.packages(lib,dependencies=TRUE) 

sapply(load.lib,require,character=TRUE) 
#> Le chargement a nécessité le package : xml2
#> Le chargement a nécessité le package : XML
#> Le chargement a nécessité le package : stringr
#> Le chargement a nécessité le package : stringdist
#> Le chargement a nécessité le package : stringi
#> Le chargement a nécessité le package : lubridate
#> 
#> Attachement du package : 'lubridate'
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     date, intersect, setdiff, union
#> Le chargement a nécessité le package : dplyr
#> 
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#> 
#>     filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Le chargement a nécessité le package : tidyr
#> 
#> Attachement du package : 'tidyr'
#> L'objet suivant est masqué depuis 'package:stringdist':
#> 
#>     extract
#> Le chargement a nécessité le package : ggplot2
#>       xml2        XML    stringr stringdist    stringi  lubridate      dplyr 
#>       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
#>      tidyr    ggplot2 
#>       TRUE       TRUE

plength <- function(x){
  xt <-substitute(x)
  print(paste(xt,length(x)))
}

LIRE <- function(html) {
  
  doc <- htmlParse(html) 
  
  articles <- getNodeSet(doc, "//article") 
  
  journal <- sapply(articles, function(art) {
    journ <- xpathSApply(art, "./header/div[1]/span/text()", xmlValue)
    journ[[1]]
  })
 
  
  
  auteur <- sapply(articles, function(art) { 
    aut <- xpathSApply(art, "./header/div[@class='docAuthors']/text()", xmlValue)
    aut <- aut[[1]]
    if (is.null(aut)) aut <- NA
    aut
  })

  
  titre <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, "./header/div[@class='titreArticle']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    
    str_trim(tmp)
  })
 
  
  date <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='publiC-lblNodoc']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- substr(tmp, 6, 13)
    tmp
  })
  date <- as.Date(date, "%Y%m%d") 
 
  
  texte <- sapply(articles, function(art) { 
    tmp <- xpathSApply(art, ".//div[@class='DocText clearfix']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    str_trim(tmp)
  })
  
  plength(journal)
  plength(titre)
  plength(date)
  plength(auteur)
  plength(texte)
  print("__end__")
 tlist<- list(Journal = journal,
                    Titre = titre,
                    Date = date,
                    Auteur = auteur,
                    Texte = texte)  
dput(tlist)
txt <- as.data.frame(tlist)
  txt <- subset(txt, !is.na(Journal) & !is.na(Titre))
  
}

lire_dossier <- function(chemin) {

  list<-list.files(chemin, pattern= ".HTML", full.names=TRUE, recursive=TRUE)

  l <- lapply(list, function(file) {
    print(file)
    LIRE(html=file)
  })
  bind_rows(l)
  
}

test <- lire_dossier("D:/testrstudio/datas")
#> [1] "D:/testrstudio/datas/l1.HTML"
#> [1] "journal 1"
#> [1] "titre 1"
#> [1] "date 1"
#> [1] "auteur 1"
#> [1] "texte 1"
#> [1] "__end__"
#> list(Journal = list(NULL), Titre = "Les dealers agissaient sur les réseaux sociaux", 
#>     Date = structure(18940, class = "Date"), Auteur = NA, Texte = "Clément et Mathieu, 25 ans, étaient côte à côte, lundi 8novembre, à la barre du tribunal de Bayonne. Les deux jeunes hommes animaient un petit réseau de drogue, comme il y en a beaucoup au Pays basque. La nouveauté, c’est la méthode. Avec Snapchat, le réseau social préféré des jeunes, pas besoin de stock, ni de clientèle, pour faire de l’argent facile, pourrait vanter la publicité.  Dans son appartement de la rue moulin de Sault, à Anglet, Clément opérait sous le pseudo «C cool, c cool». Il avait développé sa «supérette locale», selon les mots du procureur, Marc Mariée. Des acheteurs montaient et descendaient toutes les dix minutes. Lorsque le jeune homme est dénoncé, le 5octobre 2021, la BAC se met en planque, l’après-midi, dans sa rue.   À la sortie de l’appartement, une jeune femme est interpellée avec cinq cachets d’ecstasy. Elle explique aux policiers qu’elle suit «C cool, c cool» sur Snapchat. Il change régulièrement de pseudo, mais elle ne le perd pas de vue. Dans le coin, en matière de drogue dure, selon elle, personne n’a autant de choix. Ce mardi 5octobre, elle le contacte à 15h30. Il lui envoie l’adresse. Elle monte au deuxième étage, lâche 50euros et redescend dans la rue, en moins d’une minute top chrono.   La police perquisitionne l’appartement de «C cool, c cool» et découvre «la supérette locale». 206 grammes de cannabis. 17 grammes de cocaïne. 85 cachets d’ecstasy et 10 grammes de MDMA. En garde à vue, Clément remercie les policiers «d’avoir arrêté tout ça». Son marché avait pris trop d’ampleur, reconnaît-il. «C cool, c cool» commandait 2650euros de produits, deux fois par mois, à son fournisseur, Mathieu, alias «Ara beleck» sur Snapchat. Pour ferrer le poisson, le procureur autorise «un coup d’achat» avec le pseudo «C cool, c cool».   «Il te faut quoi?». «10 grammes de cocaïne». 
«Quand?», «Maintenant.» Au pied de la tour de Balichon à Bayonne, où ilvit chez son père, «Ara beleck» est interpellé à son tour. Mathieu a le pochon, et 1120euros en espèces, dans la poche. «Je l’ai vu deux fois, en soirée. Je ne le connais pas.Il m’a piégé», grince-t-il. Chez lui, 50 grammes de haschisch, 12 grammes de cocaïne, et de la MDMA. Pourquoi un fusil Beretta? «Le produit m’a rendu parano». Le brassard police et la cagoule? «Pour me déguiser à carnaval».   Clément, le revendeur, a un casier vierge. Mathieu, le fournisseur, a déjà été condamné six fois. Le premier a été placé sous contrôle judiciaire. Il est retourné chez ses parents, à Artiguelouve, dans le Béarn, et justifie d’un emploi en usine. Le second, placé en détention provisoire, depuis le mois d’octobre, a déjà reçu un avertissement pour les stupéfiants, en 2016.   Le tribunal a suivi les réquisitions, en condamnant, le vendeur et le fournisseur à deux ans d’emprisonnement, dont dix-huit mois ferme. Ils subiront la même peine, à la différence que «C cool, c cool» n’ira pas directement en prison à l’issue de l’audience alors qu’«Ara beleck» a été maintenu en détention.")
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : les arguments impliquent des nombres de lignes différents : 0, 1
write.csv2(test, file="test1.csv", row.names = FALSE)
#> Error in is.data.frame(x): objet 'test' introuvable

Created on 2023-03-09 with reprex v2.0.2

nirgrahamuk · March 9, 2023, 10:01am

whyisitsocomplicated:

  print("__end__")
 tlist<- list(Journal = journal,
                    Titre = titre,
                    Date = date,
                    Auteur = auteur,
                    Texte = texte)

Before building your list, add a step that handles the case where journal comes back as a list of NULL (though this is not guaranteed to help you, as I don't know what journal 'should be'):

  print("__end__")
  journal <- sapply(journal, \(x) if (is.null(x)) NA_character_ else x)
  tlist <- list(Journal = journal,
                Titre = titre,
                Date = date,
                Auteur = auteur,
                Texte = texte)
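
A self-contained sketch of this step on toy data (values invented for illustration) follows. It uses vapply instead of sapply, which is an assumption of this sketch and not what the reply above uses: vapply guarantees a character vector, but it stops with an error if an entry is not a single string.

```r
# Toy inputs: journal has the list(NULL) shape shown in the dput output above.
journal <- list(NULL)
titre   <- "Some title"

# Replace each NULL with a missing string, giving a plain character vector.
journal <- vapply(journal,
                  function(x) if (is.null(x)) NA_character_ else x,
                  character(1))

tlist <- list(Journal = journal, Titre = titre)
txt <- as.data.frame(tlist)   # no longer fails: both columns have one row
print(nrow(txt))

# The filter at the end of LIRE drops rows whose Journal is NA.
kept <- subset(txt, !is.na(Journal) & !is.na(Titre))
print(nrow(kept))
```

So the data.frame is built, but a row whose journal name was not found is still removed by the subset() line. Recovering the journal name itself would mean checking the XPath expression against the HTML, which this sketch does not attempt.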
