Hello!
I keep getting this error when I run my program. I have seen many people struggling with the same problem, but I am a beginner with RStudio and not very comfortable with computers, so I don't know how to solve it by myself.
The program I use comes from this tutorial (sorry, it's in French). It is supposed to properly format a corpus of newspaper articles (in HTML format) so that a lexicometric analysis can be performed with Iramuteq.
Here is the whole part that I'm struggling with (I'm sorry if it's long; I have no idea what I should include or leave out).
setwd(dir = "D:/testrstudio")
load.lib <- c("xml2", "XML", "stringr", "stringdist", "stringi","lubridate", "dplyr", "tidyr", "ggplot2")
install.lib <- load.lib[!load.lib %in% installed.packages()]
for (lib in install.lib) install.packages(lib,dependencies=TRUE)
sapply(load.lib,require,character=TRUE)
#> Le chargement a nécessité le package : xml2
#> Le chargement a nécessité le package : XML
#> Le chargement a nécessité le package : stringr
#> Le chargement a nécessité le package : stringdist
#> Le chargement a nécessité le package : stringi
#> Le chargement a nécessité le package : lubridate
#>
#> Attachement du package : 'lubridate'
#> Les objets suivants sont masqués depuis 'package:base':
#>
#> date, intersect, setdiff, union
#> Le chargement a nécessité le package : dplyr
#>
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#>
#> filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Le chargement a nécessité le package : tidyr
#>
#> Attachement du package : 'tidyr'
#> L'objet suivant est masqué depuis 'package:stringdist':
#>
#> extract
#> Le chargement a nécessité le package : ggplot2
#>       xml2        XML    stringr stringdist    stringi  lubridate      dplyr
#>       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
#>      tidyr    ggplot2
#>       TRUE       TRUE
LIRE <- function(html) {
  doc <- htmlParse(html)
  articles <- getNodeSet(doc, "//article")
  journal <- sapply(articles, function(art) {
    journ <- xpathSApply(art, "./header/div[1]/span/text()", xmlValue)
    journ[[1]]
  })
  auteur <- sapply(articles, function(art) {
    aut <- xpathSApply(art, "./header/div[@class='docAuthors']/text()", xmlValue)
    aut <- aut[[1]]
    if (is.null(aut)) aut <- NA
    aut
  })
  titre <- sapply(articles, function(art) {
    tmp <- xpathSApply(art, "./header/div[@class='titreArticle']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    str_trim(tmp)
  })
  date <- sapply(articles, function(art) {
    tmp <- xpathSApply(art, ".//div[@class='publiC-lblNodoc']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- substr(tmp, 6, 13)
    tmp
  })
  date <- as.Date(date, "%Y%m%d")
  texte <- sapply(articles, function(art) {
    tmp <- xpathSApply(art, ".//div[@class='DocText clearfix']//text()", xmlValue)
    if (is.null(tmp)) tmp <- NA
    tmp <- paste(tmp, collapse = "")
    str_trim(tmp)
  })
  txt <- data.frame(Journal = journal,
                    Titre = titre,
                    Date = date,
                    Auteur = auteur,
                    Texte = texte)
  txt <- subset(txt, !is.na(Journal) & !is.na(Titre))
  txt
}
lire_dossier <- function(chemin) {
  list<-list.files(chemin, pattern= ".HTML", full.names=TRUE, recursive=TRUE)
  l <- lapply(list, function(file) {
    print(file)
    LIRE(html=file)
  })
  bind_rows(l)
}
test <- lire_dossier("D:/testrstudio/datas")
#> [1] "D:/testrstudio/datas/biblioeuropresse20230305152505.HTML"
#> Error in data.frame(Journal = journal, Titre = titre, Date = date, Auteur = auteur, : les arguments impliquent des nombres de lignes différents : 0, 50
write.csv2(test, file="test1.csv", row.names = FALSE)
#> Error in is.data.frame(x): objet 'test' introuvable
Created on 2023-03-06 with reprex v2.0.2
The error message is in French; in English it reads "arguments imply differing number of rows: 0, 50".
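For what it's worth, the same kind of message seems to be reproducible outside of my program with a tiny example. This is only a sketch with made-up values (the contents of `journal` and `titre` here are hypothetical, not taken from my corpus), and I am not sure it matches what actually happens inside LIRE:

```r
# Hypothetical values, only to illustrate the message; not my real data.
journal <- character(0)        # an empty column (length 0)
titre   <- rep("un titre", 50) # a column with 50 entries

# data.frame() refuses columns whose lengths cannot be recycled to match;
# tryCatch() captures the message so that the script keeps running.
res <- tryCatch(
  data.frame(Journal = journal, Titre = titre),
  error = function(e) conditionMessage(e)
)
print(res)
```

If I understand correctly, the "0, 50" at the end of my error would then mean that one of the columns built in LIRE is empty while another has 50 entries, but I don't know which one or why.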
I can't figure out how to attach the HTML file that I use with this program. If you need it, let me know and I'll try to find another way to send it to you!