For loop or other possible functions to repeat the frequency counting of the particular tags (features) in multiple XML texts

Hi, R community! Thanks in advance for considering my enquiry.

I want to write a script that does the following:

  1. Input files: 1115 XML files saved in my directory (tagged for 39 features)
  2. Operation: count each of the 39 features in each XML file
  3. Output: a frequency table looking like the one below

[image: example frequency table]

So far, I have the script below, which uses the package ‘xml2’. It finds a given feature and tells me how many times it appears. For example, in the script below, I wanted to count the frequency of ‘tag1’ in the XML file named ‘1_1’. The ‘tag1’ feature appears within the dependency sections of the XML files.

setwd('~/my directory/')

install.packages('xml2')
library(xml2)

text <- read_xml(x = '1_1.xml')

# find dependency sections
dependencies <- xml_find_all(text, './/dependencies')

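# keep only the collapsed-dependencies section
collapsed <- dependencies[grep('collapsed-dependencies', dependencies)]

# find the <dep ...> tags
deps <- xml_find_all(collapsed, './/dep')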

# find tags for the features you are interested in, e.g.: ‘tag1’
tag1 <- deps[grep('type="tag1"', deps)]
N_tag1 <- length(tag1)

The code above needs improvement in two respects.
First, it does not repeat the job for each of the 39 tags (e.g., ‘tag1’) in each of the 1,115 XML texts.
So the code needs to be rewritten so that R counts each of the 39 tags in each of the 1,115 XML texts in one go, or at least with far fewer scripts than one copy of the script above per tag and file.
I think a ‘for’ loop might do the job, but I don’t know how to rewrite the script using one.

Second, the frequencies of each tag (e.g., ‘tag1’) need to be assembled into a frequency table as the output. I have no idea which function can do that once the tags have been counted in all the texts.

Any suggestions will be much appreciated. Thanks for reading this question :slight_smile:

See the FAQ: How to do a minimal reproducible example reprex for beginners. Questions that don't require creating a dataset to test on usually receive more and more specific answers.


Let's create frequency table first:

freqTable <- data.frame(matrix(NA, ncol=40, nrow=0))
a <- c("Filename")
for(i in 1:39) {
 a <- append(a, paste0("tag",i))
}
colnames(freqTable) <- a
freqTable[, 2:40] <- sapply(freqTable[, 2:40], as.numeric)

Let's list all files in folder:

myFiles <- list.files(path = "~/my directory", full.names = FALSE, pattern = "\\.xml$")
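
One detail that may be worth checking at this step: `pattern` is treated as a regular expression, so an unescaped dot matches any character. The sketch below illustrates the difference in a throwaway temporary directory; the directory and file names are made up for illustration and are not from the original question.

```r
# Throwaway directory with made-up file names, for illustration only
demo_dir <- file.path(tempdir(), "xml_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("1_1.xml", "1_2.xml", "notes_xml.txt")))

# "." matches any character in a regex, so this also picks up the .txt file
loose <- list.files(demo_dir, pattern = ".xml")

# Escaping the dot and anchoring with "$" keeps only names ending in ".xml"
strict <- list.files(demo_dir, pattern = "\\.xml$")

# full.names = TRUE returns paths that can be opened from any working
# directory; basename() recovers the short name for the table
paths <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
short_names <- basename(paths)
```

With `full.names = FALSE`, reading the files later only works if the working directory is the folder that holds them, as set by `setwd()` in the question.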

Let's create a loop across the files:

for (i in seq_along(myFiles)) {
 print(myFiles[i])
 # write the filename to the 1st column of freqTable
 freqTable[i,1] <- myFiles[i]

 # here you do whatever you like with the file
 text <- read_xml(x = myFiles[i])
 # find dependency sections
 dependencies <- xml_find_all(text, './/dependencies')
 # keep only the collapsed-dependencies section
 collapsed <- dependencies[grep('collapsed-dependencies', dependencies)]
 # find the <dep ...> tags
 deps <- xml_find_all(collapsed, './/dep')
 # find tags for the features you are interested in, e.g. 'tag1'
 # as there are 39 such tags, let's create another loop
 for(j in 1:39) {
  # build the pattern with paste0() outside the quotes, using j (not i)
  MySuperTag <- deps[grep(paste0('type="tag', j, '"'), deps, fixed = TRUE)]
  # and update the value of the corresponding cell in freqTable
  freqTable[i,j+1] <- length(MySuperTag)
 }
}

Not tested, but you get the idea. There have to be two loops: one for the files and a second one for the tags.
Instead of updating single cells in the frequency table, you can collect the counts in a tibble and rbind them, or use tibble::add_row(), whatever.
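
The "collect and rbind" alternative could look roughly like the sketch below. It is only an illustration under stated assumptions: the two inline XML strings are made-up stand-ins for real files, `count_tags` is a hypothetical helper name, and only three of the 39 tags are listed. It also swaps `grep()` for `xml_attr()`, which reads the `type` attribute directly.

```r
library(xml2)

# Two tiny made-up documents standing in for the real files
docs <- list(
  "1_1.xml" = '<root><dependencies><dep type="tag2"/><dep type="tag1"/><dep type="tag2"/></dependencies></root>',
  "1_2.xml" = '<root><dependencies><dep type="tag1"/></dependencies></root>'
)
tags <- paste0("tag", 1:3)  # shortened to 3 tags for the example

# Hypothetical helper: count every tag in one document
count_tags <- function(xml_text, tags) {
  doc <- read_xml(xml_text)
  deps <- xml_find_all(doc, ".//dep")
  types <- xml_attr(deps, "type")
  # factor() with fixed levels keeps tags that never occur, as zero counts
  tab <- table(factor(types, levels = tags))
  setNames(as.vector(tab), names(tab))
}

# One named count vector per file, stacked into a matrix
rows <- lapply(docs, count_tags, tags = tags)
freq <- do.call(rbind, rows)

# File names become the first column, as in the table above
freqTable <- data.frame(Filename = rownames(freq), freq,
                        row.names = NULL, stringsAsFactors = FALSE)
```

For real files, the same helper could be applied to the paths returned by `list.files()`, since `read_xml()` accepts a file path as well as literal XML.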

Hope it helps,
Grzegorz


Hi Grzegorz,

Thank you so much for your kind help, which works beautifully.
I do have one more issue, though.
How can I list all 39 tag names rather than tag1, tag2, and so on?
Actually, the tags are like acomp, advmod, and so on. I just wrote them as tag1, tag2 for simplicity :sweat_smile:
It seems that the following bit of code needs to be changed, but I am making errors rewriting that bit.

paste0("tag",i)

My rewritten script is as below. Thank you so much in advance :smiley:

library(xml2)

tags <- read.csv('~/39 measures.csv')
tags <- stringi::stri_c(tags)
tags

freqTable <- data.frame(matrix(NA, ncol=40, nrow=0))

colnames(freqTable) <- tags

freqTable
freqTable[, 2:40] <- sapply(freqTable[,2:40],as.numeric)

myFiles <- list.files(path="my directory", full.names=FALSE, pattern=".xml")

for (i in 1:length(myFiles)) {
  print(myFiles[i])
  freqTable[i,1] <- myFiles[1]
  text <- read_xml(x=myFiles[i])
  dependencies <- xml_find_all(text, './/dependencies')
  collapsed <- dependencies[grep('collapsed-dependencies',dependencies)]
  deps <- xml_find_all(collapsed, './/dep')
  for(j in 1:39) {
    MySuperTag <- deps[grep('type="paste0(tags)"', deps)]
    freqTable[i,j+1] <- length(MySuperTag)
  }
}

Thanks, I will keep that in mind and definitely try that next time!


Around here you can search for all attributes:

library(purrr)
deps %>% 
  purrr::map(~names(xml_attrs(.))) %>%
  unlist() %>% 
  unique()

This will create a vector of attr names, which you can then iterate over.
Or better, create it at the beginning, assign it to a variable, and iterate through it.
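
As a rough sketch of iterating over a vector of real labels: `names(xml_attrs(.))` returns attribute names (such as `type`), while labels like acomp and advmod are attribute values, so the example below assumes they can be read with `xml_attr(deps, "type")`. The XML snippet is made up for illustration.

```r
library(xml2)

# Made-up snippet standing in for one parsed file
doc <- read_xml('<root><dependencies><dep type="advmod"/><dep type="acomp"/><dep type="advmod"/></dependencies></root>')
deps <- xml_find_all(doc, ".//dep")

# Values of the type attribute, one per <dep> node
types <- xml_attr(deps, "type")

# Distinct labels found in this file; the full list of 39 could instead
# be stored once in a character vector at the top of the script
tags <- sort(unique(types))

# Iterate over the labels by name rather than by paste0("tag", j)
counts <- integer(length(tags))
names(counts) <- tags
for (tag in tags) {
  counts[tag] <- sum(types == tag)
}
counts
```

If the grep-based approach from earlier in the thread is kept, the same idea applies: the pattern would be built as `paste0('type="', tags[j], '"')`, with `paste0()` outside the quotes and `tags` a plain character vector rather than a data frame.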

Regards,
Grzegorz


Thank you so much! Your demonstration is super clear. Hope you have a great day :smiley:

best regards,
Sangeun
