Problem creating a script to calculate/summarise certain values from a csv file

ledgreve · May 20, 2023, 10:42am

Hello everyone,

I have a bunch of csv-files (one per corpus) containing annotated data in multiple columns. I wish to look at the results of an aspect based sentiment analysis. Unfortunately it's a lot of different files and outputs to calculate manually. It would be very helpful if I could use a script to automatically do the calculations instead of using the filter.

The first type of document simply contains the annotated aspects: the name of the annotated file ("Document"), the annotated target wordgroup ("Aspect"), and the aspect category ("lore"). Here I would like to know how many of each category there are.

The second type of document contains my annotated data combining aspect and seniment. The only relevant columns are "Aspect category", "Polarity lore" and "Irony lore". The first column contains the category labels which have been assigned to the aspects, such as "Text". The second column assigns a sentiment label to the aspect, namely "Positive" or "Negative" (sometimes there is failed label). The last column tells you wether the sentiment is meant ironically ("true" or "false"). I wish to know how often each aspect category is mentioned in combination with either positive or negative sentiment and how many of these positive or negative sentiments were meant ironically.

Thirdly, I have files containing automatically annotated data, with several columns. The only relevant ones are a column for the aspect category ("Category_Fine") and for the pertaining sentiment ("Polarity").

I wanted to try my hand at the second one first since it seemed the most complicated one. My coding knowledge, however, is very limited. My colleague recommended creating a script via chatGPT . No true surprizes there, but it did not work. Of course, it couldn't hurt to try. I would be very grateful if anyone would be willing and have the time to help.

I will included the failing script below, as it might be of some help? I also have an example csv-file containing only some 25 lines to test it on, but am not sure how I can include it here, so I'll just add the lines at the end, I guess?

library(tidyverse)

# Set the path to your CSV files directory
csv_directory <- "path/to/csv/files"

# Get a list of all CSV files in the directory
csv_files <- list.files(csv_directory, pattern = "\\.csv$", full.names = TRUE)

# Initialize an empty data frame to store the results
results <- data.frame(Category = character(), Positive = integer(), Negative = integer(), Irony_True = integer(), Irony_False = integer(), stringsAsFactors = FALSE)

# Loop through each CSV file
for (csv_file in csv_files) {
  # Read the CSV file into a data frame
  df <- read_csv(csv_file)
  
  # Filter relevant columns: Aspect category, Polarity lore, and Irony lore
  df_filtered <- df %>% 
    select(`Aspect category`, `Polarity lore`, `Irony lore`)
  
  # Calculate the counts for each category, sentiment, and irony combination
  counts <- df_filtered %>% 
    group_by(`Aspect category`, `Polarity lore`, `Irony lore`) %>% 
    summarise(count = n()) %>% 
    ungroup()
  
  # Add the counts to the results data frame
  results <- bind_rows(results, counts)
}

# Write the results to a new CSV file
write_csv(results, "output_file.csv")

Thank you in advance for your time, patience and help!

The example data (tab delimiter):

Document	Aspect	Aspect category	Polarity Trigger	Span lore	Polarity lore	Irony lore
Testfile100_xmi	Jan Wagner	CONTENDER_General	Sprachkünstler	[(31, 45)]	Positive	false
Testfile101_xmi	Gewinner des #Büchnerpreis	META_Winner_Award-Ceremony	Gratulation!	[(129, 141)]	Positive	false
Testfile101_xmi	er	CONTENDER_General	Gratulation!	[(129, 141)]	Positive	false
Testfile103_xmi	Scheißgedicht	TEXT_General	Scheißgedicht	[(19, 32)]	Negative	true
Testfile106_xmi	Lesung	READING_General	konzertanten	[(99, 111)]	Positive	false
Testfile109_xmi	Alternative Büchnerpreis	META_Literature_Literary-Prizes	klotzen!	[(18, 26)]	Positive	false
Testfile109_xmi	Wondratschek	ALLO-REFERENCES_TEXT_Other-Author	Alternative Büchnerpreis sollte jedes Jahr an Wondratschek gehen	[(31, 95)]	Positive	false
Testfile10_xmi	Büchnerpreisträger	META_Winner_Award-Ceremony	ganz besonderen	[(12, 27)]	Positive	false
Testfile10_xmi	Büchnerpreisträger	CONTENDER_General	ganz besonderen	[(12, 27)]	Positive	false
Testfile110_xmi	LITERARISCHEN WELT	META_Literature_Literary-Prizes	geht auch nur in der LITERARISCHEN WELT	[(81, 120)]	Negative	false
Testfile119_xmi	Jan Wagner	CONTENDER_General	the Billy Collins of contemporary German poetry	[(61, 108)]	Positive	false
Testfile121_xmi	Gedicht über den Giersch	TEXT_General-Content_Plot	sensationelles	[(14, 28)]	Positive	false
Testfile122_xmi	Büchnerpreis-Jury zeichnet	JURY_Discussion_Valuation	zeichnet im Schnitt weiße Männer um 61 aus, die bei den vier großen Verlagen erscheinen	[(24, 111)]	Negative	false
Testfile122_xmi	weiße Männer	CONTENDER_Gender	weiße Männer um 61 	[(24, 111)]	Negative	false
Testfile122_xmi	weiße Männer um 61	CONTENDER_Age	weiße Männer um 61 	[(24, 111)]	Negative	false
Testfile124_xmi	„Gold. Revue“	TEXT_General	des Monats	[(81, 91)]	Positive	false
Testfile124_xmi	Hörspiel	TEXT_Form	des Monats	[(81, 91)]	Positive	false
Testfile125_xmi	Gold. Revue	TEXT_General	des Monats	[(12, 22)]	Positive	false
Testfile128_xmi	büchnerpreis	META_Main-Event	Gänsehaut pur!,großes	[(28, 34), (2, 16)]	Positive	false
Testfile129_xmi	Büchnerpreisträger	CONTENDER_General	herausragender	[(55, 69)]	Positive	false
Testfile129_xmi	Büchnerpreisträger	META_Winner_Award-Ceremony	herausragender	[(55, 69)]	Positive	false
Testfile132_xmi	Büchnerpreis	META_Main-Event	Tatenvolumen!	[(2, 15)]	Positive	false
Testfile135_xmi	Er	CONTENDER_General	meisterhaft	[(18, 29)]	Positive	false
Testfile139_xmi	Gute Wahl! #GeorgBüchnerPreis	META_Winner_Award-Ceremony	Gute Wahl!	[(2, 12)]	Positive	true

technocrat · May 20, 2023, 10:12pm

Have you considered doing this in {tidytext} with the help of its online text? It's straightforward to make a corpus out of the csv and perform sentiment analysis and other NLP. Come back if that's no help, though.

ledgreve · May 21, 2023, 8:34am

Thank you very much for your reply! Like I said, my coding knowledge is extremely limited. When I look at a existing script I can usually get an idea what it's about, but that's about the extent of it; It's also the reason why I tried to rely on ChatGPT for the script creation.

I had not heard of tidytext yet. I've skimmed through the chapters, and while it seems like a very useful way to perform a sentiment analysis, at the moment I simply need a script to calculate the results of one which has already been performed by counting all instances.

EDIT: small change to sentence for clarification

technocrat · May 22, 2023, 9:02am

That's already the hardest part of f(x) = y, clearly stating the matter to be determined. Everything in R is an object, including operators, such as + - * / and all functions f.

x is what is to hand and y is what is to be derived with f.

To start, x is a set of csv files, and each is to be converted to a data frame for further treatment, selection of the three variables of interest. Bringing us to

What does the current code do, using the sample data, shown by a a reprex (see the FAQ).

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# variable names changed to remove need for back quotes
d <- data.frame(Document = c(
  "Testfile100_xmi", "Testfile101_xmi",
  "Testfile101_xmi", "Testfile103_xmi", "Testfile106_xmi", "Testfile109_xmi",
  "Testfile109_xmi", "Testfile10_xmi", "Testfile10_xmi", "Testfile110_xmi",
  "Testfile119_xmi", "Testfile121_xmi", "Testfile122_xmi", "Testfile122_xmi",
  "Testfile122_xmi", "Testfile124_xmi", "Testfile124_xmi", "Testfile125_xmi",
  "Testfile128_xmi", "Testfile129_xmi", "Testfile129_xmi", "Testfile132_xmi",
  "Testfile135_xmi", "Testfile139_xmi"
), Aspect = c(
  "Jan Wagner",
  "Gewinner des #Büchnerpreis", "er", "Scheißgedicht", "Lesung",
  "Alternative Büchnerpreis", "Wondratschek", "Büchnerpreisträger",
  "Büchnerpreisträger", "LITERARISCHEN WELT", "Jan Wagner", "Gedicht über den Giersch",
  "Büchnerpreis-Jury zeichnet", "weiße Männer", "weiße Männer um 61",
  "„Gold. Revue“", "Hörspiel", "Gold. Revue", "büchnerpreis",
  "Büchnerpreisträger", "Büchnerpreisträger", "Büchnerpreis",
  "Er", "Gute Wahl! #GeorgBüchnerPreis"
), Aspect_category = c(
  "CONTENDER_General",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "TEXT_General",
  "READING_General", "META_Literature_Literary-Prizes", "ALLO-REFERENCES_TEXT_Other-Author",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "META_Literature_Literary-Prizes",
  "CONTENDER_General", "TEXT_General-Content_Plot", "JURY_Discussion_Valuation",
  "CONTENDER_Gender", "CONTENDER_Age", "TEXT_General", "TEXT_Form",
  "TEXT_General", "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony",
  "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony"
), Polarity_Trigger = c(
  "Sprachkünstler", "Gratulation!", "Gratulation!",
  "Scheißgedicht", "konzertanten", "klotzen!", "Alternative Büchnerpreis sollte jedes Jahr an Wondratschek gehen",
  "ganz besonderen", "ganz besonderen", "geht auch nur in der LITERARISCHEN WELT",
  "the Billy Collins of contemporary German poetry", "sensationelles",
  "zeichnet im Schnitt weiße Männer um 61 aus, die bei den vier großen Verlagen erscheinen",
  "weiße Männer um 61 ", "weiße Männer um 61 ", "des Monats",
  "des Monats", "des Monats", "Gänsehaut pur!,großes", "herausragender",
  "herausragender", "Tatenvolumen!", "meisterhaft", "Gute Wahl!"
), Span_lore = c(
  "[(31, 45)]", "[(129, 141)]", "[(129, 141)]",
  "[(19, 32)]", "[(99, 111)]", "[(18, 26)]", "[(31, 95)]", "[(12, 27)]",
  "[(12, 27)]", "[(81, 120)]", "[(61, 108)]", "[(14, 28)]", "[(24, 111)]",
  "[(24, 111)]", "[(24, 111)]", "[(81, 91)]", "[(81, 91)]", "[(12, 22)]",
  "[(28, 34), (2, 16)]", "[(55, 69)]", "[(55, 69)]", "[(2, 15)]",
  "[(18, 29)]", "[(2, 12)]"
), Polarity_lore = c(
  "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Negative", "Negative",
  "Negative", "Positive", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Positive", "Positive", "Positive"
), Irony_lore = c(
  "false",
  "false", "false", "true", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "true"
))

# Filter relevant columns: Aspect_category, Polarity_lore, and Irony_lore
df_filtered <- d |>
  select(Aspect_category, Polarity_lore, Irony_lore)

# Calculate the counts for each category, sentiment, and irony combination
counts <- df_filtered |> 
  group_by(Aspect_category, Polarity_lore, Irony_lore) |>
  summarise(count = n()) |>
  ungroup()
#> `summarise()` has grouped output by 'Aspect_category', 'Polarity_lore'. You can
#> override using the `.groups` argument.
counts
#> # A tibble: 15 × 4
#>    Aspect_category                   Polarity_lore Irony_lore count
#>    <chr>                             <chr>         <chr>      <int>
#>  1 ALLO-REFERENCES_TEXT_Other-Author Positive      false          1
#>  2 CONTENDER_Age                     Negative      false          1
#>  3 CONTENDER_Gender                  Negative      false          1
#>  4 CONTENDER_General                 Positive      false          6
#>  5 JURY_Discussion_Valuation         Negative      false          1
#>  6 META_Literature_Literary-Prizes   Negative      false          1
#>  7 META_Literature_Literary-Prizes   Positive      false          1
#>  8 META_Main-Event                   Positive      false          2
#>  9 META_Winner_Award-Ceremony        Positive      false          3
#> 10 META_Winner_Award-Ceremony        Positive      true           1
#> 11 READING_General                   Positive      false          1
#> 12 TEXT_Form                         Positive      false          1
#> 13 TEXT_General                      Negative      true           1
#> 14 TEXT_General                      Positive      false          2
#> 15 TEXT_General-Content_Plot         Positive      false          1

^{Created on 2023-05-21 with reprex v2.0.2}

The result is arguably what we were looking for, but the arrangement isn't very useful, especially since several of these are going to be pasted together. When this happens, it usually help to design y first, and then work backwards. To begin let's look at the sample space—what is the set of possible outcomes?

There are four possibilities for each aspect with respect to sentiment and irony

POSITIVE_TRUE
POSITIVE_FALSE
NEGATIVE_TRUE
NEGATIVE_FALSE

To produce this table

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(gt)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

d <- data.frame(Document = c(
  "Testfile100_xmi", "Testfile101_xmi",
  "Testfile101_xmi", "Testfile103_xmi", "Testfile106_xmi", "Testfile109_xmi",
  "Testfile109_xmi", "Testfile10_xmi", "Testfile10_xmi", "Testfile110_xmi",
  "Testfile119_xmi", "Testfile121_xmi", "Testfile122_xmi", "Testfile122_xmi",
  "Testfile122_xmi", "Testfile124_xmi", "Testfile124_xmi", "Testfile125_xmi",
  "Testfile128_xmi", "Testfile129_xmi", "Testfile129_xmi", "Testfile132_xmi",
  "Testfile135_xmi", "Testfile139_xmi"
), Aspect = c(
  "Jan Wagner",
  "Gewinner des #Büchnerpreis", "er", "Scheißgedicht", "Lesung",
  "Alternative Büchnerpreis", "Wondratschek", "Büchnerpreisträger",
  "Büchnerpreisträger", "LITERARISCHEN WELT", "Jan Wagner", "Gedicht über den Giersch",
  "Büchnerpreis-Jury zeichnet", "weiße Männer", "weiße Männer um 61",
  "„Gold. Revue“", "Hörspiel", "Gold. Revue", "büchnerpreis",
  "Büchnerpreisträger", "Büchnerpreisträger", "Büchnerpreis",
  "Er", "Gute Wahl! #GeorgBüchnerPreis"
), Aspect_category = c(
  "CONTENDER_General",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "TEXT_General",
  "READING_General", "META_Literature_Literary-Prizes", "ALLO-REFERENCES_TEXT_Other-Author",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "META_Literature_Literary-Prizes",
  "CONTENDER_General", "TEXT_General-Content_Plot", "JURY_Discussion_Valuation",
  "CONTENDER_Gender", "CONTENDER_Age", "TEXT_General", "TEXT_Form",
  "TEXT_General", "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony",
  "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony"
), Polarity_Trigger = c(
  "Sprachkünstler", "Gratulation!", "Gratulation!",
  "Scheißgedicht", "konzertanten", "klotzen!", "Alternative Büchnerpreis sollte jedes Jahr an Wondratschek gehen",
  "ganz besonderen", "ganz besonderen", "geht auch nur in der LITERARISCHEN WELT",
  "the Billy Collins of contemporary German poetry", "sensationelles",
  "zeichnet im Schnitt weiße Männer um 61 aus, die bei den vier großen Verlagen erscheinen",
  "weiße Männer um 61 ", "weiße Männer um 61 ", "des Monats",
  "des Monats", "des Monats", "Gänsehaut pur!,großes", "herausragender",
  "herausragender", "Tatenvolumen!", "meisterhaft", "Gute Wahl!"
), Span_lore = c(
  "[(31, 45)]", "[(129, 141)]", "[(129, 141)]",
  "[(19, 32)]", "[(99, 111)]", "[(18, 26)]", "[(31, 95)]", "[(12, 27)]",
  "[(12, 27)]", "[(81, 120)]", "[(61, 108)]", "[(14, 28)]", "[(24, 111)]",
  "[(24, 111)]", "[(24, 111)]", "[(81, 91)]", "[(81, 91)]", "[(12, 22)]",
  "[(28, 34), (2, 16)]", "[(55, 69)]", "[(55, 69)]", "[(2, 15)]",
  "[(18, 29)]", "[(2, 12)]"
), Polarity_lore = c(
  "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Negative", "Negative",
  "Negative", "Positive", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Positive", "Positive", "Positive"
), Irony_lore = c(
  "false",
  "false", "false", "true", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "true"
))

# Filter relevant columns: Aspect_category, Polarity_lore, and Irony_lore
df_filtered <- d |>
  select(Aspect_category, Polarity_lore, Irony_lore) 

df_filtered$Polarity_lore <- ifelse(df_filtered$Polarity_lore == "Positive",TRUE,FALSE)
df_filtered$Irony_lore <- ifelse(df_filtered$Irony_lore == "true",TRUE,FALSE)


case1 <- df_filtered[which(df_filtered$Polarity_lore == TRUE &
                           df_filtered$Irony_lore == TRUE),]
case2 <- df_filtered[which(df_filtered$Polarity_lore == TRUE &
                           df_filtered$Irony_lore == FALSE),]
case3 <- df_filtered[which(df_filtered$Polarity_lore == FALSE &
                           df_filtered$Irony_lore == TRUE),]
case4 <- df_filtered[which(df_filtered$Polarity_lore == FALSE &
                           df_filtered$Irony_lore == FALSE),]

case1 <- as.data.frame(tabyl(case1,Aspect_category,Polarity_lore,Irony_lore))
colnames(case1) <- c("Aspect_category", "case1")
case2 <- as.data.frame(tabyl(case2,Aspect_category,Polarity_lore,Irony_lore))
colnames(case2) <- c("Aspect_category", "case2")
case3 <- as.data.frame(tabyl(case3,Aspect_category,Polarity_lore,Irony_lore))
colnames(case3) <- c("Aspect_category","case3")
case4 <- as.data.frame(tabyl(case4,Aspect_category,Polarity_lore,Irony_lore))
colnames(case4) <- c("Aspect_category","case4")

cases <- full_join(case1,case2,by = "Aspect_category")
cases <- full_join(cases,case3,by = "Aspect_category")
cases <- full_join(cases,case4,by = "Aspect_category")

# presentation table
cases |>
  gt() |>
  cols_label(
    Aspect_category = "Aspect",
    case1 = "Positive sentiment, ironic",
    case2 = "Positive sentiment, unironic",
    case3 = "Negative sentiment, ironic",
    case4 = "Negative sentiment, unironic"
  ) |>
  sub_missing()

system · June 12, 2023, 9:03am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.