Problem creating a script to calculate/summarise certain values from a csv file

Hello everyone,

I have a bunch of csv-files (one per corpus) containing annotated data in multiple columns. I wish to look at the results of an aspect based sentiment analysis. Unfortunately it's a lot of different files and outputs to calculate manually. It would be very helpful if I could use a script to automatically do the calculations instead of using the filter.

The first type of document simply contains the annotated aspects: the name of the annotated file ("Document"), the annotated target wordgroup ("Aspect"), and the aspect category ("lore"). Here I would like to know how many of each category there are.

The second type of document contains my annotated data combining aspect and seniment. The only relevant columns are "Aspect category", "Polarity lore" and "Irony lore". The first column contains the category labels which have been assigned to the aspects, such as "Text". The second column assigns a sentiment label to the aspect, namely "Positive" or "Negative" (sometimes there is failed label). The last column tells you wether the sentiment is meant ironically ("true" or "false"). I wish to know how often each aspect category is mentioned in combination with either positive or negative sentiment and how many of these positive or negative sentiments were meant ironically.

Thirdly, I have files containing automatically annotated data, with several columns. The only relevant ones are a column for the aspect category ("Category_Fine") and for the pertaining sentiment ("Polarity").

I wanted to try my hand at the second one first since it seemed the most complicated one. My coding knowledge, however, is very limited. My colleague recommended creating a script via chatGPT . No true surprizes there, but it did not work. Of course, it couldn't hurt to try. I would be very grateful if anyone would be willing and have the time to help.

I will included the failing script below, as it might be of some help? I also have an example csv-file containing only some 25 lines to test it on, but am not sure how I can include it here, so I'll just add the lines at the end, I guess?

library(tidyverse)

# Set the path to your CSV files directory
csv_directory <- "path/to/csv/files"

# Get a list of all CSV files in the directory
csv_files <- list.files(csv_directory, pattern = "\\.csv$", full.names = TRUE)

# Initialize an empty data frame to store the results
results <- data.frame(Category = character(), Positive = integer(), Negative = integer(), Irony_True = integer(), Irony_False = integer(), stringsAsFactors = FALSE)

# Loop through each CSV file
for (csv_file in csv_files) {
  # Read the CSV file into a data frame
  df <- read_csv(csv_file)
  
  # Filter relevant columns: Aspect category, Polarity lore, and Irony lore
  df_filtered <- df %>% 
    select(`Aspect category`, `Polarity lore`, `Irony lore`)
  
  # Calculate the counts for each category, sentiment, and irony combination
  counts <- df_filtered %>% 
    group_by(`Aspect category`, `Polarity lore`, `Irony lore`) %>% 
    summarise(count = n()) %>% 
    ungroup()
  
  # Add the counts to the results data frame
  results <- bind_rows(results, counts)
}

# Write the results to a new CSV file
write_csv(results, "output_file.csv")

Thank you in advance for your time, patience and help!

The example data (tab delimiter):

Document	Aspect	Aspect category	Polarity Trigger	Span lore	Polarity lore	Irony lore
Testfile100_xmi	Jan Wagner	CONTENDER_General	Sprachk├╝nstler	[(31, 45)]	Positive	false
Testfile101_xmi	Gewinner des #B├╝chnerpreis	META_Winner_Award-Ceremony	Gratulation!	[(129, 141)]	Positive	false
Testfile101_xmi	er	CONTENDER_General	Gratulation!	[(129, 141)]	Positive	false
Testfile103_xmi	Schei├čgedicht	TEXT_General	Schei├čgedicht	[(19, 32)]	Negative	true
Testfile106_xmi	Lesung	READING_General	konzertanten	[(99, 111)]	Positive	false
Testfile109_xmi	Alternative B├╝chnerpreis	META_Literature_Literary-Prizes	klotzen!	[(18, 26)]	Positive	false
Testfile109_xmi	Wondratschek	ALLO-REFERENCES_TEXT_Other-Author	Alternative B├╝chnerpreis sollte jedes Jahr an Wondratschek gehen	[(31, 95)]	Positive	false
Testfile10_xmi	B├╝chnerpreistr├Ąger	META_Winner_Award-Ceremony	ganz besonderen	[(12, 27)]	Positive	false
Testfile10_xmi	B├╝chnerpreistr├Ąger	CONTENDER_General	ganz besonderen	[(12, 27)]	Positive	false
Testfile110_xmi	LITERARISCHEN WELT	META_Literature_Literary-Prizes	geht auch nur in der LITERARISCHEN WELT	[(81, 120)]	Negative	false
Testfile119_xmi	Jan Wagner	CONTENDER_General	the Billy Collins of contemporary German poetry	[(61, 108)]	Positive	false
Testfile121_xmi	Gedicht ├╝ber den Giersch	TEXT_General-Content_Plot	sensationelles	[(14, 28)]	Positive	false
Testfile122_xmi	B├╝chnerpreis-Jury zeichnet	JURY_Discussion_Valuation	zeichnet im Schnitt wei├če M├Ąnner um 61 aus, die bei den vier gro├čen Verlagen erscheinen	[(24, 111)]	Negative	false
Testfile122_xmi	wei├če M├Ąnner	CONTENDER_Gender	wei├če M├Ąnner um 61 	[(24, 111)]	Negative	false
Testfile122_xmi	wei├če M├Ąnner um 61	CONTENDER_Age	wei├če M├Ąnner um 61 	[(24, 111)]	Negative	false
Testfile124_xmi	ÔÇ×Gold. RevueÔÇť	TEXT_General	des Monats	[(81, 91)]	Positive	false
Testfile124_xmi	H├Ârspiel	TEXT_Form	des Monats	[(81, 91)]	Positive	false
Testfile125_xmi	Gold. Revue	TEXT_General	des Monats	[(12, 22)]	Positive	false
Testfile128_xmi	b├╝chnerpreis	META_Main-Event	G├Ąnsehaut pur!,gro├čes	[(28, 34), (2, 16)]	Positive	false
Testfile129_xmi	B├╝chnerpreistr├Ąger	CONTENDER_General	herausragender	[(55, 69)]	Positive	false
Testfile129_xmi	B├╝chnerpreistr├Ąger	META_Winner_Award-Ceremony	herausragender	[(55, 69)]	Positive	false
Testfile132_xmi	B├╝chnerpreis	META_Main-Event	Tatenvolumen!	[(2, 15)]	Positive	false
Testfile135_xmi	Er	CONTENDER_General	meisterhaft	[(18, 29)]	Positive	false
Testfile139_xmi	Gute Wahl! #GeorgB├╝chnerPreis	META_Winner_Award-Ceremony	Gute Wahl!	[(2, 12)]	Positive	true

Have you considered doing this in {tidytext} with the help of its online text? It's straightforward to make a corpus out of the csv and perform sentiment analysis and other NLP. Come back if that's no help, though.

Thank you very much for your reply! Like I said, my coding knowledge is extremely limited. When I look at a existing script I can usually get an idea what it's about, but that's about the extent of it; It's also the reason why I tried to rely on ChatGPT for the script creation.

I had not heard of tidytext yet. I've skimmed through the chapters, and while it seems like a very useful way to perform a sentiment analysis, at the moment I simply need a script to calculate the results of one which has already been performed by counting all instances.

EDIT: small change to sentence for clarification

1 Like

That's already the hardest part of f(x) = y, clearly stating the matter to be determined. Everything in R is an object, including operators, such as + - * / and all functions f.

x is what is to hand and y is what is to be derived with f.

To start, x is a set of csv files, and each is to be converted to a data frame for further treatment, selection of the three variables of interest. Bringing us to

What does the current code do, using the sample data, shown by a a reprex (see the FAQ).

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# variable names changed to remove need for back quotes
d <- data.frame(Document = c(
  "Testfile100_xmi", "Testfile101_xmi",
  "Testfile101_xmi", "Testfile103_xmi", "Testfile106_xmi", "Testfile109_xmi",
  "Testfile109_xmi", "Testfile10_xmi", "Testfile10_xmi", "Testfile110_xmi",
  "Testfile119_xmi", "Testfile121_xmi", "Testfile122_xmi", "Testfile122_xmi",
  "Testfile122_xmi", "Testfile124_xmi", "Testfile124_xmi", "Testfile125_xmi",
  "Testfile128_xmi", "Testfile129_xmi", "Testfile129_xmi", "Testfile132_xmi",
  "Testfile135_xmi", "Testfile139_xmi"
), Aspect = c(
  "Jan Wagner",
  "Gewinner des #B├╝chnerpreis", "er", "Schei├čgedicht", "Lesung",
  "Alternative B├╝chnerpreis", "Wondratschek", "B├╝chnerpreistr├Ąger",
  "B├╝chnerpreistr├Ąger", "LITERARISCHEN WELT", "Jan Wagner", "Gedicht ├╝ber den Giersch",
  "B├╝chnerpreis-Jury zeichnet", "wei├če M├Ąnner", "wei├če M├Ąnner um 61",
  "ÔÇ×Gold. RevueÔÇť", "H├Ârspiel", "Gold. Revue", "b├╝chnerpreis",
  "B├╝chnerpreistr├Ąger", "B├╝chnerpreistr├Ąger", "B├╝chnerpreis",
  "Er", "Gute Wahl! #GeorgB├╝chnerPreis"
), Aspect_category = c(
  "CONTENDER_General",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "TEXT_General",
  "READING_General", "META_Literature_Literary-Prizes", "ALLO-REFERENCES_TEXT_Other-Author",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "META_Literature_Literary-Prizes",
  "CONTENDER_General", "TEXT_General-Content_Plot", "JURY_Discussion_Valuation",
  "CONTENDER_Gender", "CONTENDER_Age", "TEXT_General", "TEXT_Form",
  "TEXT_General", "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony",
  "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony"
), Polarity_Trigger = c(
  "Sprachk├╝nstler", "Gratulation!", "Gratulation!",
  "Schei├čgedicht", "konzertanten", "klotzen!", "Alternative B├╝chnerpreis sollte jedes Jahr an Wondratschek gehen",
  "ganz besonderen", "ganz besonderen", "geht auch nur in der LITERARISCHEN WELT",
  "the Billy Collins of contemporary German poetry", "sensationelles",
  "zeichnet im Schnitt wei├če M├Ąnner um 61 aus, die bei den vier gro├čen Verlagen erscheinen",
  "wei├če M├Ąnner um 61 ", "wei├če M├Ąnner um 61 ", "des Monats",
  "des Monats", "des Monats", "G├Ąnsehaut pur!,gro├čes", "herausragender",
  "herausragender", "Tatenvolumen!", "meisterhaft", "Gute Wahl!"
), Span_lore = c(
  "[(31, 45)]", "[(129, 141)]", "[(129, 141)]",
  "[(19, 32)]", "[(99, 111)]", "[(18, 26)]", "[(31, 95)]", "[(12, 27)]",
  "[(12, 27)]", "[(81, 120)]", "[(61, 108)]", "[(14, 28)]", "[(24, 111)]",
  "[(24, 111)]", "[(24, 111)]", "[(81, 91)]", "[(81, 91)]", "[(12, 22)]",
  "[(28, 34), (2, 16)]", "[(55, 69)]", "[(55, 69)]", "[(2, 15)]",
  "[(18, 29)]", "[(2, 12)]"
), Polarity_lore = c(
  "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Negative", "Negative",
  "Negative", "Positive", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Positive", "Positive", "Positive"
), Irony_lore = c(
  "false",
  "false", "false", "true", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "true"
))

# Filter relevant columns: Aspect_category, Polarity_lore, and Irony_lore
df_filtered <- d |>
  select(Aspect_category, Polarity_lore, Irony_lore)

# Calculate the counts for each category, sentiment, and irony combination
counts <- df_filtered |> 
  group_by(Aspect_category, Polarity_lore, Irony_lore) |>
  summarise(count = n()) |>
  ungroup()
#> `summarise()` has grouped output by 'Aspect_category', 'Polarity_lore'. You can
#> override using the `.groups` argument.
counts
#> # A tibble: 15 ├Ś 4
#>    Aspect_category                   Polarity_lore Irony_lore count
#>    <chr>                             <chr>         <chr>      <int>
#>  1 ALLO-REFERENCES_TEXT_Other-Author Positive      false          1
#>  2 CONTENDER_Age                     Negative      false          1
#>  3 CONTENDER_Gender                  Negative      false          1
#>  4 CONTENDER_General                 Positive      false          6
#>  5 JURY_Discussion_Valuation         Negative      false          1
#>  6 META_Literature_Literary-Prizes   Negative      false          1
#>  7 META_Literature_Literary-Prizes   Positive      false          1
#>  8 META_Main-Event                   Positive      false          2
#>  9 META_Winner_Award-Ceremony        Positive      false          3
#> 10 META_Winner_Award-Ceremony        Positive      true           1
#> 11 READING_General                   Positive      false          1
#> 12 TEXT_Form                         Positive      false          1
#> 13 TEXT_General                      Negative      true           1
#> 14 TEXT_General                      Positive      false          2
#> 15 TEXT_General-Content_Plot         Positive      false          1

Created on 2023-05-21 with reprex v2.0.2

The result is arguably what we were looking for, but the arrangement isn't very useful, especially since several of these are going to be pasted together. When this happens, it usually help to design y first, and then work backwards. To begin let's look at the sample spaceÔÇöwhat is the set of possible outcomes?

There are four possibilities for each aspect with respect to sentiment and irony

  • POSITIVE_TRUE
  • POSITIVE_FALSE
  • NEGATIVE_TRUE
  • NEGATIVE_FALSE

To produce this table

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(gt)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

d <- data.frame(Document = c(
  "Testfile100_xmi", "Testfile101_xmi",
  "Testfile101_xmi", "Testfile103_xmi", "Testfile106_xmi", "Testfile109_xmi",
  "Testfile109_xmi", "Testfile10_xmi", "Testfile10_xmi", "Testfile110_xmi",
  "Testfile119_xmi", "Testfile121_xmi", "Testfile122_xmi", "Testfile122_xmi",
  "Testfile122_xmi", "Testfile124_xmi", "Testfile124_xmi", "Testfile125_xmi",
  "Testfile128_xmi", "Testfile129_xmi", "Testfile129_xmi", "Testfile132_xmi",
  "Testfile135_xmi", "Testfile139_xmi"
), Aspect = c(
  "Jan Wagner",
  "Gewinner des #B├╝chnerpreis", "er", "Schei├čgedicht", "Lesung",
  "Alternative B├╝chnerpreis", "Wondratschek", "B├╝chnerpreistr├Ąger",
  "B├╝chnerpreistr├Ąger", "LITERARISCHEN WELT", "Jan Wagner", "Gedicht ├╝ber den Giersch",
  "B├╝chnerpreis-Jury zeichnet", "wei├če M├Ąnner", "wei├če M├Ąnner um 61",
  "ÔÇ×Gold. RevueÔÇť", "H├Ârspiel", "Gold. Revue", "b├╝chnerpreis",
  "B├╝chnerpreistr├Ąger", "B├╝chnerpreistr├Ąger", "B├╝chnerpreis",
  "Er", "Gute Wahl! #GeorgB├╝chnerPreis"
), Aspect_category = c(
  "CONTENDER_General",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "TEXT_General",
  "READING_General", "META_Literature_Literary-Prizes", "ALLO-REFERENCES_TEXT_Other-Author",
  "META_Winner_Award-Ceremony", "CONTENDER_General", "META_Literature_Literary-Prizes",
  "CONTENDER_General", "TEXT_General-Content_Plot", "JURY_Discussion_Valuation",
  "CONTENDER_Gender", "CONTENDER_Age", "TEXT_General", "TEXT_Form",
  "TEXT_General", "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony",
  "META_Main-Event", "CONTENDER_General", "META_Winner_Award-Ceremony"
), Polarity_Trigger = c(
  "Sprachk├╝nstler", "Gratulation!", "Gratulation!",
  "Schei├čgedicht", "konzertanten", "klotzen!", "Alternative B├╝chnerpreis sollte jedes Jahr an Wondratschek gehen",
  "ganz besonderen", "ganz besonderen", "geht auch nur in der LITERARISCHEN WELT",
  "the Billy Collins of contemporary German poetry", "sensationelles",
  "zeichnet im Schnitt wei├če M├Ąnner um 61 aus, die bei den vier gro├čen Verlagen erscheinen",
  "wei├če M├Ąnner um 61 ", "wei├če M├Ąnner um 61 ", "des Monats",
  "des Monats", "des Monats", "G├Ąnsehaut pur!,gro├čes", "herausragender",
  "herausragender", "Tatenvolumen!", "meisterhaft", "Gute Wahl!"
), Span_lore = c(
  "[(31, 45)]", "[(129, 141)]", "[(129, 141)]",
  "[(19, 32)]", "[(99, 111)]", "[(18, 26)]", "[(31, 95)]", "[(12, 27)]",
  "[(12, 27)]", "[(81, 120)]", "[(61, 108)]", "[(14, 28)]", "[(24, 111)]",
  "[(24, 111)]", "[(24, 111)]", "[(81, 91)]", "[(81, 91)]", "[(12, 22)]",
  "[(28, 34), (2, 16)]", "[(55, 69)]", "[(55, 69)]", "[(2, 15)]",
  "[(18, 29)]", "[(2, 12)]"
), Polarity_lore = c(
  "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Negative", "Positive", "Positive", "Negative", "Negative",
  "Negative", "Positive", "Positive", "Positive", "Positive", "Positive",
  "Positive", "Positive", "Positive", "Positive"
), Irony_lore = c(
  "false",
  "false", "false", "true", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "false", "false", "false", "false", "false", "false",
  "false", "true"
))

# Filter relevant columns: Aspect_category, Polarity_lore, and Irony_lore
df_filtered <- d |>
  select(Aspect_category, Polarity_lore, Irony_lore) 

df_filtered$Polarity_lore <- ifelse(df_filtered$Polarity_lore == "Positive",TRUE,FALSE)
df_filtered$Irony_lore <- ifelse(df_filtered$Irony_lore == "true",TRUE,FALSE)


case1 <- df_filtered[which(df_filtered$Polarity_lore == TRUE &
                           df_filtered$Irony_lore == TRUE),]
case2 <- df_filtered[which(df_filtered$Polarity_lore == TRUE &
                           df_filtered$Irony_lore == FALSE),]
case3 <- df_filtered[which(df_filtered$Polarity_lore == FALSE &
                           df_filtered$Irony_lore == TRUE),]
case4 <- df_filtered[which(df_filtered$Polarity_lore == FALSE &
                           df_filtered$Irony_lore == FALSE),]

case1 <- as.data.frame(tabyl(case1,Aspect_category,Polarity_lore,Irony_lore))
colnames(case1) <- c("Aspect_category", "case1")
case2 <- as.data.frame(tabyl(case2,Aspect_category,Polarity_lore,Irony_lore))
colnames(case2) <- c("Aspect_category", "case2")
case3 <- as.data.frame(tabyl(case3,Aspect_category,Polarity_lore,Irony_lore))
colnames(case3) <- c("Aspect_category","case3")
case4 <- as.data.frame(tabyl(case4,Aspect_category,Polarity_lore,Irony_lore))
colnames(case4) <- c("Aspect_category","case4")

cases <- full_join(case1,case2,by = "Aspect_category")
cases <- full_join(cases,case3,by = "Aspect_category")
cases <- full_join(cases,case4,by = "Aspect_category")

# presentation table
cases |>
  gt() |>
  cols_label(
    Aspect_category = "Aspect",
    case1 = "Positive sentiment, ironic",
    case2 = "Positive sentiment, unironic",
    case3 = "Negative sentiment, ironic",
    case4 = "Negative sentiment, unironic"
  ) |>
  sub_missing()

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.