issue computing simple bar plot with counts over time

nbaes · February 28, 2023, 4:17am

Issue: I am not sure why I keep getting errors when I try to draw a bar plot of counts of something over time. My data set and code currently look like this; any help would be much appreciated!

library(ggplot2)

df <- tibble::tribble(
~year,     ~count,
1930,        1,
1931,        2,
1932,        5,
1933,        6,
1934,        6,
1935,        9,
1936,        13,
1937,        15,
1938,        16,
1939,        19,
1940,        26
)

freq_abstracts_fig <- freq_abstracts_year |>
  ggplot(aes(x = year, y = count)) + 
  geom_bar() +
  theme_classic() +
  xlab("\nYear") +
  ylab("Frequency of Abstracts\n") 
print(freq_abstracts_fig)

Error: Error in geom_bar():
! Problem while computing stat.
Error occurred in the 1st layer.
Caused by error in setup_params():
! stat_count() must only have an x or y aesthetic.
Run rlang::last_error() to see where the error occurred.

FJCC · February 28, 2023, 4:51am

You need to use geom_col(). That takes an x and a y aesthetic and uses the value of y to set the bar height. The geom_bar() takes only x or y and counts how many times each value appears. I included such a plot as the second one in my example below. All of the bars have a height of one because each year appears once in your data.

library(ggplot2)

df <- tibble::tribble(
  ~year,     ~count,
  1930,        1,
  1931,        2,
  1932,        5,
  1933,        6,
  1934,        6,
  1935,        9,
  1936,        13,
  1937,        15,
  1938,        16,
  1939,       19,
  1940,        26,
)

freq_abstracts_fig <- df |>
  ggplot(aes(x = year, y = count)) + 
  geom_col() +
  theme_classic() +
  xlab("\nYear") +
  ylab("Frequency of Abstracts\n") 
print(freq_abstracts_fig)


df |> ggplot(aes(x = year)) + geom_bar()

Created on 2023-02-27 with reprex v2.0.2
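To make the difference concrete, here is a minimal sketch in base R of what each geom works from. The toy data frame holds just the first three rows of the example above, and `table()` stands in for the row counting that `geom_bar()` does by default; that is a simplification for illustration, not ggplot2's actual code path.

```r
# Toy data in the same shape as the example: one row per year,
# with a precomputed count column.
df <- data.frame(
  year  = c(1930, 1931, 1932),
  count = c(1, 2, 5)
)

# Roughly what geom_bar() plots by default: the number of rows per x value.
rows_per_year <- table(df$year)
print(rows_per_year)

# What geom_col() plots: the y values exactly as stored.
heights <- setNames(df$count, df$year)
print(heights)
```

If you would rather keep `geom_bar()`, `geom_bar(stat = "identity")` draws the same bars as `geom_col()`.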

nbaes · February 28, 2023, 5:21am

Thank you very much! That makes a lot of sense. However, the graph is still not coming out right, and I am not sure why:

freq_abstracts_fig <- freq_abstracts_year |>
  ggplot(aes(x = year, y = count)) + 
  geom_col() +
  theme_classic() +
  xlab("\nYear") +
  ylab("Frequency of Abstracts\n") 

print(freq_abstracts_fig)

FJCC · February 28, 2023, 6:23am

It seems one of your year values is nearly 6000 and there are others near 4000. Please post the output of

summary(freq_abstracts_year)

How many rows does your data set have? The chart suggests there are many rows, which might make a column plot impractical. A line plot might work better, once the data are cleaned up.
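As a rough sketch of the line-plot idea, assuming the count column is already numeric: the three rows below are the last three years from the data posted later in this thread, and the names `toy` and `line_fig` are only for illustration.

```r
library(ggplot2)

# Three rows taken from the end of the posted data (2014 to 2016).
toy <- data.frame(
  year  = c(2014, 2015, 2016),
  count = c(43907, 45256, 53325)
)

# Same aesthetics and labels as the column plot, with geom_line() swapped in.
line_fig <- ggplot(toy, aes(x = year, y = count)) +
  geom_line() +
  theme_classic() +
  xlab("\nYear") +
  ylab("Frequency of Abstracts\n")
print(line_fig)
```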

nbaes · February 28, 2023, 6:35am

Thank you!

> summary(freq_abstracts_year)

      year         count
 Min.   :1930   Length:86
 1st Qu.:1952   Class :character
 Median :1974   Mode  :character
 Mean   :1973
 3rd Qu.:1995
 Max.   :2016

The data set has 86 rows.
Min for count = 1; max for count = 53325.
The data are not that tidy (given the small samples in the early years), but I would like to plot them, perhaps truncating the x-axis if necessary. I am hoping to roughly replicate this graph:

FJCC · February 28, 2023, 6:41am

Your count column has the class character. Run

NewDF  <- freq_abstracts_year
NewDF$count <- as.numeric(NewDF$count)

If you don't get any warnings or errors, try plotting NewDF. If you do get a warning or error, please post the output of

dput(freq_abstracts_year)

Put a line with three back ticks just before and after the output, like this
```
output of dput() goes here
```

nbaes · February 28, 2023, 6:56am

Thank you so much, @FJCC! I have learnt useful things here.

Changing count to numeric worked; I hadn't even realised that was the issue. Good to know that summary(df) is a good function to run if something is going wrong!
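Alongside `summary()`, a quick way to check every column's class before plotting is `sapply(df, class)`. This is a minimal sketch on toy data (three rows from the end of the posted data, with count deliberately stored as text); the names `toy`, `col_classes` and `non_numeric` are made up for the example.

```r
# Toy data: count is stored as character, as in the problem above.
toy <- data.frame(
  year  = c(2014, 2015, 2016),
  count = c("43907", "45256", "53325"),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are not numeric and may need converting.
non_numeric <- names(col_classes)[col_classes != "numeric"]
print(non_numeric)
```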

nbaes · February 28, 2023, 7:02am

Not sure why (I have tried closing and restarting R), but there is now an error (and I have changed the name of the new dataset slightly):

freq_abstracts_year2  <- freq_abstracts_year
freq_abstracts_year2$count <- as.numeric(freq_abstracts_year2$count)

Warning message:
NAs introduced by coercion

NOTE: It seems to be dropping all values just before 1990

OUTPUT OF DPUT

dput(freq_abstracts_year)

structure(list(year = c(1930, 1931, 1932, 1933, 1934, 1936, 1937, 
1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 
1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 
1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 
1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 
1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 
1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 
2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 
2015, 2016), count = c("    1", "    2", "    1", "    6", 
"    1", "   20", "   31", "   19", "   18", "   35", 
"   39", "   37", "   37", "   33", "   41", "   41", 
"   30", "   16", "   26", "   36", "   31", "   34", 
"   53", "   54", "   59", "  109", "  117", "  118", 
"  126", "  209", "  227", "  358", "  773", " 1013", " 1301", 
" 1274", " 1481", " 1914", " 1915", " 2402", " 2806", " 2944", 
" 3618", " 3953", " 4512", " 5236", " 5596", " 6021", " 6318", 
" 6757", " 7106", " 7455", " 7780", " 8169", " 8256", " 8935", 
" 9158", " 9812", "11248", "11526", "11713", "12215", "12919", 
"13823", "14680", "15754", "16085", "15468", "15479", "17021", 
"17661", "18432", "19699", "21666", "22338", "25575", "28061", 
"29130", "31238", "33273", "36172", "37883", "42136", "43907", 
"45256", "53325")), row.names = c(NA, -86L), class = c("tbl_df", 
"tbl", "data.frame"))

nirgrahamuk · February 28, 2023, 9:23am

But this behaviour does not appear when I run your code on the example data you provided.

I will give you one possible approach (of many) to investigating issues like this.

#example of using na.action to locate where your nas happened.
# if you know where they happened you can look at what they tried to do 
# and gain understanding

(df_1 <- data.frame(counts = c("1","x","3")) )
(df_1$numcounts <- as.numeric(df_1$counts))

(rows_where_bad <- na.action(na.omit(df_1)))

df_1[rows_where_bad,]

nbaes · February 28, 2023, 11:42pm

Hi @nirgrahamuk ! This is a better example of what is happening (below). It occurs right after I change the count column to as.numeric with values ranging from 1: 2402 (forcing NA values from counts in 1930:1970).

> dput(freq_abstracts_year2)

structure(list(year = c(1930, 1931, 1932, 1933, 1934, 1936, 1937,
1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948,
1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
2015, 2016), count = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
11248, 11526, 11713, 12215, 12919, 13823, 14680, 15754, 16085,
15468, 15479, 17021, 17661, 18432, 19699, 21666, 22338, 25575,
28061, 29130, 31238, 33273, 36172, 37883, 42136, 43907, 45256,
53325)), row.names = c(NA, -86L), class = c("tbl_df", "tbl",
"data.frame"))

I cannot see a pattern, or think why; would you know how to troubleshoot from here?

nirgrahamuk · March 1, 2023, 12:22am

You've shown the post-transformation values, where NA resulted rather than a number, but this is not directly informative. You need to find examples of the sort of character strings that represent numbers in your data but fail the as.numeric() operation.

It seems that freq_abstracts_year is your original, and you apply as.numeric() to make freq_abstracts_year2, so you need to look at freq_abstracts_year2 to determine the problematic rows and then look at those rows as they were in freq_abstracts_year.

If you find something like "123x" instead of "123", that would be a clear explanation of the problem.
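One way to pull out exactly those strings, sketched in base R on the same toy values as the earlier na.action() example. The variable names are made up for illustration; for the real data you would replace `counts_raw` with `freq_abstracts_year$count`.

```r
# Toy input: the middle value cannot be read as a number.
counts_raw <- c("1", "x", "3")

# Convert, silencing the "NAs introduced by coercion" warning on purpose.
counts_num <- suppressWarnings(as.numeric(counts_raw))

# TRUE where conversion produced NA although the original was not NA.
failed <- is.na(counts_num) & !is.na(counts_raw)

print(which(failed))       # positions of the problem rows
print(counts_raw[failed])  # the original strings that failed
```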

nbaes · March 1, 2023, 12:41am

I see what you mean now, @nirgrahamuk - thanks.

I ran the code you recommended above to yield a data frame with an extra column (numcounts) for what happens after the as.numeric transformation, and this is it:

(freq_abstracts_year$numcounts <- as.numeric(freq_abstracts_year$count))

(rows_where_bad <- na.action(na.omit(freq_abstracts_year)))

freq_abstracts_year[rows_where_bad,]

dput(freq_abstracts_year)

structure(list(year = c(1930, 1931, 1932, 1933, 1934, 1936, 1937,
1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948,
1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
2015, 2016), count = c(" 1", " 2", " 1", " 6",
" 1", " 20", " 31", " 19", " 18", " 35",
" 39", " 37", " 37", " 33", " 41", " 41",
" 30", " 16", " 26", " 36", " 31", " 34",
" 53", " 54", " 59", " 109", " 117", " 118",
" 126", " 209", " 227", " 358", " 773", " 1013", " 1301",
" 1274", " 1481", " 1914", " 1915", " 2402", " 2806", " 2944",
" 3618", " 3953", " 4512", " 5236", " 5596", " 6021", " 6318",
" 6757", " 7106", " 7455", " 7780", " 8169", " 8256", " 8935",
" 9158", " 9812", "11248", "11526", "11713", "12215", "12919",
"13823", "14680", "15754", "16085", "15468", "15479", "17021",
"17661", "18432", "19699", "21666", "22338", "25575", "28061",
"29130", "31238", "33273", "36172", "37883", "42136", "43907",
"45256", "53325"), numcounts = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 11248, 11526, 11713, 12215, 12919, 13823, 14680,
15754, 16085, 15468, 15479, 17021, 17661, 18432, 19699, 21666,
22338, 25575, 28061, 29130, 31238, 33273, 36172, 37883, 42136,
43907, 45256, 53325)), row.names = c(NA, -86L), class = c("tbl_df",
"tbl", "data.frame"))

FJCC · March 1, 2023, 1:24am

You can use trimws() to get rid of the leading spaces in the count column.

DF <- structure(list(year = c(1930, 1931, 1932, 1933, 1934, 1936, 1937,
                        1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948,
                        1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
                        1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
                        1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
                        1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
                        1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
                        2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
                        2015, 2016), 
               count = c(" 1", " 2", " 1", " 6",
                         " 1", " 20", " 31", " 19", " 18", " 35",
                         " 39", " 37", " 37", " 33", " 41", " 41",
                         " 30", " 16", " 26", " 36", " 31", " 34",
                         " 53", " 54", " 59", " 109", " 117", " 118",
                         " 126", " 209", " 227", " 358", " 773", " 1013", " 1301",
                         " 1274", " 1481", " 1914", " 1915", " 2402", " 2806", " 2944",
                         " 3618", " 3953", " 4512", " 5236", " 5596", " 6021", " 6318",
                         " 6757", " 7106", " 7455", " 7780", " 8169", " 8256", " 8935",
                         " 9158", " 9812", "11248", "11526", "11713", "12215", "12919",
                         "13823", "14680", "15754", "16085", "15468", "15479", "17021",
                         "17661", "18432", "19699", "21666", "22338", "25575", "28061",
                         "29130", "31238", "33273", "36172", "37883", "42136", "43907",
                         "45256", "53325"), 
               numcounts = c(NA, NA, NA, NA, NA, NA, NA,
                             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
                             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
                             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
                             NA, NA, NA, 11248, 11526, 11713, 12215, 12919, 13823, 14680,
                             15754, 16085, 15468, 15479, 17021, 17661, 18432, 19699, 21666,
                             22338, 25575, 28061, 29130, 31238, 33273, 36172, 37883, 42136,
                             43907, 45256, 53325)), 
          row.names = c(NA, -86L), class = c("tbl_df","tbl", "data.frame"))

DF$numcounts2 <- as.numeric(trimws(DF$count))
summary(DF)    
#>       year         count             numcounts       numcounts2   
#>  Min.   :1930   Length:86          Min.   :11248   Min.   :    1  
#>  1st Qu.:1952   Class :character   1st Qu.:15271   1st Qu.:   44  
#>  Median :1974   Mode  :character   Median :19066   Median : 3786  
#>  Mean   :1973                      Mean   :24060   Mean   : 9327  
#>  3rd Qu.:1995                      3rd Qu.:31747   3rd Qu.:14466  
#>  Max.   :2016                      Max.   :53325   Max.   :53325  
#>                                    NA's   :58

Created on 2023-02-28 with reprex v2.0.2

nbaes · March 1, 2023, 1:52am

Wow! That is exactly the problem that I completely missed. Thank you so much @FJCC (and @nirgrahamuk ). I really appreciate it.

nbaes · March 1, 2023, 2:06am

@FJCC I am having some issues applying this function to the count column in "freq_abstracts_year". Would you know why the below is not working? As in, the same error (with the NAs) results after line 3.

freq_abstracts_year2  <- freq_abstracts_year
freq_abstracts_year2$count <- trimws(freq_abstracts_year2$count, "l")
freq_abstracts_year2$count <- as.numeric(freq_abstracts_year2$count)
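For reference, a small sketch of what the second argument of trimws() does, on a made-up string padded with ordinary spaces. The `which` argument accepts "both" (the default), "left" or "right", and "l" is matched to "left".

```r
padded <- "  42  "  # made-up value with ordinary ASCII spaces on both sides

both_trimmed  <- trimws(padded)           # strips both ends
left_trimmed  <- trimws(padded, "l")      # strips the left end only
right_trimmed <- trimws(padded, "right")  # strips the right end only

print(c(both_trimmed, left_trimmed, right_trimmed))

# as.numeric() tolerates ordinary leading and trailing spaces by itself,
# so the trimmed and untrimmed strings convert to the same number.
print(as.numeric(padded))
print(as.numeric(left_trimmed))
```

So if the NAs persist after trimws(), one plausible reading is that the padding is not made of ordinary spaces.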

FJCC · March 1, 2023, 2:13am

I do not see a problem with your code.
Please run the first two lines and then post the output of

dput(freq_abstracts_year2)

nbaes · March 1, 2023, 2:14am

Thanks! To me, it looks like it added spaces rather than removing them on the left?

> dput(freq_abstracts_year2)
structure(list(year = c(1930, 1931, 1932, 1933, 1934, 1936, 1937, 
1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 
1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 
1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 
1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 
1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 
1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 
2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 
2015, 2016), count = c("    1", "    2", "    1", "    6", 
"    1", "   20", "   31", "   19", "   18", "   35", 
"   39", "   37", "   37", "   33", "   41", "   41", 
"   30", "   16", "   26", "   36", "   31", "   34", 
"   53", "   54", "   59", "  109", "  117", "  118", 
"  126", "  209", "  227", "  358", "  773", " 1013", " 1301", 
" 1274", " 1481", " 1914", " 1915", " 2402", " 2806", " 2944", 
" 3618", " 3953", " 4512", " 5236", " 5596", " 6021", " 6318", 
" 6757", " 7106", " 7455", " 7780", " 8169", " 8256", " 8935", 
" 9158", " 9812", "11248", "11526", "11713", "12215", "12919", 
"13823", "14680", "15754", "16085", "15468", "15479", "17021", 
"17661", "18432", "19699", "21666", "22338", "25575", "28061", 
"29130", "31238", "33273", "36172", "37883", "42136", "43907", 
"45256", "53325")), row.names = c(NA, -86L), class = c("tbl_df", 
"tbl", "data.frame"))

FJCC · March 1, 2023, 2:19am

Using your latest dput() output, I do not get any warnings.

freq_abstracts_year2 <- structure(list(year = c(1930, 1931, 1932, 1933, 1934, 1936, 1937, 
                        1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 
                        1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 
                        1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 
                        1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 
                        1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 
                        1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 
                        2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 
                        2015, 2016), 
               count = c("    1", "    2", "    1", "    6", 
                                               "    1", "   20", "   31", "   19", "   18", "   35", 
                                               "   39", "   37", "   37", "   33", "   41", "   41", 
                                               "   30", "   16", "   26", "   36", "   31", "   34", 
                                               "   53", "   54", "   59", "  109", "  117", "  118", 
                                               "  126", "  209", "  227", "  358", "  773", " 1013", " 1301", 
                                               " 1274", " 1481", " 1914", " 1915", " 2402", " 2806", " 2944", 
                                               " 3618", " 3953", " 4512", " 5236", " 5596", " 6021", " 6318", 
                                               " 6757", " 7106", " 7455", " 7780", " 8169", " 8256", " 8935", 
                                               " 9158", " 9812", "11248", "11526", "11713", "12215", "12919", 
                                               "13823", "14680", "15754", "16085", "15468", "15479", "17021", 
                                               "17661", "18432", "19699", "21666", "22338", "25575", "28061", 
                                               "29130", "31238", "33273", "36172", "37883", "42136", "43907", 
                                               "45256", "53325")), 
          row.names = c(NA, -86L), class = c("tbl_df", 
                                                                                                      "tbl", "data.frame"))
freq_abstracts_year2$count <- trimws(freq_abstracts_year2$count, "l")
freq_abstracts_year2$count <- as.numeric(freq_abstracts_year2$count)
summary(freq_abstracts_year2)
#>       year          count      
#>  Min.   :1930   Min.   :    1  
#>  1st Qu.:1952   1st Qu.:   44  
#>  Median :1974   Median : 3786  
#>  Mean   :1973   Mean   : 9327  
#>  3rd Qu.:1995   3rd Qu.:14466  
#>  Max.   :2016   Max.   :53325

Created on 2023-02-28 with reprex v2.0.2
Try copying my code above, including the structure() and running it.

nbaes · March 1, 2023, 2:25am

Very strange! It worked, and the graph is finally displaying the 1930 values onwards. I tried restarting R several times too.

nirgrahamuk · March 1, 2023, 10:27am

It's not clear to me whether your issue is truly solved, or whether you consider it solved because the dput() version of the data 'cleaned it up'.
I think the data source you had may have been problematic, and dput() is secretly 'cleaning up'.
Here are some thoughts, with code in R:

#normal space would be fine 
as.numeric(" 123")

hiding <- "\u00A0123" # a unicode symbol for a non-breaking space character ...
print(hiding)
# as.numeric() should itself ignore leading and trailing whitespace anyway,
# as shown in my first example; using trimws() as well doesn't help :(
as.numeric(trimws(hiding))

#does this show in dput ?
dput(hiding) # no; dput does not reveal the 'true' data which has the unicode symbol; we get a cleaned up version

#to me this is evidence that your data has some similar sort of 'pollution' of unicode symbols
# that appears to the eye as a conventional space but is not simply that.

# readr can cleverly deal with it... (I haven't investigated how)
readr::parse_number(hiding)

#in base R - using iconv to go from unicode to a smaller strict ascii set might solve
as.numeric(iconv(hiding, to = "ASCII//TRANSLIT"))
# 
# from chatGPT : 
#   ASCII//TRANSLIT is a character encoding that maps non-ASCII characters to their closest ASCII equivalents using a transliteration algorithm. The ASCII part of the encoding specifies that the output should only contain ASCII characters (i.e., those with code points in the range 0-127), and the TRANSLIT part specifies that non-ASCII characters should be replaced with the closest ASCII equivalent where possible.
# 
# For example, the non-breaking space character (U+00A0) would be transliterated to a regular space (U+0020) in the ASCII//TRANSLIT encoding, as they are similar in function and appearance.
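Two more base R options, offered as a sketch rather than a diagnosis of the original file (we never saw the raw source, so the non-breaking space is still an assumption). utf8ToInt() shows the code points behind a string, which makes a hidden character visible, and the `whitespace` argument of trimws() (available since R 3.6.0) can be widened to cover Unicode whitespace. The variable names below are made up for the example.

```r
hiding <- "\u00A0123"  # non-breaking space (U+00A0) followed by digits

# Code points reveal what printing makes hard to see:
# 160 is U+00A0, whereas an ordinary space would be 32.
code_points <- utf8ToInt(hiding)
print(code_points)

# Option 1: let trimws() treat all Unicode horizontal and vertical
# whitespace as trimmable via its `whitespace` argument.
trimmed <- trimws(hiding, whitespace = "[\\h\\v]")

# Option 2: remove the non-breaking space explicitly.
stripped <- gsub("\u00A0", "", hiding, fixed = TRUE)

print(as.numeric(trimmed))
print(as.numeric(stripped))
```

Running `utf8ToInt()` on one of the strings that fails as.numeric() in the real data would confirm or rule out this explanation.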