Trouble plotting data

Hi, I'm new at using R, and am having trouble with plotting a public data set. I'm using the HIV/AIDS Diagnosis by Neighborhood, Sex, and Race/Ethnicity (https://data.cityofnewyork.us/Health/HIV-AIDS-Diagnoses-by-Neighborhood-Sex-and-Race-Et/ykvb-493p/about_data) but can't seem to get a graph. I used:
#ggplot(data = HIV_AIDS_Diagnoses_by_Neighborhood_Sex_and_Race_Ethnicity_20240131, mapping = aes(x = "Total Number of HIV Diagnoses", y ="Total Number of AIDS Diagnoses" ))+ geom_point() + geom_line()

Any help would be greatly appreciated!

The two columns you are trying to plot have a few rows containing an asterisk. This forces the whole column to be of the data type character. I used as.numeric() to convert them to numbers. I also manually edited the original file to shorten those two column names, just to save typing.

library(ggplot2)
DF <- read.csv("~/R/Play/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20240205.csv")
DF$AIDS_DIAG <- as.numeric(DF$AIDS_DIAG)
#> Warning: NAs introduced by coercion
DF$HIV_DIAG <- as.numeric(DF$HIV_DIAG)
#> Warning: NAs introduced by coercion
#which rows are not numbers?
which(is.na(DF$AIDS_DIAG))
#>  [1]   12  445  505  614 1080 1431 1995 2388 2499 2623 6329 7352 8216
which(is.na(DF$HIV_DIAG))
#>  [1]  445  721  941 1080 2051 2056 2180 2388 2460 2771 3737 4763 6329 8207 8216
#> [16] 8219

ggplot(DF, 
       mapping = aes(x = HIV_DIAG, y = AIDS_DIAG))+ 
  geom_point() + geom_line()
#> Warning: Removed 24 rows containing missing values (`geom_point()`).
#> Warning: Removed 16 rows containing missing values (`geom_line()`).

Created on 2024-02-05 with reprex v2.0.2
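
As an alternative to editing the file by hand, both issues could probably be handled at read time. The sketch below is only an illustration: it uses a tiny inline CSV with invented values in place of the real download, borrows the two column names from the question, and assumes the suppressed cells contain exactly "*" with no surrounding spaces.

```r
# Toy CSV text standing in for the downloaded file; the values are invented.
csv_text <- "Total Number of HIV Diagnoses,Total Number of AIDS Diagnoses
12,5
*,7
30,*"

# Default read: the asterisks make both columns character.
raw <- read.csv(text = csv_text, check.names = FALSE)
sapply(raw, class)

# Treat "*" as missing at read time so the columns arrive as numbers.
DF <- read.csv(text = csv_text, check.names = FALSE,
               na.strings = c("NA", "*"))

# Shorten the names in code instead of editing the file.
names(DF) <- c("HIV_DIAG", "AIDS_DIAG")
sapply(DF, class)
colSums(is.na(DF))
```

On the original call: a quoted string inside aes() is treated as a constant value, not as a column name. Column names that contain spaces need backticks, for example aes(x = `Total Number of HIV Diagnoses`), or they can be renamed as above.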


I took a slightly different approach to that of FJCC in dealing with the names issue and renamed the whole dataset. Table_names gives a list of equivalencies.

All the variables we would expect to be numeric are coming in as character, except year. Luckily FJCC spotted the problem. I just converted all of those variables to numeric.

This code, using {data.table}, will produce the same plot that FJCC has produced, but I think you have more data quality problems. Look at the summaries and the DT[, table(race)] output.

It looks like you have some serious outliers in totalhiv & totalaids that almost certainly are typos, and I would question having two categories, Other/Unknown & Unknown, in race. Again, I suspect a data-entry error.

suppressMessages(library(data.table)); suppressMessages(library(tidyverse))

DT <- fread("hivny.csv")
newnames <- tolower(c("YEAR", "Borough", "NEIGHBOR", "SEX", "RACE", "TOTALHIV", "HIV10K", 
                      "TOTCON", "PROPCON", "TOTALAIDS", "AIDS10K"))
Table_names <- data.table(oldnames = names(DT), newnames =  newnames)

names(DT) <- newnames

DT[, totalhiv := as.numeric(totalhiv)]
DT[, hiv10k  := as.numeric(hiv10k)]
DT[, totcon  := as.numeric(totcon)] 
DT[, propcon := as.numeric(propcon)]  
DT[,  totalaids := as.numeric( totalaids)]
DT[,  aids10k := as.numeric( aids10k) ]


ggplot(DT, aes(x = totalhiv, y = totalaids)) + geom_point() + geom_line()

DT[ , summary(totalhiv)]
DT[ , summary(totalaids)]
DT[, table(race)]
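
The six separate conversions above could also be written as one step. This is a minimal sketch, assuming {data.table} is installed; the toy table and its values are invented, and only two of the renamed columns are shown.

```r
library(data.table)

# Toy table with invented values; "*" stands for a suppressed count.
DT <- data.table(year      = c(2010L, 2011L, 2012L),
                 totalhiv  = c("12", "*", "30"),
                 totalaids = c("5", "7", "*"))

# Columns that should be numeric.
num_cols <- c("totalhiv", "totalaids")

# Convert every listed column by reference; "*" becomes NA.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
DT[, (num_cols) := lapply(.SD, function(x) suppressWarnings(as.numeric(x))),
   .SDcols = num_cols]

# Count the missing values per converted column.
DT[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = num_cols]
```

Listing the columns once in num_cols keeps the conversion and any later checks in sync if more columns need the same treatment.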