Creating dataframe from nested list that includes dataframe within list column

realhiphop · February 11, 2019, 7:08pm

I have a list that I extracted from an API in JSON format. I was able to extract the data using jsonlite. The level of JSON data that I am trying to explore is a df that is made up of one of the columns titled player that has additional columns that are giving me issues:

Wildcards: df with 2 columns
Opponents: List
Icons: df with 4 columns

I've tried to unnest the players column and get the following error:

Error in bind_rows_(x, .id) : 
  Argument 2 can't be a list containing data frames

I subset my data as follows:

json <-
  url %>%
  fromJSON()

json_scrape <- json$body$rosters
JSON_list <- json_scrape$teams[1, ]

# Reproduced list from JSON Scrape.  Represents one element (team)\
JSON_list <- structure(list(
  short_name = "Kramerica", total_roster_salary = 22L,
  players = list(structure(list(
    wildcards = structure(list(
      contract = c("1", "1"),
      salary = c("1", "21")), class = "data.frame", row.names = c(NA, 2L)), 
      photo = c(
      "http://sports.cbsimg.net/images/baseball/mlb/players/170x170/1657581.png",
      "http://sports.cbsimg.net/images/baseball/mlb/players/170x170/1670417.png"),
    opponents = list(
      structure(list(
        abbrev = c("OAK", "OAK"),time = c(1553803620L,1553911620L),
        date = c("20190328","20190329")), class = "data.frame", row.names = c(NA, 2L)),
      structure(list(
        abbrev = c("TEX", "TEX", "TEX"), time = c(1553803500L, 1553990700L, 1554062700L), 
        date = c("20190328", "20190330", "20190331")), class = "data.frame", row.names = c(NA,3L))),
    icons = structure(list(
      hot = c(NA, 1L),
      cold = c(1L, NA),
      injury = c("Knee: Questionable for start of season",NA)), class = "data.frame", row.names = c(NA, 21L)), 
    percentstarted = c("48%", "97%"),
    profile_link = c(
      "<a class='playerLink' aria-label=' Jonathan Lucroy C LAA' href='http://baseball.cbssports.com/players/playerpage/1657581'>Jonathan Lucroy</a> <span class=\"playerPositionAndTeam\">C | LAA</span> ",
      "<a class='playerLink' aria-label=' Anthony Rizzo 1B CHC' href='http://baseball.cbssports.com/players/playerpage/1670417'>Anthony Rizzo</a> <span class=\"playerPositionAndTeam\">1B | CHC</span>"),
    id = c("1657581", "1670417"), 
    jersey = c("20", "44"),
    percentowned = c("61%", "99%"),
    pro_team = c("LAA", "CHC"), 
    eligible = c("C,U", "1B,U"),
    owned_by_team_id = c(12L, 12L), 
    profile_url = c(
      "http://baseball.cbssports.com/players/playerpage/1657581",
      "http://baseball.cbssports.com/players/playerpage/1670417"), 
    fullname = c("Jonathan Lucroy", "Anthony Rizzo"), 
    injury = c(NA, "Knee"), 
    return = c("Questionable for start of season", NA)), class = "data.frame", row.names = c(NA, 2L))),
  name = "Kramerica Enterprises", logo = "http://baseball.cbssports.com/images/team-logo/main-36x36.jpg",
  abbr = "KE", id = "12", active_roster_salary = 22L,
  warning = structure(list(description = NA_character_), row.names = 1L, class = "data.frame")
), row.names = 1L, class = "data.frame")

The df I'm looking for would have each team with it's own row. Here's a sample:

# Sample does not include all data frames.  Each row would be a player

tibble::tribble(
  ~short_name, ~total_roster_salary, ~contract, ~salary, ~abbrev, ~hot,    ~id2,
  "Kramerica",                   22,         1,       1,   "OAK",   NA, 1657581,
  "Kramerica",                    2,         1,      21,   "OAK",   NA, 1657581
  )

I'm looking to create a comprehensive df that binds all of the rows for every team from the initial df. I'm guessing that some purrr function could be used, but I'm not sure how to use it.

cderv · February 13, 2019, 7:25am

Are you just looking for a tibble with those 7 variables ? or more ? It is not clear to me.

Also, you have a format no so easy to parse here, because I think fromJSON made some transformation to data.frame for you. I may be (i am really not sure) easier if some data where not unbox.

A few more questions :

Where abbrev comes from ? it is having the value OAK but in the sample data there is also TEX ?
Where hot comes from ? in icons, it has not value NA NA but NA 1

For cleaning such data where, from the start, all the info does not fit into one row, you need to have a clear schema of what data you want, where it comes from in the source and how should it be imported (how many lines it will give when unnest)

When you have that, you can go step by step to extract what you want. purrr::pluck, purrr::pmap and dplyr can help manipulate this.

realhiphop · February 13, 2019, 3:01pm

I'd be looking for a df that includes the variables in the list outlined above. The documentation for the API call is at this website: http://developer.cbssports.com/documentation/api/files/rosters.

The nesting of the data is as follows:

rosters/team/team : this is the level I'm looking to explore. Each team is represented at this level.
I want some of the other columns at this level like:
rosters/teams/team/active_roster_salary

Then one level below there are nested lists at this level: /rosters/teams/team/players. Each element contains the columns that I'm looking for:
Some examples:
/rosters/teams/team/players/player/wildcards/contract
rosters/teams/team/players/player/percentstarted

Abbrev comes from this nested list:
/rosters/teams/team/players/player/opponents/opponent/abbrev

So each player at the /rosters/teams/team/players/player/ level would have different opponents. OAK matches id "1657581", and TEX matches id "1670417".

The icons hot and cold comes from here: /rosters/teams/team/players/player/icons/hot and /rosters/teams/team/players/player/icons/cold. They only get input if a player is defined as "Hot" or "Cold". The NA represent null values.

Hope that makes things a little clearer.

woodward · February 14, 2019, 1:30am

I'm not sure if this is what you're after, but I created these functions for turning objects into strings and back again. Maybe you could convert your problem columns to strings before applying bindrow.

# convert object to string
obj_to_str <- function(obj){
  paste(capture.output(dput(obj)), collapse=" ") # paste avoids line breaks
}

# convert string to object
str_to_obj <- function(str){
  eval(parse(text=str))
}

Or maybe just write a low level loop to extract and restructure the data you want.

realhiphop · March 4, 2019, 8:06pm

Can anyone help me extract the /rosters/teams/team/players/player/level of the data?

cderv · March 5, 2019, 6:56am

What is the url you make the request from exactly ?
When I tried I get missing league id.

From the documentation you linked I see that you can have an xml response. I could be easier to deal with xml using XPATH requesting syntax to extract what you want. It is a powerful syntax to find a node and get a content.

realhiphop · March 5, 2019, 11:28am

The URL takes a league specific access token. I.E. (http://api.cbssports.com/fantasy/league/rosters?version=3.0team_id=all&response_format=JSON&access_token=.......).

I extracted the JSON call out of that data and changed some of the info since it included personal information.

hrbrmstr · March 5, 2019, 10:00pm

It's amazing what you can find on the internet these days (gosh I hate GH)

https://github.com/search?q=api.cbssports.com+token&type=Code

Now, get some "live" data:

httr::GET(
  url = "http://api.cbssports.com/fantasy/league/rosters",
  query = list(
    version = "3.0",
    team_id = "all",
    response_format = "JSON",
    access_token = "U2FsdGVkX1_UOTKnzQ2JGvAyTUPuawlVQE9PoIjeo0aDg2RSJO3VH-J9zNT4uLFubhcDdHD8XXjPy-M04ZKlZLacVlVg7E-B05OCGJ0RgS9sg6WdnK_Me6_KZdaWia6pq_QO3wPCTq6dPRKvv2_AHw"
  )
) -> res

httr::warn_for_status(res) # some of the leaked access tokens no longer work

out <- httr::content(res, as = "text", encoding = "UTF-8") # what a terrible HTTP server/endpoint

out <- jsonlite::fromJSON(out) # JSON that was not meant for data science

At this point we have a similar data structure as your example at the top. HOWEVER you should place an emphasis on the word 'similar'. This result set is missing that "salary" and "contract" data you have. However, it's ugly enough that you can likely extrapolate.

We're doing this to reduce confusing $ use below.

teams <- out$body$rosters$teams
players <- teams$players

And, everybody needs some base R practice every now and again:

do.call(
  rbind.data.frame,
  mapply(function(team_df, players_df) {
    
    players_df[,c("firstname", "on_waivers", "photo", "eligible_for_offense_and_defense",
                  "position", "update_type", "roster_pos", "lastname", "eligible", "age", 
                  "is_locked", "elias_id", "ytd_points", "owned_by_team_id", "bats", 
                  "roster_status", "percentstarted", "profile_link", "id", "is_keeper", 
                  "pro_status", "on_waivers_until", "profile_url", "jersey", "fullname",
                  "percentowned", "headline", "pro_team", "throws", "starting-pitcher-today", 
                  "injury", "return")] -> pdf
    
    pdf[["projected_points"]] <- team_df[["projected_points"]]
    pdf[["long_abbr"]] <- team_df[["long_abbr"]]
    pdf[["lineup_status"]] <- team_df[["lineup_status"]]
    pdf[["short_name"]] <- team_df[["short_name"]]
    pdf[["division"]] <- team_df[["division"]]
    pdf[["name"]] <- team_df[["name"]]
    pdf[["logo"]] <- team_df[["logo"]]
    pdf[["abbr"]] <- team_df[["abbr"]]
    pdf[["point"]] <- team_df[["point"]]
    pdf[["id"]] <- team_df[["id"]]  
    
    pdf
    
  }, split(teams, 1:nrow(teams)), players, SIMPLIFY = FALSE, USE.NAMES = FALSE)
) %>% 
  dplyr::glimpse()

Which looks like:

## Observations: 592
## Variables: 41
## $ firstname                        <chr> "Wilson", "Mike", "Matt", "Jonathan", "Matt"…
## $ on_waivers                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ photo                            <chr> "http://sports.cbsimg.net/images/baseball/ml…
## $ eligible_for_offense_and_defense <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ position                         <chr> "C", "C", "1B", "2B", "1B", "SS", "SS", "2B"…
## $ update_type                      <chr> "normal", "normal", "normal", "normal", "nor…
## $ roster_pos                       <chr> "C", "C", "1B", "2B", "3B", "SS", "MI", "MI"…
## $ lastname                         <chr> "Ramos", "Zunino", "Olson", "Villar", "Carpe…
## $ eligible                         <chr> "C,U", "C,U", "1B,CI,U", "2B,MI,U", "1B,3B,C…
## $ age                              <int> 31, 27, 24, 27, 33, 25, 24, 30, 30, 24, 20, …
## $ is_locked                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ elias_id                         <chr> "RAM571096", "ZUN334886", "OLS682188", "VIL4…
## $ ytd_points                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ owned_by_team_id                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ bats                             <chr> "R", "R", "L", "S", "L", "R", "L", "R", "L",…
## $ roster_status                    <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A",…
## $ percentstarted                   <chr> "79%", "37%", "71%", "46%", "92%", "52%", "3…
## $ profile_link                     <chr> "<a class='playerLink' aria-label=' Wilson R…
## $ id                               <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1",…
## $ is_keeper                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pro_status                       <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A",…
## $ on_waivers_until                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ profile_url                      <chr> "http://rfbl2006.baseball.cbssports.com/play…
## $ jersey                           <chr> "40", "10", "28", "2", "13", "7", "5", "15",…
## $ fullname                         <chr> "Wilson Ramos", "Mike Zunino", "Matt Olson",…
## $ percentowned                     <chr> "89%", "52%", "87%", "59%", "95%", "69%", "5…
## $ headline                         <chr> "Mets' Wilson Ramos: Flashing power early", …
## $ pro_team                         <chr> "NYM", "TB", "OAK", "BAL", "STL", "CHW", "LA…
## $ throws                           <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R",…
## $ `starting-pitcher-today`         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ injury                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ return                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ projected_points                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ long_abbr                        <chr> "Coco", "Coco", "Coco", "Coco", "Coco", "Coc…
## $ lineup_status                    <chr> "ok", "ok", "ok", "ok", "ok", "ok", "ok", "o…
## $ short_name                       <chr> "#IKnow", "#IKnow", "#IKnow", "#IKnow", "#IK…
## $ division                         <chr> "", "", "", "", "", "", "", "", "", "", "", …
## $ name                             <chr> "#IKnowI'mBetterThanThis", "#IKnowI'mBetterT…
## $ logo                             <chr> "http://rfbl2006.baseball.cbssports.com/imag…
## $ abbr                             <chr> "CFY", "CFY", "CFY", "CFY", "CFY", "CFY", "C…
## $ point                            <chr> "20190320", "20190320", "20190320", "2019032…

So, the general idea is to separate the ugly fields from the non-ugly ones and bind them all together separately.

You can change this up to make it a bit more dynamic by getting the gnarlier names at each distinct step programmatically (e.g. sapply(teams, is.atomic)) , using base R set functions to partition them out and then have the rbinding be a bit more dynamic.

realhiphop · March 5, 2019, 11:29pm

This works. Thanks!

I’m going to try to play around with this to see if I can come up with a 'tidyverse' solution.

realhiphop · March 6, 2019, 4:16pm

Using @hrbrmstr's example as a template. Here's what I came up with. I sought to create my own tidyverse solution. Open to suggestions for improving the code. I realize there's a lot of room for improvement!

@hrbrmstr Code

httr::GET(
 url = "http://api.cbssports.com/fantasy/league/rosters",
 query = list(
   version = "3.0",
   team_id = "all",
   response_format = "JSON",
   access_token = "U2FsdGVkX1_UOTKnzQ2JGvAyTUPuawlVQE9PoIjeo0aDg2RSJO3VH-J9zNT4uLFubhcDdHD8XXjPy-M04ZKlZLacVlVg7E-B05OCGJ0RgS9sg6WdnK_Me6_KZdaWia6pq_QO3wPCTq6dPRKvv2_AHw"
 )
) -> res

httr::warn_for_status(res) # some of the leaked access tokens no longer work

out <- httr::content(res, as = "text", encoding = "UTF-8") # what a terrible HTTP server/endpoint

out <- jsonlite::fromJSON(out) # JSON that was not meant for data science

teams_raw <- out$body$rosters$teams
players_raw <- teams_raw$players

My changes:

extract_player_data <- function(players_df){
  
  # Columns from Players
  temp_table <- players_df[, c(
    "firstname", "on_waivers", "photo", "eligible_for_offense_and_defense",
    "position", "update_type", "roster_pos", "lastname", "eligible", "age", 
    "is_locked", "elias_id", "ytd_points", "owned_by_team_id", "bats", 
    "roster_status", "percentstarted", "profile_link", "id", "is_keeper", 
    "pro_status", "on_waivers_until", "profile_url", "jersey", "fullname",
    "percentowned", "headline", "pro_team", "throws", "starting-pitcher-today", 
    "injury", "return")]
  
  # Extract Nested Icons and Create Tibble
  headline <- players_df$icons$headline
  hot <- players_df$icons$hot
  cold <- players_df$icons$cold
  injury <- players_df$icons$injury
  icons <- tibble(headline, hot, cold, injury)
  
  tbl <-  bind_cols(temp_table, icons)

}

# Rename Team ID Column for Join
teams <- teams_raw %>% 
  dplyr::rename(owned_by_team_id = id) %>% 
  dplyr::mutate(owned_by_team_id = as.numeric(owned_by_team_id))

rosters <- purrr::map_dfr(players_raw, extract_player_data) %>% bind_rows()

full_rosters <- left_join(teams, rosters, by ="owned_by_team_id" ) %>% 
  dplyr::select(-players)

system · March 13, 2019, 4:16pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.