unnest_(wider/longer) with partially missing fields

As of the most recent releases of tidyselect/vctrs (1.0.0/0.2.2), reading an XML document with partially missing fields depends on node order: if the nodes lacking the fields appear first, the data is read successfully, but if they appear after nodes that have the fields, the code fails with an error. (This was not an issue in prior versions.)

In my specific case, all fields are necessary and will be at least partially populated. Is there any advice for reading in such data regardless of where the nodes with missing fields appear?

library(dplyr, warn.conflicts = FALSE)
# Also uses tidyr, xml2

# In this order, it successfully reads the data:
xmlstring_ok <- 
'<?xml version="1.0" encoding="utf-8"?>
<Study StudyName="Sandbox" StudyAlias="Sahara">
  <Procedure>
    <Series>
      <InternalSeriesID></InternalSeriesID>
      <WFStep>Not Done</WFStep>
    </Series>
  </Procedure>
  <Procedure>
    <Series>
      <InternalSeriesID>104646</InternalSeriesID>
      <Technician>Tech_Test, Tech Test</Technician>
      <Equipment>
        <EquipmentSerial>123</EquipmentSerial>
        <EquipmentType>Fundus Camera</EquipmentType>
        <EquipmentModel>50DX</EquipmentModel>
        <EquipmentManufacturer>Topcon Corporation</EquipmentManufacturer>
      </Equipment>
      <StudyDate>2019-09-09</StudyDate>
      <WFStep>Verify</WFStep>
    </Series>
  </Procedure>
</Study>'

indat <- xml2::as_list(xml2::read_xml(xmlstring_ok))

work <- tidyr::tibble(inv = indat)


out <- work %>% 
  tidyr::unnest_longer(inv, indices_include = FALSE) %>%
  tidyr::unnest_wider(inv) %>%
  tidyr::unnest_wider(Series) %>%
  tidyr::unnest_wider(Equipment) %>%
  dplyr::select(-...1) %>%
  apply(MARGIN = c(1,2), FUN = unlist) %>%
  as.data.frame(stringsAsFactors = FALSE)
#> New names:
#> * `` -> ...1

out
#>     WFStep InternalSeriesID           Technician EquipmentSerial EquipmentType
#> 1 Not Done             <NA>                 <NA>            <NA>          <NA>
#> 2   Verify           104646 Tech_Test, Tech Test             123 Fundus Camera
#>   EquipmentModel EquipmentManufacturer  StudyDate
#> 1           <NA>                  <NA>       <NA>
#> 2           50DX    Topcon Corporation 2019-09-09


# In this order, it breaks:
xmlstring_bad <- 
'<?xml version="1.0" encoding="utf-8"?>
<Study StudyName="Sandbox" StudyAlias="Sahara">
  <Procedure>
    <Series>
      <InternalSeriesID>104646</InternalSeriesID>
      <Technician>Tech_Test, Tech Test</Technician>
      <Equipment>
        <EquipmentSerial>123</EquipmentSerial>
        <EquipmentType>Fundus Camera</EquipmentType>
        <EquipmentModel>50DX</EquipmentModel>
        <EquipmentManufacturer>Topcon Corporation</EquipmentManufacturer>
      </Equipment>
      <StudyDate>2019-09-09</StudyDate>
      <WFStep>Verify</WFStep>
    </Series>
  </Procedure>
  <Procedure>
    <Series>
      <InternalSeriesID></InternalSeriesID>
      <WFStep>Not Done</WFStep>
    </Series>
  </Procedure>
</Study>'



indat <- xml2::as_list(xml2::read_xml(xmlstring_bad))

work <- tidyr::tibble(inv = indat)


out <- work %>% 
  tidyr::unnest_longer(inv, indices_include = FALSE) %>%
  tidyr::unnest_wider(inv) %>%
  tidyr::unnest_wider(Series) %>%
  tidyr::unnest_wider(Equipment)
#> New names:
#> * `` -> ...1
#> Error: Can't cast `Equipment$...1` <logical> to `Equipment$...1` <vctrs_unspecified>.

# dplyr::select(-...1) %>%
# apply(MARGIN = c(1,2), FUN = unlist) %>%
# as.data.frame(stringsAsFactors = FALSE)

Created on 2020-01-28 by the reprex package (v0.3.0)
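
One order-independent workaround, sketched here on the assumption that the field names are known in advance, is to skip `as_list()` and the `unnest_*()` steps entirely and query each field with xml2's XPath functions. `read_series()` and the `fields` vector are hypothetical names made up for this illustration, not part of xml2 or tidyr. The sketch relies on `xml_find_first()` returning a missing node when a path is absent, which `xml_text()` then reports as `NA`. It does not explain or fix the `unnest_wider()` error itself.

```r
library(xml2)

# Same "bad" ordering as above (complete node first), shortened for brevity.
xml_in <- '<?xml version="1.0" encoding="utf-8"?>
<Study StudyName="Sandbox" StudyAlias="Sahara">
  <Procedure>
    <Series>
      <InternalSeriesID>104646</InternalSeriesID>
      <Technician>Tech_Test, Tech Test</Technician>
      <Equipment>
        <EquipmentSerial>123</EquipmentSerial>
        <EquipmentType>Fundus Camera</EquipmentType>
      </Equipment>
      <StudyDate>2019-09-09</StudyDate>
      <WFStep>Verify</WFStep>
    </Series>
  </Procedure>
  <Procedure>
    <Series>
      <InternalSeriesID></InternalSeriesID>
      <WFStep>Not Done</WFStep>
    </Series>
  </Procedure>
</Study>'

# Assumed, hand-maintained mapping: column name -> XPath relative to <Series>.
fields <- c(
  InternalSeriesID = "InternalSeriesID",
  Technician       = "Technician",
  EquipmentSerial  = "Equipment/EquipmentSerial",
  EquipmentType    = "Equipment/EquipmentType",
  StudyDate        = "StudyDate",
  WFStep           = "WFStep"
)

# Hypothetical helper: one row per <Series>, one column per entry in `fields`.
read_series <- function(xml_string, fields) {
  series <- xml_find_all(read_xml(xml_string), ".//Series")
  cols <- lapply(fields, function(path) {
    # One value per <Series>; absent paths come back as NA.
    value <- xml_text(xml_find_first(series, path))
    # Treat present-but-empty elements as missing too.
    value[value %in% ""] <- NA_character_
    value
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

out <- read_series(xml_in, fields)
out
```

Because every column is looked up per `<Series>` node, the result should not depend on whether the sparse node comes first or last. The trade-off is that the field list has to be maintained by hand.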
