XML complex parse with xml2 and xmltools

xml2
tidyverse
xml

#1

Hi everyone,

I'm facing a complex XML file, and have found some awesome resources and examples from, @richardo , @cderv , @danton , to get started, but am running into difficulties with returning multiple node attributes, from various levels, and subsequent child elements. In addition, I need the output in a tidy data frame format.

My code below builds off of @richardo's post, Generate a data frame from many xml files , and from there, I'm trying to incorporate @danton's xmltools package for element values.

Please see the attached screenshot of the output I'm trying to get.

I've included a reprex, please see below, and am looking for the following,

  # need attributes from
  # /Package/PackageBody/InvestmentVehicle
  # /Package/PackageBody/InvestmentVehicle/FundShareClass
  # /Package/PackageBody/InvestmentVehicle/PortfolioList/Portfolio
  # /Package/PackageBody/InvestmentVehicle/PortfolioList/Portfolio/Holding/HoldingDetail

  # need text/values from
  # /Package/PackageBody/InvestmentVehicle/FundShareClass/Operation/ShareClassBasics/Name
  # /Package/PackageBody/InvestmentVehicle/FundShareClass/Operation/ShareClassBasics/LegalType
  # /Package/PackageBody/InvestmentVehicle/PortfolioList/Portfolio/Holding/HoldingDetail/Symbol
  # including all through
  # /Package/PackageBody/InvestmentVehicle/PortfolioList/Portfolio/Holding/HoldingDetail/FirstBoughtDate

  # do not need
  # /Package/PackageBody/InvestmentVehicle/PortfolioList/Portfolio/PortfolioSummary
  # /Package/PackageBody/InvestmentVehicle/PortfolioList/Portfolio/PortfolioBreakdown

Essentially, I'm not able to return the element values for the HoldingDetails, while also repeating the InvestmentVehicle, FundShareClass, and Portfolio attributes for respective HoldingDetail rows. I also need the InvestmentVehicle child elements of Name and LegalType to repeat for each respective HoldingDetail row.

    library(xmltools)
    library(xml2)
    library(dplyr)
    library(tidyr)
    library(purrr)
    library(xmltools)
    install.packages("reprex")

    ###====XML====###

    text <- '<Package>
	<PackageHeader>
<PackageName>AllHoldings</PackageName>
<Universe>FO</Universe>
<AsOfDate>2018-11-09</AsOfDate>
<Version>2.5</Version>
</PackageHeader>
<PackageBody>
<InvestmentVehicle _Id="F00000ANDY">
<FundShareClass _Id="F00000ANDY">
<Operation>
<ShareClassBasics>
<Name>Andy Heritage Fund A</Name>
<LegalType _Id="FO">FO</LegalType>
</ShareClassBasics>
</Operation>
</FundShareClass>
<PortfolioList>
<Portfolio xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" _CurrencyId="EUR" _ExternalId="ANDY01111111">
<PortfolioSummary>
<Date>2015-01-31</Date>
<HoldingAggregate _SalePosition="L">
<NumberOfHolding>4</NumberOfHolding>
<NumberOfStockHolding>2</NumberOfStockHolding>
<NumberOfBondHolding>0</NumberOfBondHolding>
<TotalMarketValue>1000000</TotalMarketValue>
</HoldingAggregate>
<HoldingAggregate _SalePosition="S">
<NumberOfHolding>1</NumberOfHolding>
<NumberOfStockHolding>0</NumberOfStockHolding>
<NumberOfBondHolding>0</NumberOfBondHolding>
<TotalMarketValue>10</TotalMarketValue>
</HoldingAggregate>
</PortfolioSummary>
<PortfolioBreakdown _SalePosition="L">
<AssetAllocation>
<BreakdownValue Type="1">50.00</BreakdownValue>
<BreakdownValue Type="2">45.00</BreakdownValue>
<BreakdownValue Type="3">0.00</BreakdownValue>
<BreakdownValue Type="4">0.00</BreakdownValue>
<BreakdownValue Type="5">0.00</BreakdownValue>
<BreakdownValue Type="6">0.00</BreakdownValue>
<BreakdownValue Type="7">5.00</BreakdownValue>
<BreakdownValue Type="8">0.00</BreakdownValue>
</AssetAllocation>
</PortfolioBreakdown>
<Holding>
<HoldingDetail _Id="XYZ123456">
<Symbol>ANDY</Symbol>
<Country _Id="USA">United States</Country>
<Currency _Id="USD">US Dollar</Currency>
<SecurityName>ANDY Co</SecurityName>
<LegalType>E</LegalType>
<Weighting>7.89</Weighting>
<NumberOfShare>1000</NumberOfShare>
<ShareChange>0</ShareChange>
<MarketValue>500000</MarketValue>
<Sector>10</Sector>
<HoldingYTDReturn>10</HoldingYTDReturn>
<Region>1</Region>
<StyleBox>1</StyleBox>
<FirstBoughtDate>2010-01-31</FirstBoughtDate>
</HoldingDetail>
<HoldingDetail _Id="NMO123456">
<Symbol>YDNA</Symbol>
<Country _Id="CHE">Switzerland</Country>
<Currency _Id="CHF">Swiss Franc</Currency>
<SecurityName>Andy Shares</SecurityName>
<LegalType>E</LegalType>
<Weighting>1.23</Weighting>
<NumberOfShare>10000</NumberOfShare>
<ShareChange>0</ShareChange>
<MarketValue>50000</MarketValue>
<Sector>1</Sector>
<Region>9</Region>
<StyleBox>4</StyleBox>
<FirstBoughtDate>2010-10-31</FirstBoughtDate>
</HoldingDetail>
<HoldingDetail>
<SecurityName>Us Dollar Spot</SecurityName>
<LegalType>CH</LegalType>
<Weighting>0.5</Weighting>
<NumberOfShare>100</NumberOfShare>
<ShareChange>10</ShareChange>
<MarketValue>100000</MarketValue>
</HoldingDetail>
<HoldingDetail>
<SecurityName>British Pound Spot</SecurityName>
<LegalType>CH</LegalType>
<Weighting>0.75</Weighting>
<NumberOfShare>200</NumberOfShare>
<ShareChange>20</ShareChange>
<MarketValue>50000</MarketValue>
</HoldingDetail>
</Holding>
</Portfolio>
</PortfolioList>
</InvestmentVehicle>
<InvestmentVehicle _Id="F00000YDNA">
<FundShareClass _Id="F00000YDNA">
<Operation>
<ShareClassBasics>
<Name>Andy Heritage Fund B</Name>
<LegalType _Id="FO">FO</LegalType>
</ShareClassBasics>
</Operation>
</FundShareClass>
<PortfolioList>
<Portfolio xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" _CurrencyId="EUR" _ExternalId="ANDY02222222">
<PortfolioSummary>
<Date>2015-06-30</Date>
<HoldingAggregate _SalePosition="L">
<NumberOfHolding>2</NumberOfHolding>
<NumberOfStockHolding>10</NumberOfStockHolding>
<NumberOfBondHolding>0</NumberOfBondHolding>
<TotalMarketValue>700000</TotalMarketValue>
</HoldingAggregate>
<HoldingAggregate _SalePosition="S">
<NumberOfHolding>1</NumberOfHolding>
<NumberOfStockHolding>0</NumberOfStockHolding>
<NumberOfBondHolding>0</NumberOfBondHolding>
<TotalMarketValue>100</TotalMarketValue>
</HoldingAggregate>
</PortfolioSummary>
<PortfolioBreakdown _SalePosition="L">
<AssetAllocation>
<BreakdownValue Type="1">53.00</BreakdownValue>
<BreakdownValue Type="2">40.00</BreakdownValue>
<BreakdownValue Type="3">0.00</BreakdownValue>
<BreakdownValue Type="4">0.00</BreakdownValue>
<BreakdownValue Type="5">0.00</BreakdownValue>
<BreakdownValue Type="6">0.00</BreakdownValue>
<BreakdownValue Type="7">7.00</BreakdownValue>
<BreakdownValue Type="8">0.00</BreakdownValue>
</AssetAllocation>
</PortfolioBreakdown>
<Holding>
<HoldingDetail _Id="DEF0123456">
<Symbol>DEF</Symbol>
<Country _Id="USA">United States</Country>
<Currency _Id="USD">US Dollar</Currency>
<SecurityName>DEF Inc</SecurityName>
<LegalType>E</LegalType>
<Weighting>12.0</Weighting>
<NumberOfShare>50000</NumberOfShare>
<ShareChange>0</ShareChange>
<MarketValue>1000000</MarketValue>
<Sector>10</Sector>
<HoldingYTDReturn>25.00</HoldingYTDReturn>
<Region>7</Region>
<StyleBox>3</StyleBox>
<FirstBoughtDate>2015-09-30</FirstBoughtDate>
</HoldingDetail>
<HoldingDetail _Id="GHI0123456">
<Symbol>GFRT</Symbol>
<Country _Id="CHE">Switzerland</Country>
<Currency _Id="CHF">Swiss Franc</Currency>
<SecurityName>GFRT Shares</SecurityName>
<LegalType>E</LegalType>
<Weighting>5.87</Weighting>
<NumberOfShare>100</NumberOfShare>
<ShareChange>0</ShareChange>
<MarketValue>900000</MarketValue>
<Sector>3</Sector>
<Region>2</Region>
<StyleBox>1</StyleBox>
<FirstBoughtDate>2015-12-31</FirstBoughtDate>
</HoldingDetail>
</Holding>
</Portfolio>
</PortfolioList>
</InvestmentVehicle>
</PackageBody>
</Package>'
    ###====XML====###

    ###=====================================progress===============================###
    # xml2
    docReprex <- text %>%
      xml2::read_xml()
    df <- xml2::xml_find_all(docReprex, "//InvestmentVehicle") %>%
      map_dfr(~ {
        
        parent <- xml_attrs(.) %>% enframe() %>% spread(name, value)
        
        kids <- xml_children(.) %>% map_dfr(~as.list(xml_attrs(.)))
        
        grandkids <- xml_children(xml_children(.)) %>% map_dfr(~as.list(xml_attrs(.)))
        
        greatgrandkids <- xml_children(xml_children(xml_children(.))) %>% map_dfr(~as.list(xml_attrs(.)))
        
        cbind.data.frame(parent, kids, grandkids, greatgrandkids) %>% set_tidy_names() %>% as_tibble()
      })

    # xmltools
    # get all xpaths to parents of parent node
    termParentReprex <- docReprex %>% 
      xml_get_paths(only_terminal_parent = TRUE)
    termNodeReprex <- docReprex %>%
      xml_get_paths()

    termXPathsReprex <- termParentReprex %>%
      unlist() %>%
      unique()
    dfReprex1 <- lapply(termXPathsReprex, xml_to_df, file = text, is_xml = FALSE, dig = FALSE) %>%
      dplyr::bind_cols()
    #> Error in setnames(x, value): Can't assign 1 names to a 2 column data.table
    ###=====================================progress===============================###

Many thanks for your time, and feel free to reach out if I can send along additional details.

Andrew


closed #2

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.