Loading an EBCDIC/Packed Decimal Dataset in R



National Archives HAMLA Data Load

The Problem

I am trying to open data from the National Archives that measures control of
strategic hamlets during the Vietnam War. The problem is that the data is saved
as an EBCDIC data file and I am struggling to open and format it in R. The
National Archives sells a delimited version of these data, but unfortunately the
Archives are currently closed due to the government shutdown and aren't able to
process data requests. While I am comfortable using comma- or tab-delimited data,
I have no experience working with byte/packed decimal data. I would like to be
able to open the data and export it as a csv file. Currently I have two problems:

  1. The necessary IBM037 encoding doesn't seem to be available in R
  2. I'm not sure how to read byte/packed decimal data in R (and how to convert it to a csv file)

I have included the links to several pdf files
(hosted by the National Archives) that include technical details. The first link
describes the technical details for the dataset of interest (HAMLA, 1967) and the second
provides information on the field locations of the data. Technical information can
be found on page 54 of the pdf (found in the second link). Additionally, a printout of
the header labels can be found at the very end of the pdf.

The first step that I took was to read the raw file to see what the encoded data
looks like using read_file_raw from the readr package.

# URL: https://catalog.archives.gov/search?q=*:*&f.parentNaId=4616225&f.level=fileUnit&sort=naIdSort%20asc&f.fileFormat=(application%2Fpdf%20or%20text%2Fplain)&tabType=online

# Page 29 of hes-technical-documentaton-1967-1974.pdf lays out the HAMLA Data (1967 - 1969)

#---- Packages

library(tidyverse)  # loads readr, dplyr, stringr, and tibble, all used below

#----- Reads from URL 

hes_url_ext <- "https://catalog.archives.gov/OpaAPI/media/4658138/content/arcmedia/electronic-records/rg-330/HES/RG330.HES.HAMLA67?download=true"

raw_hes_dta <- read_file_raw(hes_url_ext)

options(max.print = 100)

Interestingly, the function guess_encoding thinks that the encoding is either Big5 or EUC-KR. However, based on several articles that I've read and an online byte editor, I am more confident that the data is actually
IBM037 (also referred to as cp037 and ebcdic-cp-us).


After this, I haven't been able to make much progress. As far as I can tell, the IBM037 encoding is not available in R. None of the aforementioned encoding names appear in the output of iconvlist.

# Keep any encoding name containing "037" (this also covers CP037 and IBM037)
avail_encodings <- tibble(encoding = iconvlist()) %>%
  filter(str_detect(encoding, "037"))
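
If iconv on a given machine really does lack IBM037, one workaround is to decode the text fields by hand from the raw bytes. The following is only a minimal sketch: ebcdic_table and decode_ebcdic are hypothetical names, and the table covers just the letters, digits, space, and a few punctuation marks of code page 037, so any other byte comes back as "?".

```r
# Partial EBCDIC (code page 037) lookup table. Index = byte value + 1.
# Only letters, digits, space and four punctuation marks are mapped;
# every other byte decodes to "?" (an assumption made for this sketch).
ebcdic_table <- rep("?", 256)
ebcdic_table[0x40 + 1] <- " "
ebcdic_table[0x4B + 1] <- "."
ebcdic_table[0x6B + 1] <- ","
ebcdic_table[0x60 + 1] <- "-"
ebcdic_table[0x61 + 1] <- "/"
ebcdic_table[0xC1:0xC9 + 1] <- LETTERS[1:9]      # A-I
ebcdic_table[0xD1:0xD9 + 1] <- LETTERS[10:18]    # J-R
ebcdic_table[0xE2:0xE9 + 1] <- LETTERS[19:26]    # S-Z
ebcdic_table[0x81:0x89 + 1] <- letters[1:9]      # a-i
ebcdic_table[0x91:0x99 + 1] <- letters[10:18]    # j-r
ebcdic_table[0xA2:0xA9 + 1] <- letters[19:26]    # s-z
ebcdic_table[0xF0:0xF9 + 1] <- as.character(0:9) # 0-9

# Decode a raw vector into a single character string.
decode_ebcdic <- function(bytes) {
  paste(ebcdic_table[as.integer(bytes) + 1], collapse = "")
}

# Hand-built test bytes, not taken from the HAMLA file
decode_ebcdic(as.raw(c(0xC8, 0xC1, 0xD4, 0xD3, 0xC1, 0x40, 0xF6, 0xF7)))
# → "HAMLA 67"
```

A table like this only helps with the character fields; the packed decimal fields are not text in any encoding and have to be unpacked separately.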


I have tried unsuccessfully to use read_fwf and readBin. I thought that it might be possible to use read_fwf with something similar to the following script:

#------ Reads URL Data With read_fwf

# There are 48 columns 

hes_dta_test <- read_fwf(
  file = hes_url_ext,
  col_positions = fwf_widths(
    widths = c(1, 2, 2, 2, 2, 4, 1, 1, 16, 14, 5, 3, 8, 1, 2, 3, 2, 3, 3, 2, 3, 1, 3, 1, 4,
               4, 4, 4, 4, 4, 2, 5, 1, 1, 3, 1, 1, 1, 1, 1, 19, 1, 8, 8, 8, 8, 8, 8),
    # Only 33 of the 48 column names are listed here; the list is incomplete,
    # so the names do not yet line up with the 48 widths above.
    col_names = c("CHAM", "PHAM", "DHAM", "VHAM", "HHAM", "DATE", "RECTP", "VALID", "NAME",
                  "XNAME", "HPOP", "NUMB", "POINT", "NPA", "CNTRLG", "HTYPE", "CNTL7",
                  "PLAC", "SECU", "ADPL", "HEW", "ECDV", "SCSTA", "PROB_1", "URBAN",
                  "PROB_3", "ELECT", "PROB_5", "PROB_6", "PROB_7", "PROB_8", "VISIT", "XPROB")
  ),
  locale = locale(encoding = "IBM037")
)
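
Since read_fwf treats its input as text, packed decimal fields would presumably have to be unpacked from the raw bytes instead. Here is a minimal sketch of that step, assuming the usual IBM packed decimal layout (two decimal digits per byte, with the final half-byte holding the sign: 0xB or 0xD for negative, 0xA, 0xC, 0xE or 0xF for positive). Whether the HAMLA fields follow exactly this layout is an assumption to check against the technical documentation; unpack_decimal and its scale argument are hypothetical names.

```r
# Unpack one packed decimal field held in a raw vector.
# `scale` is the number of implied decimal places (hypothetical argument).
# Returns NA if a digit nibble is not 0-9 or the sign nibble is not 0xA-0xF.
unpack_decimal <- function(bytes, scale = 0) {
  b  <- as.integer(bytes)
  hi <- bitwShiftR(b, 4L)              # high nibble of each byte
  lo <- bitwAnd(b, 15L)                # low nibble of each byte
  nibbles <- as.vector(rbind(hi, lo))  # interleave: hi1, lo1, hi2, lo2, ...
  sign_nibble <- nibbles[length(nibbles)]
  digits <- nibbles[-length(nibbles)]
  if (any(digits > 9L) || sign_nibble < 10L) return(NA_real_)
  value <- sum(digits * 10^rev(seq_along(digits) - 1))
  if (sign_nibble %in% c(11L, 13L)) value <- -value  # 0xB or 0xD = negative
  value / 10^scale
}

# Hand-built test bytes, not taken from the HAMLA file
unpack_decimal(as.raw(c(0x12, 0x34, 0x5C)))             # 12345
unpack_decimal(as.raw(c(0x01, 0x2D)))                   # -12
unpack_decimal(as.raw(c(0x12, 0x34, 0x5F)), scale = 2)  # 123.45
```

To apply this to the file, each record would first need to be cut into fields by byte offset, using the field locations in the documentation rather than character widths.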


iconvlist depends on the system's iconv; the terminal command

iconv --list

will give you the available encodings. gcc seems to like the form CCIBMxxxx, but I didn't see your 37. IBM has a Java program that reads the format in connection with its CICS Transaction Gateway product, but I have no idea if it can be hacked to your purpose. vedit.com has a commercial product that claims to be able to read and convert that encoding, but I wasn't able to find anything else.
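
Because the listing can miss aliases, it may also be worth asking iconv directly whether it accepts a given name. A rough sketch, where iconv_supports is a hypothetical helper and the candidate spellings are guesses rather than names known to exist on any particular system:

```r
# Return TRUE if this system's iconv accepts `from` as a source encoding.
# R's iconv() raises an error for an unsupported conversion, which is
# caught here and reported as FALSE.
iconv_supports <- function(from, to = "UTF-8") {
  probe <- list(as.raw(0x40))  # a single byte, passed as a raw vector
  tryCatch({
    iconv(probe, from = from, to = to)
    TRUE
  }, error = function(e) FALSE)
}

# Candidate spellings (guesses; results vary by platform)
candidates <- c("IBM037", "CP037", "IBM-037", "EBCDIC-CP-US", "EBCDIC-US")
vapply(candidates, iconv_supports, logical(1))
```

If any of these comes back TRUE, the same name can be passed as the from argument of iconv with a list of raw vectors as input.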


Thanks for sharing the helpful command! I'll check out vedit to see if it might work as a potential solution.