General Help with Text Extraction


Introduction

Apologies if this has already been covered ad nauseam, but I haven’t been able to find an example matching my needs. Here’s an overview of what I’m trying to accomplish:

I have text files (albeit with non-.txt extensions) generated as exports from proprietary software. I would like to parse them into a tibble for management and analysis, then write them back out in the native format so I can upload any changes. These files have a consistent structure, similar to that of JSON/XML/HTML; ideally, they could be harvested the same way one would scrape a website, but I have a feeling that’s too ambitious for my current needs.

RegEx has only gotten me so far, and I have a feeling there’s a better, more efficient way to do this. Can anyone help identify a method or strategy? Examples below:

CSV/JSON-Like Document:

There are two ‘components’ in the following sample text that exemplify the entire document:

[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Foo")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Foo Document for Production")
            (Name,"Foo Document")
            ])
      ])
]
[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Bar")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Bar Document for Production")
            (Name,"Bar Document")
            ])
      ])
]

When considered as a Template, I’m looking for values after almost every comma:

[Variable1,Value1
(SubSection1,
      [Variable2,Value2
      (Variable3,Value3)
      (SubSection2,
            [Variable4,Value4
            (Variable5,"Value5")
            (Variable6,"Value6")
            (Variable7,"Value7")
            ])
      (SubSection3,
            [Variable8,Value8
            (Variable9,"Value9")
            (Variable10,"Value10")
            ])
      ])
]

Desired Output for the CSV/JSON-Like Document:

ProcedureOfOrigin  ComponentReference  CatalogDocument  Name  ScopeOfFunction  DocumentType
Export             Add                 Find             Foo   1                0
Export             Add                 Find             Bar   1                0
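
One possible way to get from the sample to that table, sketched in Python purely for illustration (the resulting rows could just as well be assembled into a tibble in R). The names `split_records`, `parse_record`, and `COLUMNS` are made up for this sketch. It assumes that every top-level `[...]` block is one record, that quoted values contain no embedded double quotes, and that when a key repeats inside a record (such as `Name`) the first occurrence is the one wanted.

```python
import re

SAMPLE = """
[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Foo")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Foo Document for Production")
            (Name,"Foo Document")
            ])
      ])
]
[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Bar")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Bar Document for Production")
            (Name,"Bar Document")
            ])
      ])
]
"""

# Matches "[Key,Value" and "(Key,Value" pairs. Section openers such as
# "(CatalogType," are followed by a line break, so they do not match.
PAIR = re.compile(r'[\[(](\w+),("[^"]*"|[^\s()\[\]]+)')

def split_records(text):
    """Yield each top-level [...] block by tracking bracket depth."""
    depth, start, in_quotes = 0, None, False
    for i, ch in enumerate(text):
        if ch == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            continue  # ignore brackets that appear inside quoted values
        elif ch == "[":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

def parse_record(block):
    """Flatten one record into a dict, keeping the first value seen per key."""
    row = {}
    for key, value in PAIR.findall(block):
        row.setdefault(key, value.strip('"'))
    return row

# Columns taken from the desired output above
COLUMNS = ["ProcedureOfOrigin", "ComponentReference", "CatalogDocument",
           "Name", "ScopeOfFunction", "DocumentType"]

records = [parse_record(block) for block in split_records(SAMPLE)]
rows = [[record[column] for column in COLUMNS] for record in records]
for row in rows:
    print(row)
```

Flattening discards the nesting, so writing changes back to the native format would need either the original text as a template or a parser that keeps the tree structure.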

XML/HTML-Like Document:

The other exported file has a template like the following:

Section1.0:
	SubSection1.1:  Value1;;
	SubSection1.2:  Value2;;
	SubSection1.3:  Value3;;
	SubSection1.4:  Value4;;
	SubSection1.5:  Value5;;
	SubSection1.6:  Value6;;
	SubSection1.7:  Value7;;
	SubSection1.8:  YYYY-MM-DD;;
	SubSection1.9:  Value9;;

Section2.0:
	SubSection2.1: This can be a very large block of text with /* Comments in between */ 
	;;
	SubSection2.2: Same thing for this and the rest of the following sections. 
	;;
	SubSection2.3: /****** Comments can sometimes take this form *****/
	;;
	SubSection2.4: And so-on.
	;;
Section3.0:
	SubSection3.1: Usually a two-word-phrase;;
	SubSection3.2:
	/* These comments can be OBNOXIOUS
	and be multi-
	line
	With any characters in them
	Less important for me to have in general */
	;;
	Section4.0:  Integer
	;;
	Section5.0:  Block of Text
	;;
	Section6.0: If-Then Statements + Conclusions.
	;;
	Section7.0:
	;;
Section8.0: Integer;;
Section9.0: End-of-Document

Desired Output for the XML/HTML-Like Document:

Each Section would be its own tibble (think Normalized Relational Database).

SubSection1.1  SubSection1.2  SubSection1.3  SubSection1.4
Value1         Value2         Value3         Value4
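
A rough sketch of how this second format might be split into one key/value table per section, again in Python for illustration only. The helper name `parse_sections` is hypothetical. The sketch assumes that a section header is an unindented line holding nothing but `Name:`, that every entry runs from `Key:` to the next `;;`, that every entry sits under some header, and that `/* ... */` comments can simply be dropped.

```python
import re

SAMPLE = """\
Section1.0:
    SubSection1.1:  Value1;;
    SubSection1.2:  Value2;;

Section2.0:
    SubSection2.1: A large block of text with /* a comment */ in between
    ;;
    SubSection2.2:
    /* A multi-
    line comment */
    ;;
"""

COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)  # /* ... */, possibly multi-line
HEADER = re.compile(r"([\w.]+):\s*$")          # unindented "Name:" with no value
ENTRY = re.compile(r"\s*([\w.]+):(.*)")        # "Key: first line of the value"

def parse_sections(text):
    """Return {section name: {key: value}} for the ';;'-terminated format."""
    text = COMMENT.sub("", text)
    sections, current, key, buffer = {}, None, None, []
    for line in text.splitlines():
        if key is None:
            header = HEADER.match(line)
            if header:
                # Start a new section; later entries are stored under it
                current = sections.setdefault(header.group(1), {})
                continue
            entry = ENTRY.match(line)
            if not entry:
                continue  # blank line between entries
            key, line = entry.group(1), entry.group(2)
        finished = ";;" in line
        buffer.append(line.split(";;")[0])
        if finished:
            # Collapse line breaks and repeated spaces into single spaces
            current[key] = " ".join(" ".join(buffer).split())
            key, buffer = None, []
    return sections

sections = parse_sections(SAMPLE)
for name, fields in sections.items():
    print(name, fields)
```

Each inner dict corresponds to one single-row table, which matches the one-tibble-per-section layout described above.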