Extract Text from PDF

Hi guys, do you know how to extract text from a PDF? Below is a link to a small sample of the PDF. I would like to extract the code, such as "1111", and the tasks, such as "presiding over or participating in the proceedings of legislative bodies and administrative councils of national, state, regional or local governments or legislative assemblies...".
https://drive.google.com/file/d/1fJUBtB2S3xCDqJzGfdOiOdIWx4Hs_70M/view?usp=sharing

Hi, check this:

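library(pdftools)
library(tesseract)

# Render each PDF page to an image file (replace the path with your own).
file <- pdftools::pdf_convert("C:\\Users\\macosta\\Downloads\\TEST.pdf", dpi = 600)
# Run OCR on the rendered page images and print the recognized text.
text <- tesseract::ocr(file)
cat(text)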


# 1111 Legislators
# Legislators determine, formulate and direct policies of national, state, regional or local
# governments and international governmental agencies, and make, ratify, amend or repeal laws,
# public rules and regulations. They include elected and non-elected members of parliaments,
# councils and governments.
# Tasks include —
# 
# (a) presiding over or participating in the proceedings of legislative bodies and
# administrative councils of national, state, regional or local governments or legislative
# assemblies;
# 
# (b) determining, formulating and directing policies of national, state, regional or local
# governments;
# 
# (c) making, ratifying, amending or repealing laws, public rules and regulations within a
# statutory or constitutional framework;
# 
# (d) serving on government administrative boards or official committees;
# 
# (e) investigating matters of concern to the public and promoting the interests of the
# constituencies which they represent;
# 
# (f) attending community functions and meetings to provide service to the community,
# understand public opinion and provide information on government plans;
# 
# (g) negotiating with other legislators and representatives of interest groups in order to
# reconcile differing interests, and to create policies and agreements;
# 
# (h) as members of the government, directing senior administrators and officials of
# government departments and agencies in the interpretation and implementation of
# government policies.
# 
# Examples of the occupations classified here:
#   = City councillor
# = Government minister
# = Mayor
# = Member of parliament
# = President (government)
# = Secretary of state
# = Senator
# = State governor
# 1112 Senior Government Officials
# Senior government officials advise governments on policy matters, oversee the interpretation
# and implementation of government policies and legislation by government departments and
# agencies, represent their country abroad and act on its behalf, or carry out similar tasks in
# intergovernmental organizations. They plan, organize, direct, control and evaluate the overall
# activities of municipal or local, regional and national government departments, boards, agencies
# or commissions in accordance with legislation and policies established by government and
# legislative bodies.
# Tasks include —
# 
# (a) advising national, state, regional or local governments and legislators on policy
# matters;
# 
# (b) advising on the preparation of government budgets, laws and regulations, including
# amendments;
# 
# (c) establishing objectives for government departments or agencies in accordance with
# government legislation and policy;
# 
# (d) formulating or approving and evaluating programmes and procedures for the
# implementation of government policies in conjunction or consultation with
# government,
# 
# (e) recommending, reviewing, evaluating and approving documents, briefs and reports
# submitted by middle managers and senior staff members;
# 
# (f) ensuring appropriate systems and procedures are developed and implemented to
# provide budgetary control;
# 
# (g) coordinating activities with other senior government managers and officials;
# 
# (h) making presentations to legislative and other government committees regarding
# policies, programmes or budgets;
# 
# (1) overseeing the interpretation and implementation of government policies and
# legislation by government departments and agencies.
# 
# Examples of the occupations classified here:
#   = Ambassador
# = City administrator
# = Civil service commissioner
# = Consul-general
# = Director-general (government department)
# = Director-general (intergovernmental organization)
# = Fire commissioner
# = Inspector-general (police)
# = Permanent head (government department)
# = Police chief constable
# = Police commissioner
# = Secretary-general (government administration)
# = Under-secretary (government)
# Note
# Chief executives of government-owned enterprises are included in Unit Group 1120: Managing Directors and
# Chief Executives.
# 1113 Traditional Chiefs and Heads of Villages
# Traditional chiefs and heads of villages perform a variety of legislative, administrative and
# ceremonial tasks and duties, determined by ancient traditions as well as by the division of rights
# and responsibilities between village chiefs and the local, regional and national authorities.
# Tasks include —
# (a) allocating the use of communal land and other resources among households in the
# community or village;
# (b) collecting and distributing surplus production of the community or village;
# (c) settling disputes between members of the community or village;
# (d) disciplining members of the community or village for violation of rules and customs;
# (e) performing ceremonial duties in connection with births, marriages, deaths, harvests
# and other important occasions;
# (f) representing the community or village on local or regional councils;
# (g) informing the community or village about government rules and regulations.
# Examples of the occupations classified here:
#   = Village chief
# = Village head

Thank you. Do you know how to extract only the tasks?

You can do it with regular expressions; I'm trying to put an example together.
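Here is a minimal sketch of the regular-expression approach in base R. It assumes the OCR text follows the layout shown above: a line starting with a four-digit code, then a "Tasks include" line, then tasks marked "(a)", "(b)", and so on, ending at "Examples of the occupations" or "Note". The function name extract_tasks and the sample text are made up for illustration; the sample is abridged from the OCR output above. The patterns are heuristics and may need adjusting for other pages.

```r
# Hypothetical helper: collect the tasks of each unit group from OCR text.
extract_tasks <- function(text) {
  # Flatten all pages into trimmed, non-empty lines.
  lines <- trimws(unlist(strsplit(paste(text, collapse = "\n"), "\n", fixed = TRUE)))
  lines <- lines[nzchar(lines)]

  codes <- character(); titles <- character(); tasks <- character()
  code <- NA_character_; title <- NA_character_
  in_tasks <- FALSE

  for (ln in lines) {
    if (grepl("^[0-9]{4} ", ln)) {
      # Unit group header, e.g. "1111 Legislators".
      code <- substr(ln, 1, 4)
      title <- trimws(substring(ln, 6))
      in_tasks <- FALSE
    } else if (grepl("^Tasks include", ln)) {
      in_tasks <- TRUE
    } else if (grepl("^(Examples of the occupations|Note$)", ln)) {
      in_tasks <- FALSE
    } else if (in_tasks && grepl("^\\([a-z0-9]\\) ", ln)) {
      # A new task; digits are allowed because OCR can read "(i)" as "(1)".
      codes <- c(codes, code)
      titles <- c(titles, title)
      tasks <- c(tasks, sub("^\\([a-z0-9]\\) ", "", ln))
    } else if (in_tasks && length(tasks) > 0) {
      # Continuation of a task that wraps over several lines.
      tasks[length(tasks)] <- paste(tasks[length(tasks)], ln)
    }
  }

  # Drop the trailing ";", "." or "," that closes each task.
  data.frame(code = codes, title = titles,
             task = sub("[;.,]$", "", tasks),
             stringsAsFactors = FALSE)
}

# Abridged sample of the OCR output; in practice pass `text` from tesseract::ocr().
sample_text <- paste(
  "1111 Legislators",
  "Legislators determine, formulate and direct policies of national, state, regional or local",
  "governments and international governmental agencies.",
  "Tasks include \u2014",
  "",
  "(a) presiding over or participating in the proceedings of legislative bodies and",
  "administrative councils of national, state, regional or local governments or legislative",
  "assemblies;",
  "",
  "(b) determining, formulating and directing policies of national, state, regional or local",
  "governments;",
  "",
  "Examples of the occupations classified here:",
  "= City councillor",
  "= Mayor",
  "1113 Traditional Chiefs and Heads of Villages",
  "Traditional chiefs and heads of villages perform a variety of legislative tasks.",
  "Tasks include \u2014",
  "(c) settling disputes between members of the community or village;",
  "(g) informing the community or village about government rules and regulations.",
  "Examples of the occupations classified here:",
  "= Village chief",
  sep = "\n"
)

tasks_df <- extract_tasks(sample_text)
print(tasks_df)
```

The result has one row per task, with the code and title repeated, so you can filter it by code afterwards, e.g. tasks_df[tasks_df$code == "1111", ].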

I've run into the issue that PDF tables are not actually stored as tables, so extracting them is very difficult.

Any tools specializing in PDF table extraction?

To extract PDF tables, you can try the {tabulizer} package (ropensci/tabulizer on GitHub: bindings for the Tabula PDF table extractor library). Note that this requires installation of some Java dependencies, specifically the Tabula library.