Workaround for Extracting Text from PDF Files Using pdftotext in R

Introduction to readPDF in the tm Package

The readPDF function in R’s tm package is designed to read and parse PDF files, but it has been plagued by bugs and inconsistencies. In this article, we will explore the issues with readPDF and provide a workaround that calls pdftotext directly.

Background on xpdf and pdftotext

Before diving into the solutions, it’s essential to understand the role of xpdf and pdftotext. xpdf is a PDF viewer for the X Window System that ships with a set of command-line utilities. One of them, pdftotext, converts PDF pages to plain text.

readPDF relies on xpdf and pdftotext being available. If these tools are not properly installed, or cannot be found on the PATH, readPDF will fail.

Issues with readPDF

The issues with readPDF include:

  • Bugs: readPDF has a history of bugs that make it unreliable.
  • Lack of configuration options: Unlike other PDF parsing tools, readPDF does not provide many configuration options to fine-tune the parsing process.
  • Inconsistent results: The output of readPDF can be inconsistent, and it may fail to extract certain information from the PDF file.

Workaround using pdftotext

Given these issues with readPDF, a better approach is to use pdftotext as a standalone tool to extract text from PDF files. Here’s how you can do it:

Installing xpdf and Setting Up PATHs

To use pdftotext, you need to have xpdf installed on your system. The installation process varies depending on the operating system. For Windows, you can download the installer from the official website.

Once you’ve installed xpdf, add the directory that contains its executables to your system’s PATH environment variable. This makes the xpdf command-line tools, including pdftotext, available from your terminal or command prompt.
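One way to confirm from within R that the PATH change took effect is to look the executable up with Sys.which(). This is only a quick sanity check, and the variable name pdftotext_path is chosen here for illustration:

```r
# Look up pdftotext on the PATH. Sys.which() returns an empty string
# when the executable cannot be found.
pdftotext_path <- Sys.which("pdftotext")

if (!nzchar(pdftotext_path)) {
  message("pdftotext was not found on the PATH; check the xpdf installation.")
} else {
  message("Found pdftotext at: ", pdftotext_path)
}
```

If the lookup comes back empty, you can still call pdftotext by its full path, as the example in the next section does.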

Using pdftotext to Extract Text

Here’s an example of how you can use pdftotext to extract text from a PDF file:

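# wait = TRUE blocks until pdftotext has finished, so the .txt file
# exists before R tries to read it.
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
             '"C:/Users/FCG/Desktop/NoteF7001.pdf"'), wait = TRUE)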

This command runs pdftotext on the specified PDF file. Because no output file is given, pdftotext writes the text to a file with the same base name and a .txt extension, in the same directory as the PDF.
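If you convert files regularly, the call can be wrapped in a small helper that checks the exit status and returns the path of the text file. The following is a minimal sketch, not a definitive implementation: the function name pdf_to_text and its arguments are hypothetical, it assumes pdftotext is on the PATH unless exe is given, and it uses the -layout option of pdftotext to optionally preserve the physical page layout.

```r
# Hypothetical helper: convert one PDF to text with pdftotext and
# return the path of the resulting .txt file.
pdf_to_text <- function(pdf, exe = "pdftotext", layout = FALSE) {
  stopifnot(file.exists(pdf))

  # Write the output next to the PDF, with a .txt extension.
  txt <- paste0(tools::file_path_sans_ext(pdf), ".txt")

  # Build the argument list; -layout is only added when requested.
  args <- c(if (layout) "-layout", shQuote(pdf), shQuote(txt))

  # system2() waits for the command and returns its exit status.
  status <- system2(exe, args)
  if (status != 0 || !file.exists(txt)) {
    stop("pdftotext failed for ", pdf, " (exit status ", status, ")")
  }
  txt
}

# Example call (paths are placeholders):
# txt_file <- pdf_to_text("C:/path/to/file.pdf",
#                         exe = "C:/Program Files/xpdf/pdftotext.exe")
```

Returning the output path makes it easy to pass the result straight to whatever reads the text into R.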

Reading Text into R

Once you have extracted the text using pdftotext, you can read it into R using the Corpus function from the tm package:

library(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7001.txt"))
inspect(mycorpus)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Market Notice
Number: Date F7001 08 May 2013

New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.

Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EW US EQUITY US4642868487 1 (R1 per point)

Contract Size / Nominal

Expiry Dates & Times

10am New York Time; 14 Jun 2013 / 16 Sep 2013

Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price

USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)

4pm underlying spot level as captured by the JSE.

Currency Reference Price

The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.

JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za

Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons

Member of the World Federation of Exchanges

Company Secretary: GC Clarke
Settlement Method

Cash Settled

-

Clearing House Fees -

On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01

Initial Margin Class Spread Margin V.S.R. Expiry Date

R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013

The above instrument has been designated as "Foreign" by the South African Reserve Bank

Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or email@idx.com

Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: email@grahamsmale.com

Distributed by the Company Secretariat +27 11 520 7346
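If you do not need a tm corpus, the extracted file can also be read as a plain character vector with base R. The sketch below is an assumption-laden illustration rather than part of the original workflow: the function name read_text_file is hypothetical, and it strips the form feed characters that pdftotext uses to separate pages, then drops empty lines.

```r
# Hypothetical helper: read a pdftotext output file into a character
# vector with one element per non-empty line.
read_text_file <- function(path) {
  lines <- readLines(path, warn = FALSE)

  # Remove form feeds (page separators), then surrounding whitespace.
  lines <- gsub("\f", "", lines, fixed = TRUE)
  lines <- trimws(lines)

  # Keep only lines that still contain text.
  lines[nzchar(lines)]
}

# Example call (path is a placeholder):
# notice <- read_text_file("C:/path/to/file.txt")
# head(notice)
```

A plain character vector is often easier to search with grepl() or regmatches() when you only need a few fields from the document.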

Conclusion

The readPDF function in the tm package has been plagued by bugs and inconsistencies. While it provides a convenient way to read PDF files, its limitations make it less reliable than calling pdftotext directly.

By using pdftotext as a standalone tool to extract text from PDF files, you can overcome the issues with readPDF. This approach requires you to install and configure xpdf, but it provides more control over the parsing process and allows for better results.


Last modified on 2025-05-05