Parsing a PDF file in Java

Parsing a PDF file is sometimes necessary to extract information from it or to automate a workflow. In this article we learn how to parse a PDF file in Java using the well-known Apache PDFBox library.

Extracting Text from a PDF

The first example extracts the text from a PDF file using PDFBox and prints it. We print just the text, without any style or location information. This example serves to get us started with the PDFBox library.

Maven build file

Add the following Maven dependency declaration to build the example with PDFBox.

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>2.0.29</version>
</dependency>

Loading the PDF file

To load the PDF file, we use the static load() method of the PDDocument class as follows:

PDDocument doc = PDDocument.load(new File(args[0]));

Number of pages in the PDF

Let us check the number of pages in the PDF file as follows:

int npages = doc.getNumberOfPages();
System.out.printf("document has %d pages\n", npages);

Creating a text stripper

The main class used to extract text from the PDF is PDFTextStripper. It has a number of methods which are called back while the PDF is processed. Initialize an instance like this:

PDFTextStripper stripper = new PDFTextStripper();

Looping over the pages

A PDF file may have multiple pages. Here is how you can loop over each page.

for (int np = 1 ; np <= npages ; np++) {
  stripper.setStartPage(np);
  stripper.setEndPage(np);
  // process the page here
}

Extracting text from the page

We can now extract the text using the method PDFTextStripper.getText() and print it.

for (int np = 1 ; np <= npages ; np++) {
  stripper.setStartPage(np);
  stripper.setEndPage(np);
  String text = stripper.getText(doc);
  System.out.printf("Page %d\n", np);
  System.out.println(text);
}

After processing the document, we need to close the document to free up resources.

doc.close();

And that is how easy it is to extract text from a PDF. By storing the extracted text, you can provide a simple search functionality by searching through it using something like Lucene.
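
As a rough sketch of the search idea, and deliberately not using Lucene, the text of each page could be kept in a map and scanned with plain string matching. The class PageTextSearch and its methods are hypothetical names introduced for illustration; the strings passed to addPage() stand in for what stripper.getText(doc) returns for each page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: a naive stand-in for a real index such as Lucene.
final class PageTextSearch {
    // Page number (1-based, as used by PDFTextStripper) -> extracted text of that page.
    private final Map<Integer, String> pages = new LinkedHashMap<>();

    // Store the text of one page, lower-cased so that lookups are case-insensitive.
    void addPage(int pageNumber, String text) {
        pages.put(pageNumber, text.toLowerCase(Locale.ROOT));
    }

    // Return the numbers of all pages whose text contains the given term.
    List<Integer> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : pages.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Inside the extraction loop, a call such as index.addPage(np, text) would fill the map, and index.search("invoice") would then list the pages mentioning that word. A linear scan like this only suits small documents; a real index is the better choice for anything larger.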

Extracting Text Locations from PDF

Let us now see how to extract text along with its location on the page using PDFBox. The setup is similar to the above, except that we extend the PDFTextStripper class and override the PDFTextStripper.writeString() method to get the text locations.

To that end, we extend PDFTextStripper in a class called ShowTextLocation as follows:

public class ShowTextLocation extends PDFTextStripper
{
}

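The program renders each page of the PDF to an image and then outlines each text element on it, taking the text transformations (font size, rotation, shear, etc.) into account. The outlines show the bounding boxes so you can check the correctness of the extracted coordinates.

We override the writeString() method and process each TextPosition element in it. For each element we apply a series of transformations to its rectangle, draw the result onto the image, and print the resulting location and size. The red outline is based on the width and height reported by the TextPosition, while the blue outline is based on the bounding box of the font; the bounds of the blue outline are printed to the console. The methods below use a few members of ShowTextLocation whose declarations are not shown here: document, filename, g2d, the transforms flipAT, rotateAT and transAT, and the constant SCALE.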

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws java.io.IOException
{
    for (TextPosition text : textPositions) {
        AffineTransform at = text.getTextMatrix().createAffineTransform();
        Rectangle2D.Float rect = new Rectangle2D.Float(0, 0, 
                               text.getWidthDirAdj() / text.getTextMatrix().getScalingFactorX(),
                               text.getHeightDir() / text.getTextMatrix().getScalingFactorY());
        Shape s = at.createTransformedShape(rect);
        s = flipAT.createTransformedShape(s);
        s = rotateAT.createTransformedShape(s);
        g2d.setColor(Color.red);
        g2d.draw(s);
        PDFont font = text.getFont();
        BoundingBox bbox = font.getBoundingBox();
        float xadvance = font.getWidth(text.getCharacterCodes()[0]); // todo: should iterate all chars
        rect = new Rectangle2D.Float(0, bbox.getLowerLeftY(), xadvance, bbox.getHeight());
        if (font instanceof PDType3Font) {
            at.concatenate(font.getFontMatrix().createAffineTransform());
        } else {
            at.scale(1/1000f, 1/1000f);
        }
        s = at.createTransformedShape(rect);
        s = flipAT.createTransformedShape(s);
        s = rotateAT.createTransformedShape(s);

        g2d.setColor(Color.blue);
        g2d.draw(s);
        Rectangle2D bounds = s.getBounds2D();
        System.out.printf("[%s] (%f, %f, %f, %f)\n", text.toString(),
                  bounds.getX(), bounds.getY(), bounds.getWidth(), bounds.getHeight());
    }
}
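
The blue rectangle relies on a unit conversion that is easy to miss: for fonts other than Type 3, the bounding box and advance width are expressed in glyph space, in thousandths of a text unit, so the transform is scaled by 1/1000 before the rectangle is mapped onto the page. The sketch below reproduces that arithmetic with plain java.awt.geom classes. The class name GlyphBoxDemo, the matrix values (12x scaling placed at (72, 700)) and the glyph metrics are made-up illustrations, not values read from a real PDF.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical demo class; it only uses the JDK, not PDFBox.
final class GlyphBoxDemo {
    // Map a glyph-space rectangle (1/1000 text units) through a text rendering matrix.
    static Rectangle2D toUserSpace(AffineTransform textRenderingMatrix, Rectangle2D glyphBox) {
        // Copy the matrix so the caller's instance is not modified.
        AffineTransform at = new AffineTransform(textRenderingMatrix);
        // Same step as at.scale(1/1000f, 1/1000f) in writeString().
        at.scale(1 / 1000.0, 1 / 1000.0);
        return at.createTransformedShape(glyphBox).getBounds2D();
    }

    public static void main(String[] args) {
        // Assumed values: 12x scaling (font size) and a translation to (72, 700).
        AffineTransform matrix = new AffineTransform(12, 0, 0, 12, 72, 700);
        // Assumed metrics: advance width 500, box from y = -200 up to y = 900.
        Rectangle2D glyphBox = new Rectangle2D.Double(0, -200, 500, 1100);
        Rectangle2D box = toUserSpace(matrix, glyphBox);
        System.out.printf("x=%.1f y=%.1f w=%.1f h=%.1f%n",
                box.getX(), box.getY(), box.getWidth(), box.getHeight());
    }
}
```

With these assumed numbers the glyph box becomes a rectangle 6 units wide and 13.2 units tall whose lower-left corner sits at (72, 697.6), still in PDF coordinates. The flip and rotation transforms are what subsequently move it into image coordinates.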

The image on which the text bounding boxes are drawn is created in the following method, stripPage(), which is invoked for each page of the PDF.

private void stripPage(int page) throws java.io.IOException
{
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    BufferedImage image = pdfRenderer.renderImage(page, SCALE);
    PDPage pdPage = document.getPage(page);
    PDRectangle cropBox = pdPage.getCropBox();

    // flip y-axis
    flipAT = new AffineTransform();
    flipAT.translate(0, pdPage.getBBox().getHeight());
    flipAT.scale(1, -1);

    // page may be rotated
    rotateAT = new AffineTransform();
    int rotation = pdPage.getRotation();
    if (rotation != 0)
    {
        PDRectangle mediaBox = pdPage.getMediaBox();
        switch (rotation)
        {
        case 90:
            rotateAT.translate(mediaBox.getHeight(), 0);
            break;
        case 270:
            rotateAT.translate(0, mediaBox.getWidth());
            break;
        case 180:
            rotateAT.translate(mediaBox.getWidth(), mediaBox.getHeight());
            break;
        default:
            break;
        }
        rotateAT.rotate(Math.toRadians(rotation));
    }

    // cropbox
    transAT = AffineTransform.getTranslateInstance(-cropBox.getLowerLeftX(), cropBox.getLowerLeftY());

    g2d = image.createGraphics();
    g2d.setStroke(new BasicStroke(0.1f));
    g2d.scale(SCALE, SCALE);

    setStartPage(page + 1);
    setEndPage(page + 1);

    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
    writeText(document, dummy);

    // beads in green
    g2d.setStroke(new BasicStroke(0.4f));
    List<PDThreadBead> pageArticles = pdPage.getThreadBeads();
    for (PDThreadBead bead : pageArticles)
    {
        if (bead == null)
        {
            continue;
        }
        PDRectangle r = bead.getRectangle();
        Shape s = r.toGeneralPath().createTransformedShape(transAT);
        s = flipAT.createTransformedShape(s);
        s = rotateAT.createTransformedShape(s);
        g2d.setColor(Color.green);
        g2d.draw(s);
    }

    g2d.dispose();

    String imageFilename = filename;
    int pt = imageFilename.lastIndexOf('.');
    imageFilename = imageFilename.substring(0, pt) + "-marked-" + (page + 1) + ".png";
    ImageIO.write(image, "png", new File(imageFilename));
}
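
The flip and rotation transforms in stripPage() are needed because PDF coordinates have their origin at the bottom-left corner with y increasing upwards, while Java 2D images have their origin at the top-left corner with y increasing downwards. The following sketch builds the same two transforms for a hypothetical 612 x 792 page and applies them to sample points. The class and method names are illustrative and not part of PDFBox.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical demo class; it only uses the JDK, not PDFBox.
final class PageTransformDemo {
    // Same construction as flipAT: move the origin to the top edge, then mirror y.
    static AffineTransform flip(double pageHeight) {
        AffineTransform at = new AffineTransform();
        at.translate(0, pageHeight);
        at.scale(1, -1);
        return at;
    }

    // Same construction as rotateAT for the rotations handled in stripPage().
    static AffineTransform rotate(int rotation, double width, double height) {
        AffineTransform at = new AffineTransform();
        switch (rotation) {
            case 90:  at.translate(height, 0); break;
            case 270: at.translate(0, width); break;
            case 180: at.translate(width, height); break;
            default:  break;
        }
        at.rotate(Math.toRadians(rotation));
        return at;
    }

    public static void main(String[] args) {
        // A point 700 units above the bottom edge ends up 92 units below the top edge.
        Point2D flipped = flip(792).transform(new Point2D.Double(100, 700), null);
        System.out.println(flipped);
        // With a 90 degree rotation the corner (612, 792) lands at (0, 612).
        Point2D rotated = rotate(90, 612, 792).transform(new Point2D.Double(612, 792), null);
        System.out.println(rotated);
    }
}
```

The translation before each rotation keeps the rotated page inside the positive quadrant, which is why a page rotated by 90 or 270 degrees ends up with its width and height swapped.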

And here is the main section of the program, where the PDF is opened and each page is processed using stripPage().

    PDDocument document = null;
    try {
        document = PDDocument.load(new File(args[0]));
        ShowTextLocation stripper = new ShowTextLocation(document, args[0]);
        stripper.setSortByPosition(true);
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            stripper.stripPage(page);
        }
    } finally {
        if (document != null) document.close();
    }

This code was derived from the PDFBox example DrawPrintTextLocations.

Once you run this example, you should get a PNG file which shows the text bounding boxes. You will also get the text along with locations on the console. Do with it as you please.
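
The name of each PNG is derived from the input filename by replacing the extension with "-marked-", the page number and ".png". The sketch below isolates that logic in a hypothetical helper called MarkedImageName. Unlike the code in stripPage(), it also accepts a filename without a dot; that fallback is an assumption of this sketch and not behaviour of the example above.

```java
// Hypothetical helper that mirrors the filename logic at the end of stripPage().
final class MarkedImageName {
    // pageIndex is 0-based, as in stripPage(); the generated name uses a 1-based page number.
    static String forPage(String pdfFilename, int pageIndex) {
        int dot = pdfFilename.lastIndexOf('.');
        // Assumption of this sketch: keep the whole name when there is no extension.
        String base = (dot < 0) ? pdfFilename : pdfFilename.substring(0, dot);
        return base + "-marked-" + (pageIndex + 1) + ".png";
    }

    public static void main(String[] args) {
        System.out.println(forPage("report.pdf", 0)); // prints report-marked-1.png
    }
}
```

So a three-page report.pdf would produce report-marked-1.png through report-marked-3.png next to the input file.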

Source Code

The source code for this example is published on GitHub. Get it here.

Conclusion

In this article, we learned how to parse a PDF file using Apache PDFBox, a library which helps in taming the hard-to-handle PDF file format. We saw how to extract the text from a PDF file and how to obtain the location of each text element on the page.