Parsing a PDF file in Java
Parsing a PDF file is sometimes necessary to extract information from it or to automate a task. In this article, we learn how to parse a PDF file in Java using the well-known Apache PDFBox library.
Extracting Text from a PDF
The first example extracts and prints the text from a PDF file using PDFBox. We print just the text, without any style or location information. This example serves to get us started with the PDFBox library.
Maven build file
Add the following Maven dependency declaration to build the example with PDFBox.
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>2.0.29</version>
</dependency>
Loading the PDF file
To load the PDF file, we use the static load() method of the PDDocument class. It takes the PDF as a File and returns a PDDocument instance.
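A minimal sketch of the loading step, assuming PDFBox 2.0.x and a PDF path passed as the first command-line argument. The class name LoadPdf and the helper open() are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;

class LoadPdf {
    // Opens the file and returns the document; the caller is responsible for closing it.
    static PDDocument open(String path) throws IOException {
        return PDDocument.load(new File(path));
    }

    public static void main(String[] args) throws IOException {
        PDDocument doc = open(args[0]);
        System.out.println("Loaded " + args[0]);
        doc.close();
    }
}
```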
Number of pages in the PDF
Let us check the number of pages in the PDF file. The getNumberOfPages() method of PDDocument returns it.
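Here is a small sketch of the page count, again assuming PDFBox 2.0.x. The class PageCount and the method countPages() are hypothetical; the variable name npages is chosen to match the loop shown later in this article.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;

class PageCount {
    // Loads the PDF, reads the page count, and closes the document again.
    static int countPages(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return doc.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        int npages = countPages(new File(args[0]));
        System.out.println("Number of pages: " + npages);
    }
}
```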
Creating a text stripper
The main class used to extract text from the PDF is PDFTextStripper. It has a number of methods which are called back while processing the PDF. Initialize an instance before processing any pages.
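Creating the stripper can be sketched as below, assuming PDFBox 2.0.x, where the no-argument constructor is declared to throw IOException. The class StripperSetup and the method newStripper() are hypothetical names.

```java
import java.io.IOException;
import org.apache.pdfbox.text.PDFTextStripper;

class StripperSetup {
    // One stripper instance can be reused for every page of a document.
    static PDFTextStripper newStripper() throws IOException {
        return new PDFTextStripper();
    }

    public static void main(String[] args) throws IOException {
        PDFTextStripper stripper = newStripper();
        System.out.println("Created " + stripper.getClass().getSimpleName());
    }
}
```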
Looping over the pages
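A PDF file may have multiple pages. To process one page at a time, set the start page and the end page of the stripper to the same page number; these page numbers are 1-based. Here npages is the page count of the document.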
for (int np = 1; np <= npages; np++) {
    stripper.setStartPage(np);
    stripper.setEndPage(np);
    // process the page here
}
Extracting text from the page
We can now extract the text using the method PDFTextStripper.getText() and print it.
for (int np = 1; np <= npages; np++) {
    stripper.setStartPage(np);
    stripper.setEndPage(np);
    String text = stripper.getText(doc);
    System.out.printf("Page %d\n", np);
    System.out.println(text);
}
After processing the document, we need to close it by calling close() on the PDDocument to free up resources.
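One way to make sure the document is always closed is a try-with-resources block, which works because PDDocument implements Closeable. The sketch below puts the steps of this section together under that assumption (PDFBox 2.0.x); the class ExtractText and the method extractPages() are hypothetical names, and returning the text as a list is a choice made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class ExtractText {
    // Returns the text of each page, in page order.
    static List<String> extractPages(File pdf) throws IOException {
        List<String> pages = new ArrayList<>();
        // try-with-resources calls doc.close() even if extraction throws
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            int npages = doc.getNumberOfPages();
            for (int np = 1; np <= npages; np++) {
                stripper.setStartPage(np);
                stripper.setEndPage(np);
                pages.add(stripper.getText(doc));
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        List<String> pages = extractPages(new File(args[0]));
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1));
            System.out.println(pages.get(i));
        }
    }
}
```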
And that is how easy it is to extract text from a PDF. By storing the extracted text, you can provide simple search functionality by searching through it using something like Lucene.
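To give a rough idea of the search use case, the sketch below stores the text of each page in a map and runs a case-insensitive substring search over it. This is a plain-Java stand-in and not Lucene, which would add tokenizing, ranking and a persistent index. The class PageSearch, its methods and the sample page text are all made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PageSearch {
    // page number (1-based) -> lower-cased page text
    private final Map<Integer, String> pages = new LinkedHashMap<>();

    void addPage(int pageNumber, String text) {
        pages.put(pageNumber, text.toLowerCase(Locale.ROOT));
    }

    // Returns the numbers of all pages whose text contains the term.
    List<Integer> pagesContaining(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : pages.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PageSearch index = new PageSearch();
        index.addPage(1, "Parsing a PDF file in Java");
        index.addPage(2, "Extracting text locations");
        System.out.println(index.pagesContaining("pdf"));
    }
}
```

In a real program, the text passed to addPage() would come from PDFTextStripper.getText() for each page.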
Extracting Text Locations from PDF
Let us now see how to extract text along with its location on the page using PDFBox. The setup is similar to the above, except that we extend the PDFTextStripper class and override its writeString() method to get the text locations.
To that end, we extend PDFTextStripper in a class called ShowTextLocation.
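The skeleton below is a reconstruction, not the original listing: the fields are inferred from the names used by writeString() and stripPage() in this article and from the PDFBox example DrawPrintTextLocations. The SCALE value of 4 is an assumption, and the constructor throws IOException because the PDFTextStripper constructor does so in PDFBox 2.0.x.

```java
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class ShowTextLocation extends PDFTextStripper {
    // Assumed rendering scale for the output image
    static final int SCALE = 4;

    private final PDDocument document;
    private final String filename;

    // Set up per page in stripPage() and used while drawing
    private AffineTransform flipAT;
    private AffineTransform rotateAT;
    private AffineTransform transAT;
    private Graphics2D g2d;

    ShowTextLocation(PDDocument document, String filename) throws IOException {
        this.document = document;
        this.filename = filename;
    }

    // writeString(...) and stripPage(...) from this article go here
}
```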
The program creates an image from the PDF consisting of just the text with all the text transformations applied - the font style, weight, size, rotation, shear, etc. Each text element is then outlined to show the bounding box so you can check the correctness of the coordinates extracted.
We override the writeString() method and process each TextPosition element in it. Here we apply a series of transformations to each text element. After rendering it onto an image, we extract the locations and sizes.
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws java.io.IOException
{
    for (TextPosition text : textPositions) {
        // Red box: built from the adjusted width and height reported by the TextPosition
        AffineTransform at = text.getTextMatrix().createAffineTransform();
        Rectangle2D.Float rect = new Rectangle2D.Float(0, 0,
                text.getWidthDirAdj() / text.getTextMatrix().getScalingFactorX(),
                text.getHeightDir() / text.getTextMatrix().getScalingFactorY());
        Shape s = at.createTransformedShape(rect);
        s = flipAT.createTransformedShape(s);
        s = rotateAT.createTransformedShape(s);
        g2d.setColor(Color.red);
        g2d.draw(s);

        // Blue box: built from the font bounding box and the glyph advance width
        PDFont font = text.getFont();
        BoundingBox bbox = font.getBoundingBox();
        float xadvance = font.getWidth(text.getCharacterCodes()[0]); // todo: should iterate all chars
        rect = new Rectangle2D.Float(0, bbox.getLowerLeftY(), xadvance, bbox.getHeight());
        if (font instanceof PDType3Font) {
            // Type 3 fonts carry their own font matrix
            at.concatenate(font.getFontMatrix().createAffineTransform());
        } else {
            // other fonts use glyph units of 1/1000
            at.scale(1/1000f, 1/1000f);
        }
        s = at.createTransformedShape(rect);
        s = flipAT.createTransformedShape(s);
        s = rotateAT.createTransformedShape(s);
        g2d.setColor(Color.blue);
        g2d.draw(s);

        // Print the text and the bounds of the blue box
        Rectangle2D bounds = s.getBounds2D();
        System.out.printf("[%s] (%f, %f, %f %f)\n", text.toString(),
                bounds.getX(), bounds.getY(), bounds.getWidth(), bounds.getHeight());
    }
}
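The scaling by 1/1000 in writeString() can be tried out without PDFBox. The sketch below uses only java.awt.geom and maps a rectangle given in glyph units into page coordinates through a text matrix. The class GlyphBox, the method toPageSpace() and all the numbers (12 pt text at (72, 700), an advance of 500, a font box starting at y = -200 with height 1000) are made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class GlyphBox {
    // Maps a rectangle in glyph units (1/1000 of a text-space unit)
    // into page coordinates using the given text matrix.
    static Rectangle2D toPageSpace(AffineTransform textMatrix, Rectangle2D glyphRect) {
        AffineTransform at = new AffineTransform(textMatrix); // copy, so the input stays untouched
        at.scale(1 / 1000.0, 1 / 1000.0);
        return at.createTransformedShape(glyphRect).getBounds2D();
    }

    public static void main(String[] args) {
        // Made-up text matrix: scale 12 in both directions, origin at (72, 700)
        AffineTransform textMatrix = new AffineTransform(12.0, 0.0, 0.0, 12.0, 72.0, 700.0);
        // Made-up glyph metrics
        Rectangle2D glyphRect = new Rectangle2D.Double(0, -200, 500, 1000);
        Rectangle2D box = toPageSpace(textMatrix, glyphRect);
        System.out.println(box.getX() + ", " + box.getY() + ", "
                + box.getWidth() + ", " + box.getHeight());
    }
}
```

With these values the box is about 6 units wide and 12 units tall, with its lower-left corner near (72, 697.6). The flip and rotation transforms would still have to be applied on top to get image coordinates.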
The code for creating the image on which the text is drawn with the bounding boxes is shown in the following method, stripPage(), which is invoked while processing each page of the PDF.
private void stripPage(int page) throws java.io.IOException
{
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    BufferedImage image = pdfRenderer.renderImage(page, SCALE);
    PDPage pdPage = document.getPage(page);
    PDRectangle cropBox = pdPage.getCropBox();

    // flip y-axis
    flipAT = new AffineTransform();
    flipAT.translate(0, pdPage.getBBox().getHeight());
    flipAT.scale(1, -1);

    // page may be rotated
    rotateAT = new AffineTransform();
    int rotation = pdPage.getRotation();
    if (rotation != 0)
    {
        PDRectangle mediaBox = pdPage.getMediaBox();
        switch (rotation)
        {
            case 90:
                rotateAT.translate(mediaBox.getHeight(), 0);
                break;
            case 270:
                rotateAT.translate(0, mediaBox.getWidth());
                break;
            case 180:
                rotateAT.translate(mediaBox.getWidth(), mediaBox.getHeight());
                break;
            default:
                break;
        }
        rotateAT.rotate(Math.toRadians(rotation));
    }

    // cropbox
    transAT = AffineTransform.getTranslateInstance(-cropBox.getLowerLeftX(), cropBox.getLowerLeftY());

    g2d = image.createGraphics();
    g2d.setStroke(new BasicStroke(0.1f));
    g2d.scale(SCALE, SCALE);

    // page is 0-based here, while the stripper expects 1-based page numbers
    setStartPage(page + 1);
    setEndPage(page + 1);

    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
    writeText(document, dummy);

    // beads in green
    g2d.setStroke(new BasicStroke(0.4f));
    List<PDThreadBead> pageArticles = pdPage.getThreadBeads();
    for (PDThreadBead bead : pageArticles)
    {
        if (bead == null)
        {
            continue;
        }
        PDRectangle r = bead.getRectangle();
        Shape s = r.toGeneralPath().createTransformedShape(transAT);
        s = flipAT.createTransformedShape(s);
        s = rotateAT.createTransformedShape(s);
        g2d.setColor(Color.green);
        g2d.draw(s);
    }

    g2d.dispose();

    String imageFilename = filename;
    int pt = imageFilename.lastIndexOf('.');
    imageFilename = imageFilename.substring(0, pt) + "-marked-" + (page + 1) + ".png";
    ImageIO.write(image, "png", new File(imageFilename));
}
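The flip and rotation transforms set up in stripPage() can be explored on their own with java.awt.geom. The sketch below rebuilds them for a page of a given size and applies them to a single point. The class PageTransforms, its method names and the page size of 612 x 792 are assumptions made for this example; unlike stripPage(), the sketch returns the identity transform for any rotation other than 90, 180 or 270.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class PageTransforms {
    // PDF puts the origin at the bottom-left, images put it at the top-left.
    static AffineTransform flip(double pageHeight) {
        AffineTransform at = new AffineTransform();
        at.translate(0, pageHeight);
        at.scale(1, -1);
        return at;
    }

    // Same translate-then-rotate scheme as the switch in stripPage().
    static AffineTransform rotate(int rotation, double width, double height) {
        AffineTransform at = new AffineTransform();
        switch (rotation) {
            case 90:
                at.translate(height, 0);
                break;
            case 270:
                at.translate(0, width);
                break;
            case 180:
                at.translate(width, height);
                break;
            default:
                return at; // 0 or unsupported: leave coordinates unchanged
        }
        at.rotate(Math.toRadians(rotation));
        return at;
    }

    public static void main(String[] args) {
        Point2D flipped = flip(792).transform(new Point2D.Double(100, 700), null);
        System.out.println("flipped: " + flipped.getX() + ", " + flipped.getY());
        Point2D rotated = rotate(90, 612, 792).transform(flipped, null);
        System.out.println("rotated: " + rotated.getX() + ", " + rotated.getY());
    }
}
```

On a page 792 units tall, the flip maps the point (100, 700) near the top of the PDF page to (100, 92) near the top of the image.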
And here is the main section of the program where stripPage() is invoked. The PDF is opened and each page processed using stripPage().
PDDocument document = null;
try {
    document = PDDocument.load(new File(args[0]));
    ShowTextLocation stripper = new ShowTextLocation(document, args[0]);
    stripper.setSortByPosition(true);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        stripper.stripPage(page);
    }
} finally {
    if (document != null) document.close();
}
This code was derived from the PDFBox example DrawPrintTextLocations.
Once you run this example, you should get one PNG file per page, named after the input file with a "-marked-" suffix and the page number, which shows the text bounding boxes. You will also get the text along with its locations on the console. Do with it as you please.
Source Code
The source code for this example is published on GitHub.
Conclusion
In this article, we learned how to parse a PDF file using Apache PDFBox, a library which helps in taming the hard-to-handle PDF file format. We extracted the text from a PDF file, and then the text along with its locations on the page.