Reading Data From A PDF File In Java

Hi Guys,

In this blog, I am going to explain how to read text data from a PDF file in Java.

For this, you have to add a Maven dependency on Apache PDFBox. Apache PDFBox is an open source Java library for working with PDF documents. It can be used to create new PDF documents, manipulate existing ones, and extract their content.

So let's start with reading the data from the PDF file. First, add the Maven dependency to your pom.xml:

<dependency>
	<groupId>org.apache.pdfbox</groupId>
	<artifactId>pdfbox</artifactId>
	<version>2.0.6</version>
</dependency>

Then use the code below to read the text from the PDF file. It targets PDFBox 2.x; in PDFBox 3.x, PDDocument.load was replaced by Loader.loadPDF:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextReader {
	public static void main(String[] args) throws IOException {
		// try-with-resources closes the document even if extraction fails
		try (PDDocument pdDocument = PDDocument.load(new File("/file_path/fileToRead.pdf"))) {
			PDFTextStripper pdfTextStripper = new PDFTextStripper();
			// Emit text in visual order rather than the order stored in the file
			pdfTextStripper.setSortByPosition(true);
			String textInPDFFile = pdfTextStripper.getText(pdDocument);
			// Split on both Unix (\n) and Windows (\r\n) line endings
			String[] textLines = textInPDFFile.split("\\r?\\n");
			for (String textLine : textLines) {
				System.out.println(textLine);
			}
		}
	}
}

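In the above code, the setSortByPosition method is used because the order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the page. By default, PDFTextStripper returns text in the order it is stored in the file; with setSortByPosition(true), it orders the text by its position on the page instead.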
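
To see why the stored order and the visual order can differ, here is a small toy model in plain Java (no PDFBox needed). The Fragment class, the sample coordinates, and the simple top-to-bottom, left-to-right sort are all made up for illustration. This is only a rough sketch of the idea, not how PDFBox actually implements sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    // Hypothetical stand-in for a positioned piece of text; not a PDFBox class.
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page (assumption)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Sample data: the fragments are listed in an order that does not match the page layout.
    static List<Fragment> sampleFragments() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Total:", 50, 700));
        fragments.add(new Fragment("Invoice", 50, 100));
        fragments.add(new Fragment("42.00", 200, 700));
        fragments.add(new Fragment("#1001", 200, 100));
        return fragments;
    }

    // Joins fragments in the order given, i.e. the order they are stored in.
    static String inStoredOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Sorts a copy top-to-bottom, then left-to-right, before joining.
    static String inVisualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStoredOrder(sorted);
    }

    public static void main(String[] args) {
        List<Fragment> fragments = sampleFragments();
        System.out.println(inStoredOrder(fragments));
        System.out.println(inVisualOrder(fragments));
    }
}
```

Running it prints "Total: Invoice 42.00 #1001" for the stored order and "Invoice #1001 Total: 42.00" for the visual order, which is the kind of difference setSortByPosition(true) is meant to smooth out.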

Hope it helps!