Extract text from every page (page by page)

MathiasH · August 19, 2011, 5:38am

Hi,

I need to extract the texts from each page of a pdf and have some issues with it:

(1) There seems to be no way to get the page count from PdfExtractor, so an additional call to PdfFileInfo is necessary. Can't this be made available by PdfExtractor itself? Or is there a more elegant solution that I'm currently missing? The .NET version has a kind of iterator for the pages (as I saw in the forums), but my Java PDF.Kit does not offer such a thing (I'm using the current 3.9 version).

(2) PdfExtractor apparently does not release the PDF file completely, so the following step in my application (deleting the PDF) fails. I tried calling "close" on the extractor object, but this doesn't help.

(3) PdfExtractor seems to allow only one "extractText" call: if I call it more than once, I get errors about an invalid PDF header (?). If I create a new PdfExtractor for each page, everything works fine (but takes hours).

(4) What Aspose component would you suggest to use in more complex scenarios where one would like to extract text, annotations, bookmarks and modify document information and content? Is it really an "expert" solution to have a separate tool for each of these tasks which all need to parse the PDF file again? And don't think of 2 KB PDFs, think of 100MB+.

My code for the text extraction problem up to now:

try {
    PdfFileInfo pdfInfo = new PdfFileInfo(pdffile);
    PdfExtractor extractor = new PdfExtractor();
    extractor.bindPdf(pdffile);
    for (int i = 0; i < pdfInfo.getNumberofPages(); i++) {
        extractor.setStartPage(i);
        extractor.setEndPage(i);
        extractor.extractText();
        for (TextSegment seg : extractor.getFormattedText()) {
            // do something
        }
    }
    extractor.close();
} catch (Throwable t) {
    t.printStackTrace();
}
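
For readers following along, the shape of this loop can be sketched without the library. The example below is only an illustration: `PageTextSource` and `FakeSource` are hypothetical stand-ins, not Aspose classes, and the 1-based page numbering is an assumption that should be checked against the PDF.Kit documentation (the loop above starts at 0). What it shows is the try-with-resources pattern, which closes the source even when one page throws.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLoopSketch {

    /** Hypothetical stand-in for a page-by-page text extractor; NOT the Aspose API. */
    interface PageTextSource extends AutoCloseable {
        int pageCount();
        String textOfPage(int pageNumber); // assumption: pages are numbered from 1
        @Override
        void close();
    }

    /** In-memory fake so the sketch runs without any PDF library. */
    static final class FakeSource implements PageTextSource {
        private final String[] pages;
        boolean closed = false;

        FakeSource(String... pages) {
            this.pages = pages;
        }

        @Override
        public int pageCount() {
            return pages.length;
        }

        @Override
        public String textOfPage(int pageNumber) {
            if (closed) {
                throw new IllegalStateException("source is closed");
            }
            return pages[pageNumber - 1];
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Collects the text of every page; the source is closed even if a page fails. */
    static List<String> extractAll(PageTextSource source) {
        List<String> result = new ArrayList<>();
        try (PageTextSource s = source) {
            for (int page = 1; page <= s.pageCount(); page++) {
                result.add(s.textOfPage(page));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FakeSource source = new FakeSource("first page", "second page");
        System.out.println(extractAll(source));
        System.out.println("closed: " + source.closed);
    }
}
```

With the real extractor, a `finally` block around `extractor.close()` would play the same role as the try-with-resources statement here.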

Kind regards,
Mathias Harbeck

shahzadlatif · August 19, 2011, 11:07am

Hi Mathias,

I have logged a new feature request as PDFKITJAVA-29988 to support iterating over the pages and extracting text page by page. You’ll be notified via this forum thread once it is supported.

As far as points 2 and 3 are concerned, could you please share a sample PDF file that causes these issues, so we can investigate them at our end?

Regarding your last point, you only need Aspose.Pdf.Kit to edit existing PDF files. You can extract text, bookmarks, annotations, etc. using this single component. I would also like to share that it can process files considerably larger than 100 MB. An issue might occasionally occur due to a certain type of content or structure in the PDF file, but it usually works fine with PDF files large or small.

Regards,

MathiasH · August 21, 2011, 3:13am

Hi Shahzad,

regarding points 2 and 3: I’m currently not in my office, but I’ll try to provide a complete scenario by tomorrow.

regarding point 4: I’m quite sure that your tools are capable of loading large files - what annoys me is the fact that for each task (read PDF information, read annotations, read content, read bookmarks, modify content, modify bookmarks,…) I have to use a separate PDF-Editor class from PDF.Kit and each of those editors parses the file again (as far as I understand). Besides being relatively uncomfortable this seems to be a performance problem to me… (which is why I mentioned large files)
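
One possible way to reduce the repeated disk access is to read the file once and hand each task its own in-memory stream. The following is a minimal sketch under assumptions: `SharedPdfBytes` and `StreamTask` are hypothetical helper names, and whether a given PDF.Kit class accepts an `InputStream` would have to be checked in its documentation. It does not avoid re-parsing, and it holds the whole file in memory, which matters for 100MB+ documents.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SharedPdfBytes {

    /** A task that consumes a stream and may fail with an I/O error. */
    interface StreamTask<T> {
        T run(InputStream in) throws IOException;
    }

    private final byte[] bytes;

    /** Reads the file once; no file handle stays open after the constructor returns. */
    SharedPdfBytes(Path file) throws IOException {
        this.bytes = Files.readAllBytes(file);
    }

    /** Gives the task a fresh stream positioned at the start of the data. */
    <T> T with(StreamTask<T> task) throws IOException {
        return task.run(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("shared-", ".pdf");
        Files.write(temp, "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));

        SharedPdfBytes shared = new SharedPdfBytes(temp);
        Files.delete(temp); // the source file is no longer needed

        // Two independent "tasks" each see the data from the beginning.
        int first = shared.with(in -> in.read());
        int second = shared.with(in -> in.read());
        System.out.println((char) first + " " + (char) second);
    }
}
```

Because the bytes are copied up front, the source file can be deleted or moved while the tasks are still running.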

Regards,
Mathias

MathiasH · August 22, 2011, 12:29am

Hi Shahzad,

here is the example for point 3. The PDF was generated for this test case with OpenOffice Writer and contains only two pages with simple content.

The attached java code will show the error. It follows mostly the source code posted earlier.

Output is:

reading page 1
reading page 2
java.io.IOException: PDF header signature error.
at com.aspose.pdf.kit.ky.k(Unknown Source)
at com.aspose.pdf.kit.oi.a(Unknown Source)
at com.aspose.pdf.kit.oi.<init>(Unknown Source)
at com.aspose.pdf.kit.oi.<init>(Unknown Source)
at com.aspose.pdf.kit.cx.a(Unknown Source)
at com.aspose.pdf.kit.PdfExtractor.extractText(Unknown Source)
at point3.main(point3.java:23)
java.lang.NullPointerException
at com.aspose.pdf.kit.cx.a(Unknown Source)
at com.aspose.pdf.kit.PdfExtractor.extractText(Unknown Source)
at point3.main(point3.java:23)

Regards,
Mathias Harbeck

MathiasH · August 22, 2011, 12:44am

Hi Shahzad,

and here comes the example for point 2. The PDF is the same as for point 3, but the Java code is slightly changed: it only looks at page 1 to avoid the exception, calls extractor.close() at the end, and after all processing calls sourceFile.delete() to test whether the source file can be removed.

Output on my system:

Could not delete!
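
`File.delete()` only returns `false` on failure, so the output above does not say why the delete failed. A minimal sketch, assuming Java 7 or later is available, of how `java.nio.file.Files.delete` could be used to surface the reason instead (`DeleteDiagnostics` and `tryDelete` are illustrative names, and the temp file stands in for the real PDF):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeleteDiagnostics {

    /**
     * Tries to delete the file. Returns null on success, or the exception
     * type and message when the delete fails.
     */
    static String tryDelete(Path file) {
        try {
            Files.delete(file);
            return null;
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("point2-", ".pdf");
        // First attempt removes the file; second fails because it is already gone.
        System.out.println("first attempt:  " + tryDelete(temp));
        System.out.println("second attempt: " + tryDelete(temp));
    }
}
```

On platforms where an open handle blocks deletion, the exception message typically names the cause, which can help confirm that the file is still held open.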

Regards,
Mathias

shahzadlatif · August 22, 2011, 12:27pm

Hi Mathias,

We’re looking into these issues and you’ll be updated with the results shortly.

Regards,

shahzadlatif · August 23, 2011, 6:40am

Hi Mathias,

Please find the answer to your queries below:

Point 2: I have reproduced this problem and logged this issue as PDFKITJAVA-30068.
Point 3: I have reproduced this problem and logged this issue as PDFKITJAVA-30067.

As far as point 4 is concerned, you’re right: currently you’ll have to open and close the file multiple times using different classes. However, if we modify the API in future, we’ll let you know. I’m afraid that, in the meantime, you will have to use the current API.

We’re sorry for the inconvenience.
Regards,

MathiasH · November 9, 2011, 12:52am

Hi,

when will these bugs be fixed?

Regards,
Mathias Harbeck

shahzadlatif · November 17, 2011, 7:30am

Hi Mathias,

I’m sorry to share with you that these issues are not yet resolved. However, I have increased the priority of these issues to high and asked our development team to share the ETA of these issues. You’ll be updated as soon as the response is received.

We’re sorry for the inconvenience and appreciate your cooperation.
Regards,

aspose.notifier · February 21, 2013, 10:04am

The issues you have found earlier (filed as PDFKITJAVA-30068) have been fixed in Aspose.Pdf.Kit for Java 4.5.0.