Extracting Text from Multilayer PDF

mark_hodges · August 26, 2013, 12:00pm

I am trying to extract text from a multilayer PDF. Does Aspose.PDF support this? The document has an image with text placed in the document by position. There are no form fields within the document.

codewarior · August 27, 2013, 1:51am

Hi Mark,

Thanks for contacting support.

Do you mean that the text is placed under the image, or over the image, inside the PDF document? Could you please share the sample PDF file so that we can test the scenario at our end? We apologize for the inconvenience.

mark_hodges · August 27, 2013, 7:00am

I am not sure whether the text is in front of or behind the image, as the files are created by a vendor. I have attached a sample PDF. Thank you for your assistance.

codewarior · August 28, 2013, 5:58am

Hi Mark,

Thanks for sharing the resource file.

I have tested the scenario and am able to reproduce the problem: the text is not extracted properly, and not all of the text is extracted. So that it can be corrected, I have logged it in our issue tracking system as PDFNEWNET-35734. We will investigate this issue in detail and keep you updated on the status of a correction.

We apologize for the inconvenience.

mark_hodges · September 24, 2013, 7:35am

Is there any update on this issue?

tilal.ahmad · September 24, 2013, 11:14pm

Hi Mark,

Thanks for your inquiry. I’m afraid the reported issue is still not resolved; it is pending investigation in the queue along with other priority tasks. We have asked our development team to share an ETA and will update you as soon as we get feedback.

We are sorry for the inconvenience caused.

Best Regards,

mark_hodges · March 12, 2014, 9:14am

Any update on this issue?

tilal.ahmad · March 12, 2014, 10:37pm

Hi Mark,

We are sorry for the inconvenience. I am afraid your reported issue is still not resolved, as the development team is working on other priority tasks. We have already asked the team to share an ETA at their earliest convenience, and we will update you as soon as we make significant progress towards resolving the issue.

Thanks for your patience and cooperation.

Best Regards,

mark_hodges · April 24, 2014, 7:11am

I am at a point where I need either a resolution to this problem or an alternate method for handling it. If you believe a resolution may come soon, I will wait another month or two at the most. I am not trying to be impatient; however, I need to start working on an alternate solution if this issue will not be resolved soon.

tilal.ahmad · April 24, 2014, 10:33pm

Hi Mark,

Thanks for your feedback. I am afraid this issue is still not resolved, due to other priority tasks and the complexity of the issue. However, we have recorded your concern and shared it with the development team. We have also asked the team to share an ETA at their earliest convenience. We will update you as soon as we get feedback.

We are sorry for the inconvenience caused.

Best Regards,

aspose.notifier · June 5, 2014, 2:18pm

The issues you have found earlier (filed as PDFNEWNET-35734) have been fixed in Aspose.Pdf for .NET 9.3.0.

Blog post for this release can be viewed over this link

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

tilal.ahmad · June 23, 2014, 12:09pm

Hi Mark,

Thanks for your patience. As stated above, your reported issue has been fixed in Aspose.Pdf for .NET 9.3.0. Please use the following code snippet to extract text from a multilayer PDF; it should help you accomplish the task.

[C#]

Document pdfDocument = new Document("c:/pdftest/MarkH.pdf");
string extractedText;
StringBuilder resultBuilder = new StringBuilder();

foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        extractedText = Encoding.UTF8.GetString(textStream.ToArray());
    }
    resultBuilder.Append(extractedText);
}

File.WriteAllText("output.txt", resultBuilder.ToString());

Please feel free to contact us for any further assistance.

Best Regards,

mark_hodges · June 24, 2014, 1:36pm

Using the code you provided, I get extracted text but it is blank. Do you know any reason why your supplied code does not return any text?

codewarior · June 25, 2014, 2:04am

Hi Mark,

I have tested the scenario using the code snippet shared earlier and, as per my observations, the text is extracted from the PDF file. For your reference, I have also attached the text file containing the extracted text.

However, not all of the PDF's contents are extracted with this technique, because the input file also contains images. To get the text contained in the images, please convert the pages to an image format using Aspose.Pdf for .NET and then perform OCR on the resulting images using Aspose.OCR. For more information, please visit

mark_hodges · June 25, 2014, 6:47am

I am not getting the same results. I am using Aspose.Pdf 9.3.0 with VB.NET on .NET Framework 2.0. I have attached the code I used to test. Only the calls that use Aspose.Pdf.Text.TextAbsorber extract text. The attached code contains three private functions: extractMemberInfo, extractMemberID and verifyMemberNameOnPage. extractMemberInfo, which uses the code you provided, does not extract the text from the PDF page.

The functions are called within a For Each pdfPage as Page in pdfDocument.Pages loop.

codewarior · June 26, 2014, 6:25am

Hi Mark,

Thanks for sharing the details.

I have tested the scenario using the code snippet which you shared and, as per my observations, no text is returned when using the extractMemberInfo(…) method. However, when the extracted text is saved to a text file inside the extractMemberInfo(…) method, the extracted contents are written properly.

The issue seems to lie in returning the data from the extractMemberInfo(…) method.

[VB.NET]

Private Function extractMemberInfo(ByVal pdfpage As Page) As String
    Dim resultBuilder As System.Text.StringBuilder = New System.Text.StringBuilder()
    Dim extractedText As String = String.Empty

    Using textStream As New MemoryStream()
        Dim textDevice As New TextDevice()
        Dim textExtOptions As New TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure)
        textDevice.ExtractionOptions = textExtOptions
        textDevice.Process(pdfpage, textStream)
        textStream.Close()
        extractedText = System.Text.Encoding.UTF8.GetString(textStream.ToArray())
    End Using

    resultBuilder.Append(extractedText)
    File.WriteAllText("c:/pdftest/MarkH_output_VB.txt", resultBuilder.ToString())

    Return extractedText
End Function