I am trying to extract text from a multilayer PDF. Does Aspose.PDF support this? The document has an image with text placed in the document by position. There are no form fields within the document.
Hi Mark,
I am not sure whether the text is in front of or behind the image, as the documents are created by a vendor. I have attached a sample PDF. Thank you for your assistance.
Hi Mark,
I have tested the scenario and I am able to reproduce the problem: the text is not being extracted properly, and not all of the text is being extracted. For the sake of correction, I have logged it in our issue tracking system as PDFNEWNET-35734. We will investigate this issue in detail and will keep you updated on the status of a correction.
We apologize for the inconvenience.
Is there any update on this issue?
Hi Mark,
Any update on this issue?
Hi Mark,
I am at a point where I need either a resolution to this problem or an alternate method for handling it. If you believe a fix may be available soon, I will wait another month or two at the most. I am not trying to be impatient; however, I need to start working on an alternate solution if this issue will not be resolved soon.
Hi Mark,
The issues you have found earlier (filed as PDFNEWNET-35734) have been fixed in Aspose.Pdf for .NET 9.3.0.
Blog post for this release can be viewed over this link
This message was posted using Notification2Forum from Downloads module by Aspose Notifier.
Hi Mark,
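using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices; // TextDevice
using Aspose.Pdf.Text;    // TextExtractionOptions (the namespace may differ by release)

// Open the source PDF
Document pdfDocument = new Document("c:/pdftest/MarkH.pdf");
string extractedText;
StringBuilder resultBuilder = new StringBuilder();
foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // Extract the page text in Pure formatting mode
        TextDevice textDevice = new TextDevice();
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        // ToArray() still returns the buffer after the stream is closed
        textStream.Close();
        extractedText = Encoding.UTF8.GetString(textStream.ToArray());
    }
    // Collect the text of every page
    resultBuilder.Append(extractedText);
}
// Save the combined text of all pages
File.WriteAllText("output.txt", resultBuilder.ToString());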
Please feel free to contact us for any further assistance.
Best Regards,
Using the code you provided, I get extracted text but it is blank. Do you know any reason why your supplied code does not return any text?
Hi Mark,
I have tested the scenario using the code snippet shared earlier and, as per my observations, the text is being extracted from the PDF file. For your reference, I have also attached the text file containing the extracted text.
However, not all of the PDF contents are extracted with this technique because the input file also contains images. To get the text contained in the images, please convert the pages to image format using Aspose.Pdf for .NET and then perform OCR on the resulting images using Aspose.OCR. For more information, please visit
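For readers following along, the page-to-image step mentioned above might look roughly like the sketch below. It uses PngDevice and Resolution from Aspose.Pdf.Devices; the 300 DPI value and the output file naming are assumptions made for illustration, not settings from this thread, and the OCR step itself is left out because it depends on the Aspose.OCR version in use.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

class PageToImageSketch
{
    static void Main()
    {
        // Input path reused from the snippet earlier in this thread
        Document pdfDocument = new Document("c:/pdftest/MarkH.pdf");

        // 300 DPI is an assumed value; adjust it to what the OCR step needs
        Resolution resolution = new Resolution(300);
        PngDevice pngDevice = new PngDevice(resolution);

        foreach (Page pdfPage in pdfDocument.Pages)
        {
            // Hypothetical naming scheme: one PNG file per page
            string imagePath = "c:/pdftest/page_" + pdfPage.Number + ".png";
            using (FileStream imageStream = new FileStream(imagePath, FileMode.Create))
            {
                // Render the page to the PNG stream
                pngDevice.Process(pdfPage, imageStream);
            }
        }
    }
}
```

Each resulting PNG can then be passed to the OCR engine one page at a time.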
I am not getting the same results. I am using Aspose.Pdf 9.3.0 with VB.NET on .NET Framework 2.0. I have attached the code I used for testing. Only the calls that use Aspose.Pdf.Text.TextAbsorber extract text. The attached code contains 3 Private Functions: extractMemberInfo, extractMemberID and verifyMemberNameOnPage. extractMemberInfo, which uses the code you provided, does not extract the text from the PDF page.
The functions are called within a For Each pdfPage as Page in pdfDocument.Pages loop.
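For comparison, a minimal C# sketch of the TextAbsorber approach that is reported above to return text. The attached test code is not shown in the thread, so this is an assumed reconstruction of the general pattern rather than the poster's actual functions; the output file name is hypothetical.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class TextAbsorberSketch
{
    static void Main()
    {
        // Input path reused from the snippet earlier in this thread
        Document pdfDocument = new Document("c:/pdftest/MarkH.pdf");
        StringBuilder resultBuilder = new StringBuilder();

        foreach (Page pdfPage in pdfDocument.Pages)
        {
            // A fresh absorber per page keeps each page's text separate
            TextAbsorber absorber = new TextAbsorber();
            pdfPage.Accept(absorber);
            resultBuilder.Append(absorber.Text);
        }

        // Hypothetical output file name, chosen for this sketch
        File.WriteAllText("c:/pdftest/MarkH_absorber_output.txt", resultBuilder.ToString());
    }
}
```

Running both approaches on the same page and comparing the output is one way to confirm whether only the TextDevice path returns blank text.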
Hi Mark,
Private Function extractMemberInfo(ByVal pdfpage As Page) As String
    Dim resultBuilder As System.Text.StringBuilder = New System.Text.StringBuilder()
    Dim extractedText As String = String.Empty
    Using textStream As New MemoryStream()
        ' Extract the page text in Pure formatting mode
        Dim textDevice As New TextDevice()
        Dim textExtOptions As New TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure)
        textDevice.ExtractionOptions = textExtOptions
        textDevice.Process(pdfpage, textStream)
        ' ToArray() still returns the buffer after the stream is closed
        textStream.Close()
        extractedText = System.Text.Encoding.UTF8.GetString(textStream.ToArray())
    End Using
    resultBuilder.Append(extractedText)
    ' Note: this file is overwritten on every call, so it holds only the last page processed
    File.WriteAllText("c:/pdftest/MarkH_output_VB.txt", resultBuilder.ToString())
    Return extractedText
End Function