Text not extracted properly

Vijayalakshmisridharan · April 2, 2024, 10:06am

I am using Aspose.PDF to extract text from a PDF with TextAbsorber, but the extracted text contains a lot of unwanted text. I am attaching the extracted text and the original PDF. Below is the C# code being used.

using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;
using Microsoft.AspNetCore.Mvc;

private static string ExtractTextFromPDF([FromForm] FileUploadRequest convertRequest)
{
    try
    {
        if (convertRequest.FileData != null && convertRequest.FileData.Length > 0)
        {
            // Load the PDF from the uploaded bytes
            var pdfDocument = new Document(new MemoryStream(convertRequest.FileData));

            // Create TextAbsorber object to extract text
            var textAbsorber = new TextAbsorber();
            // Accept the absorber for all the pages
            pdfDocument.Pages.Accept(textAbsorber);
            // Get the extracted text
            var extractedText = textAbsorber.Text;
            return extractedText;
        }
        else
        {
            // No file data was supplied, so there is nothing to extract
            return string.Empty;
        }
    }
    catch (Exception)
    {
        // Rethrow unchanged so the caller sees the original stack trace
        throw;
    }
}
lowes-2022ar-full-report-4-6-23-final.pdf (3.2 MB)

extractedText.zip (115.3 KB)

asad.ali · April 2, 2024, 6:31pm

@Vijayalakshmisridharan

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56945

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

Vijayalakshmisridharan · April 5, 2024, 1:03pm

Hi team,
Is there any update on this? We have many PDFs with images from which text was extracted successfully, so why do we get so many junk characters in this PDF alone?

Is there any way or configuration to detect whether a PDF contains SVG or other graphical images, so that we can skip converting them? It would be helpful if you could reply soon.

asad.ali · April 6, 2024, 12:00am

@Vijayalakshmisridharan

The issue may be related to the encoding. However, we are afraid that we cannot comment much on the situation because the ticket has not yet been investigated. We will prioritize it on a first-come, first-served basis, and as soon as we have any updates, we will share them with you. Please be patient and allow us some time.

We are sorry for the inconvenience.

Vijayalakshmisridharan · April 25, 2024, 12:58pm

Hi,
I noticed that the original PDF itself contains some kind of non-printable characters that are causing the issue, something like this: "H5H9A9BHGC:H<9F9;=GHF5BH=B7@I898=BH<9:=@=B;F9:@97H. ". Does Aspose provide any built-in API to check whether a file is valid?
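
For anyone hitting the same symptom, a quick sanity check on the extracted string can flag this kind of output before it is used downstream. The sketch below is plain C# and does not call any Aspose API. The names `GarbledTextCheck`, `LetterRatio`, and `ShiftDecode`, and the offset of 44, are illustrative assumptions derived only from the sample quoted above. The quoted characters are in fact printable ASCII, and adding 44 to each character code happens to yield readable English. That is consistent with glyph codes being read without a proper Unicode mapping, although it does not prove that this is the cause.

```csharp
using System;
using System.Linq;

public static class GarbledTextCheck
{
    // Hypothetical helper: shift every non-whitespace character by a fixed offset.
    // The offset is an assumption found by inspecting the quoted sample only.
    public static string ShiftDecode(string text, int offset)
    {
        return new string(text
            .Select(c => char.IsWhiteSpace(c) ? c : (char)(c + offset))
            .ToArray());
    }

    // Hypothetical heuristic: fraction of non-whitespace characters that are letters.
    // Ordinary prose scores high; mis-mapped text full of digits and symbols scores low.
    public static double LetterRatio(string text)
    {
        char[] visible = text.Where(c => !char.IsWhiteSpace(c)).ToArray();
        if (visible.Length == 0)
        {
            return 1.0; // Treat empty input as "not suspicious"
        }
        return (double)visible.Count(c => char.IsLetter(c)) / visible.Length;
    }

    public static void Main()
    {
        // Sample copied from the post above (trailing period omitted)
        const string sample = "H5H9A9BHGC:H<9F9;=GHF5BH=B7@I898=BH<9:=@=B;F9:@97H";

        Console.WriteLine(LetterRatio(sample));

        // Prints "tatementsoftheregistrantincludedinthefilingreflect"
        Console.WriteLine(ShiftDecode(sample, 44));
    }
}
```

A low letter ratio on a page that visibly contains prose is only a hint that the text may be mis-mapped. Any threshold would need tuning per document set, since financial tables are naturally heavy in digits and symbols.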

asad.ali · April 25, 2024, 6:21pm

@Vijayalakshmisridharan

We believe that your concerns above are being addressed in the other thread that you started. If you need further information, you can continue the discussion there as well.