Text extracted from PDF has incorrect spacing

karthi988 · February 8, 2021, 5:37pm

Hi,
When I extract text from the PDF sample attached here, the extracted text has a spacing issue. Please let me know the reason for this issue and how to solve it.

The sample: testSample.pdf (120.6 KB)

The extractedText: page2_extracted.zip (1.8 KB)

This is page 2 of the sample being extracted.
Thank you.

asad.ali · February 9, 2021, 5:02am

@karthi988

We were able to replicate the issue in our environment while testing the scenario with Aspose.PDF for .NET 21.1. An issue has therefore been logged as PDFNET-49391 in our issue tracking system for investigation. We will look into its details and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

karthi988 · February 9, 2021, 5:04am

Hi @asad.ali,
I am currently using Aspose.PDF for Java 21.1. Can you please check with the Java version as well?
Thank you.

karthi988 · February 9, 2021, 10:42am

Hi,
I have attached another sample here for your reference. I am getting the same spacing issue with this sample too.
sample2.pdf (105.5 KB)
May I also know the estimated time for resolving this ticket, as we have a requirement for this?
Thank you.

asad.ali · February 9, 2021, 8:12pm

@karthi988

We have updated the ticket information and the new ID is PDFJAVA-40151.

Another ticket with the ID PDFJAVA-40152 has been logged for your second PDF file. We will investigate it as well and let you know as soon as it is resolved.

These tickets have been logged under the normal support model and will be investigated and fixed on a first-come, first-served basis. We are afraid that we cannot share a timeline at the moment, as other issues were logged before them. However, we will inform you as soon as we make definite progress towards resolving these tickets. Please give us some time.

We apologize for the inconvenience.

asad.ali · March 26, 2021, 12:16am

@karthi988

We have investigated both logged tickets and found the following:

This is not a bug but the default text extraction behavior, which applies some formatting routines when representing the PDF content as text (the default option is TextExtractionOptions.TextFormattingMode.Pure).
You can use the other option to get the text content as is, i.e. without formatting, by setting the following parameter:

// Java
textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Raw);
// .NET
textAbsorber.ExtractionOptions.FormattingMode = Pdf.Text.TextExtractionOptions.TextFormattingMode.Raw;

Alternatively, adjust the scale of the text to correct its appearance by reducing or enlarging ScaleFactor (the default scale is 1.0):

// Java
textAbsorber.getExtractionOptions().setScaleFactor(1.5); // this value produces fewer spaces in the text
// .NET
textAbsorber.ExtractionOptions.ScaleFactor = 1.5;
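
To illustrate why a layout-preserving mode can produce more spaces than a raw mode, here is a minimal toy model in plain Java. It is only a sketch and not Aspose.PDF's actual algorithm: the class LayoutExtractionSketch, the Fragment type, and the pointsPerColumn parameter are all hypothetical names invented for this illustration. The assumption is that a formatting mode pads each text fragment out to a character column derived from its x-position on the page, while a raw mode simply joins fragments in reading order.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: every name here is hypothetical and unrelated to the Aspose.PDF API.
final class LayoutExtractionSketch {

    /** Hypothetical text fragment: its text and left x-coordinate in PDF points. */
    static final class Fragment {
        final String text;
        final double x;

        Fragment(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    /** Raw-style output: fragments in reading order, joined by a single space. */
    static String raw(List<Fragment> line) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /**
     * Layout-preserving output: each fragment is padded with spaces so that it
     * starts at character column round(x / pointsPerColumn), with at least one
     * space between consecutive fragments. A larger pointsPerColumn (assumed
     * here to play the role of a scale setting) yields fewer padding spaces.
     */
    static String formatted(List<Fragment> line, double pointsPerColumn) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            int column = (int) Math.round(f.x / pointsPerColumn);
            int minStart = sb.length() == 0 ? 0 : sb.length() + 1;
            int target = Math.max(column, minStart);
            while (sb.length() < target) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = Arrays.asList(
                new Fragment("Name", 0),
                new Fragment("Value", 120));
        System.out.println("[" + raw(line) + "]");
        System.out.println("[" + formatted(line, 6.0) + "]");
        System.out.println("[" + formatted(line, 12.0) + "]");
    }
}
```

In this toy model, with 6 points per column the word "Value" starts at column 20 (16 padding spaces after "Name"), while with 12 points per column it starts at column 10 (6 padding spaces). The raw variant always uses a single space. Whether Aspose.PDF's ScaleFactor works this way internally is an assumption; the sketch only shows the general idea that position-based padding depends on a scale.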