Absorber is entering Hex (0x00) instead of space

cadoria · July 27, 2014, 5:16pm

Hello,

I'm using Aspose.Pdf (9.4.0). When reading the attached PDF, a hexadecimal 0x00 is inserted instead of a space at positions 572 and 573 while reading page 30.

This 0x00 (null) character causes problems when I insert the text into the database.

At the moment I have to scan the text byte by byte, look for 0x00, and remove it. This is very time-consuming and costly in processing.

The code snippet I'm using is:

textAbsorber = new TextAbsorber();
textAbsorber.TextSearchOptions.LimitToPageBounds = true;
textAbsorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, 15, 300, 787);
pdfDocument.Pages[page].Accept(textAbsorber);
linha = textAbsorber.Text;

To identify the 0x00 I used:

char[] arr = ((string)linha).ToCharArray();
for (int i = 0; i < arr.Length; i++) {
    if (arr[i] == '\0') {
        // a null character (0x00) was found at index i
        string x = "Err!! ";
    }
}

tilal.ahmad · July 29, 2014, 2:07am

Hi Maria, We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.4.0, we have managed to reproduce the reported issue and logged it in our bug tracking system as PDFNEWNET-37262 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.

Please feel free to contact us for any further assistance.

Best Regards,

mattshelton · August 1, 2014, 3:52pm

Ran into same issue with version 9.0.0: Absorber.TextFragments traversal returns the fragments containing ‘\0’ character instead of space. It appears that all fragments in the document have nulls instead of spaces, not just sporadic occurrences.

Sorry, cannot submit pdf sample due to info sensitivity. Image of xhtml code with exact fragment text is attached.

Investigated correlation to pdf production stack and pdf version (see list below), but didn’t see any direct correlation.

Are there any indications whether this might be pdf generation issue or Aspose pdf parsing issue?

1.4 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: Neevia docuPrinter Pro v6.0 (http://neevia.com))

1.5 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: Neevia docuPrinter Pro v6.0 (http://neevia.com))

1.4 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: Amyuni PDF Converter version 4.5.2.5)

1.5 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: þÿ1 1.4 PDFsharp 1.31.1789-g (www.pdfsharp.com)

codewarior · August 4, 2014, 6:12am

Hi Matt,

Thanks for contacting support.

The earlier reported problem is logged in our issue tracking system, and the development team will investigate it as per their schedule. However, for the particular scenario in which you are seeing this issue, please share the resource file so that we can investigate it separately. Please note that all documents, files, and resources shared by our customers are used only for testing purposes and are removed once the particular problem is resolved. If you are not comfortable sharing the document in this thread, you may send it to us directly. Please follow the instructions in How to send a license?

We are sorry for this inconvenience.

asad.ali · January 19, 2022, 8:13pm

@cadoria

The reason you are getting the “\0” character is that the font defined in the document does not contain a definition for those characters in the extracted string. It is possible for us to remove this symbol in the library, but the removal would work the same way as yours, so you would not gain any performance advantage. It is better to remove this symbol on your side, because removing the “\0” character inside the library would break the rules and logical structure of content extraction.
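
For anyone landing on this thread, here is a minimal sketch of the client-side cleanup suggested above. It replaces each null character with a space in one pass instead of walking the array manually. The class and method names (`NullCharCleaner`, `ReplaceNulls`) are hypothetical and not part of Aspose.Pdf; only the standard .NET `string.IndexOf(char)` and `string.Replace(char, char)` calls are used. It also assumes every “\0” really stands for a space, as reported in this thread, which may not hold for every document.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Aspose.Pdf API.
public static class NullCharCleaner
{
    // Returns the text with every U+0000 replaced by a space.
    // Assumption: each null stands in for a space, as described in this thread.
    public static string ReplaceNulls(string text)
    {
        // Skip the copy when there is nothing to clean.
        if (string.IsNullOrEmpty(text) || text.IndexOf('\0') < 0)
        {
            return text;
        }
        return text.Replace('\0', ' ');
    }
}
```

With the snippet from the first post, this would be applied as `linha = NullCharCleaner.ReplaceNulls(textAbsorber.Text);` before the value is written to the database.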