Hi Maria,
We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.4.0, we have managed to reproduce the reported issue and logged it in our bug tracking system as PDFNEWNET-37262 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.
Please feel free to contact us for any further assistance.
Best Regards,
I ran into the same issue with version 9.0.0: traversing Absorber.TextFragments returns fragments that contain the ‘\0’ character instead of a space. It appears that every fragment in the document has nulls in place of spaces, not just sporadic occurrences.
Sorry, I cannot submit a PDF sample because the information is sensitive. An image of the XHTML code with the exact fragment text is attached.
I looked for a correlation with the PDF production stack and PDF version (see the list below), but did not find one.
Are there any indications whether this is a PDF generation issue or an Aspose PDF parsing issue?
1.4 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: Neevia docuPrinter Pro v6.0 (http://neevia.com))
1.5 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: Neevia docuPrinter Pro v6.0 (http://neevia.com))
1.4 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: Amyuni PDF Converter version 4.5.2.5)
1.5 - PDFsharp 1.31.1789-g (www.pdfsharp.com) (Original: þÿ1 1.4 PDFsharp 1.31.1789-g (www.pdfsharp.com)
The reason you are getting the “\0” character is that the font defined in the document does not contain definitions for the characters in the extracted string. It is possible for us to remove this character during extraction, but the result would be the same as removing it yourself, and it would give you no performance benefit. It is better to remove it on your side, because stripping the “\0” character inside the library would break the rules and logical structure of content extraction.
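For anyone who wants to do that cleanup on their own side, here is a minimal sketch, assuming the extracted fragment text is available as a plain .NET string and that every ‘\0’ should become a space (as described earlier in the thread). The class and method names are hypothetical and not part of Aspose.Pdf.

```csharp
using System;

// Hypothetical helper, not part of Aspose.Pdf: cleans text after extraction.
public static class FragmentTextCleaner
{
    // Replaces every NUL character ('\0') with a space.
    // Assumption: in the affected documents a NUL always stands in for a space.
    public static string ReplaceNulls(string text)
    {
        // Pass null through unchanged so callers need no extra check.
        return text?.Replace('\0', ' ');
    }
}
```

You would call it on the text of each fragment while traversing the absorber's TextFragments collection, for example `FragmentTextCleaner.ReplaceNulls(fragment.Text)`, and work with the returned string rather than writing it back into the document.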