Aspose PDF reading special characters

murrman95 · May 25, 2020, 11:58am

Hello. For a project I’m working on, we’re extracting text blocks from PDFs using the ParagraphAbsorber class. We’re encountering a problem where special characters are read as NULL (\u0000). For instance, in 14699080.pdf (419.8 KB), the title “Impaired Development of CD4+ CD25+ Regulatory T Cells in the Absence of STAT1” has its ‘+’ characters read as null.

Here is my code sample.
I’m using Aspose.PDF for .NET 18.5.
The JSON library I’m using is Newtonsoft.Json.Linq.

// Requires: using Aspose.Pdf.Text; using Newtonsoft.Json.Linq;
private JArray getCharSequence(MarkupParagraph markupParagraph)
{
    JArray array = new JArray();
    foreach (TextFragment textFragment in markupParagraph.Fragments)
    {
        foreach (TextSegment textSegment in textFragment.Segments)
        {
            string text = textSegment.Text;
            for (int i = 0; i < textSegment.Characters.Count; i++)
            {
                // Characters is read at i + 1, while the string is read at i.
                CharInfo cInfo = textSegment.Characters[i + 1];
                JArray charInfo = new JArray();
                charInfo.Add(new JValue(cInfo.Position.YIndent));
                charInfo.Add(new JValue(cInfo.Position.XIndent));
                charInfo.Add(new JValue(cInfo.Rectangle.Height));
                charInfo.Add(new JValue(cInfo.Rectangle.Width));

                // This is where we are adding the character
                charInfo.Add(new JValue(text[i]));
                array.Add(charInfo);
            }
        }
    }
    return array;
}

Please let me know what else you need to know to help. I’ve been looking through the documentation for anything about text encoding settings for Aspose, but nothing has shown up so far.
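
To pin down exactly which positions come back as null, a small check along these lines may help. This is only a sketch: NullCharCheck and FindNullChars are hypothetical names, not part of Aspose, and the helper assumes nothing beyond a plain string such as the value of textSegment.Text.

```csharp
using System;
using System.Collections.Generic;

static class NullCharCheck
{
    // Hypothetical helper: returns the zero-based indices at which
    // the given text contains the null character (U+0000).
    public static List<int> FindNullChars(string text)
    {
        var indices = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\0')
            {
                indices.Add(i);
            }
        }
        return indices;
    }
}
```

Calling NullCharCheck.FindNullChars(textSegment.Text) inside the inner loop and logging any non-empty result would show whether the nulls are already present in the segment text before the JSON is built.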

asad.ali · May 26, 2020, 6:48pm

@murrman95

We tested the scenario in our environment using Aspose.PDF for .NET 20.5 and found that the special characters were missing from the console output. We have therefore logged the issue as PDFNET-48240 in our issue tracking system. We will investigate further and keep you posted on the status of the fix. Please bear with us in the meantime.

We are sorry for the inconvenience.