Hello, for a project I’m working on, we’re extracting text blocks from PDFs using the ParagraphAbsorber class. We’re running into a problem where some special characters are read as NULL (\u0000). For instance, in 14699080.pdf (419.8 KB), in the title “Impaired Development of CD4+ CD25+ Regulatory T Cells in the Absence of STAT1”, the ‘+’ characters are read as null.
I’m using Aspose.PDF for .NET, version 18.5. The JSON library is Newtonsoft.Json (the Newtonsoft.Json.Linq namespace).
Here is my code sample:
using Aspose.Pdf.Text;
using Newtonsoft.Json.Linq;

// Builds one [YIndent, XIndent, Height, Width, character] entry per character in the paragraph.
private JArray getCharSequence(MarkupParagraph markupParagraph)
{
    JArray array = new JArray();
    foreach (TextFragment textFragment in markupParagraph.Fragments)
    {
        foreach (TextSegment textSegment in textFragment.Segments)
        {
            string text = textSegment.Text;
            for (int i = 0; i < textSegment.Characters.Count; i++)
            {
                // Characters is accessed with a 1-based index here; the string is 0-based.
                CharInfo cInfo = textSegment.Characters[i + 1];
                JArray charInfo = new JArray();
                charInfo.Add(new JValue(cInfo.Position.YIndent));
                charInfo.Add(new JValue(cInfo.Position.XIndent));
                charInfo.Add(new JValue(cInfo.Rectangle.Height));
                charInfo.Add(new JValue(cInfo.Rectangle.Width));
                // This is where we are adding the character
                charInfo.Add(new JValue(text[i]));
                array.Add(charInfo);
            }
        }
    }
    return array;
}
Please let me know if you need any other information to help. I’ve looked through the documentation for text encoding settings in Aspose, but haven’t found anything so far.