Problem Reading PDF containing TAB characters

louis.a · April 15, 2025, 10:40am

When working with the attached document, I extract text positions in a way similar to your sample code “HighlightCharacterInPDF”:

    for (int charNum = 1; charNum <= segment.Characters.Count; charNum++)
    {
        CharInfo characterInfo = segment.Characters[charNum];

        Aspose.Pdf.Rectangle rect = page.GetPageRect(true);
        Console.WriteLine("TextFragment = " + textFragment.Text + "    Page URY = " + rect.URY +
                          "   TextFragment URY = " + textFragment.Rectangle.URY);

        gr.DrawRectangle(
            Pens.Black,
            (float)characterInfo.Rectangle.LLX,
            (float)characterInfo.Rectangle.LLY,
            (float)characterInfo.Rectangle.Width,
            (float)characterInfo.Rectangle.Height);
    }

I get the position of all characters, including the TAB (\011) character.
The problem is that Segment.Text doesn’t contain the TAB character or any placeholder for it. So the CharInfo collection and the characters in Text are out of step, and I can only work out which CharInfo corresponds to which character if I already know those tabs exist.

Also, if I try to convert this document to SVG, the output SVG contains NULL characters and can’t be opened in any viewer:

    public class PDFToSVG
    {
        public static void Run()
        {
            // ExStart:PDFToSVG
            // The path to the documents directory.
            string dataDir = RunExamples.GetDataDir_AsposePdf_DocumentConversion();

            // Load PDF document
            Document doc = new Document(dataDir + "cotiz_anonyme.pdf");
            // Instantiate an object of SvgSaveOptions
            SvgSaveOptions saveOptions = new SvgSaveOptions();
            // Do not compress SVG image to Zip archive
            saveOptions.CompressOutputToZipArchive = false;
            // Save the output in SVG files
            doc.Save(dataDir + "PDFToSVG_out.svg", saveOptions);
            // ExEnd:PDFToSVG
        }
    }

PDFToSVG_out.zip (838.3 KB)
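
Until the conversion itself is fixed, one possible stopgap is to post-process the generated file and drop every character that XML 1.0 does not allow, which includes U+0000. The sketch below is not part of the Aspose API: the class name `SvgSanitizer` and both method names are hypothetical, and it assumes the SVG contains raw NULL characters rather than numeric references such as `&#0;`. It only makes the file parseable; it does not restore the missing tab.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper (not an Aspose API): removes characters that XML 1.0 forbids.
public static class SvgSanitizer
{
    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF and #xE000-#xFFFD in the BMP.
    private static bool IsAllowedBmpChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of the input without NULL and other forbidden characters.
    public static string StripInvalidXmlChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed surrogate pair encodes #x10000-#x10FFFF, which is allowed.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (IsAllowedBmpChar(c))
            {
                sb.Append(c);
            }
            // Anything else (NULL, other control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }

    // Reads an SVG file as UTF-8 (an assumption) and writes a cleaned copy without a BOM.
    public static void SanitizeFile(string inputPath, string outputPath)
    {
        string svg = File.ReadAllText(inputPath, Encoding.UTF8);
        File.WriteAllText(outputPath, StripInvalidXmlChars(svg), new UTF8Encoding(false));
    }
}
```

It could be called right after `doc.Save(...)`, for example `SvgSanitizer.SanitizeFile(dataDir + "PDFToSVG_out.svg", dataDir + "PDFToSVG_clean.svg");`, where the second file name is only an illustration.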

sergei.shibanov · April 15, 2025, 3:07pm

@louis.a
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-59724

sergei.shibanov · April 15, 2025, 3:08pm

@louis.a
Task PDFNET-59724 has been created to fix the PDF → SVG conversion.

sergei.shibanov · April 15, 2025, 3:26pm

@louis.a
Regarding the first part of the query: I used the following code to determine if Tab was present in the found text.

// Load the document and scan every page (Aspose page collections are 1-based)
var pdfDocument = new Document(dataDir + "cotiz_anonyme.pdf");
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    Page page = pdfDocument.Pages[i];
    // Note: \S+ matches runs of non-whitespace only, so a matched fragment
    // can never contain '\t', whatever the page content is
    var textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
    textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
    page.Accept(textFragmentAbsorber);

    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

    // Report every fragment whose extracted text contains a Tab
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        if (textFragment.Text.Contains('\t'))
        {
            Console.WriteLine(textFragment.Text + " contains Tab");
        }
    }
}

According to this, the Tab is missing from the found text. Please explain in more detail if I have misunderstood your query.

It would be better and more convenient to separate different questions into different topics.

louis.a · April 15, 2025, 3:46pm

Yes, that’s exactly the problem: the tab is not in the extracted text. I only know there is a tab because I checked the operators:

BT
1 0 0 1 68.15 751 Tm
/F2 9 Tf
(\011\011Acerta sociaal verzekeringsf) Tj
ET

The \011 is ignored in the Text property and turned into 0 in the SVG, which causes a corrupted SVG.
I need both functions and just guessed the two problems have the same origin.
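
For reference, `\011` inside a PDF literal string is an octal escape, so each occurrence stands for one byte with value 9 (the ASCII horizontal tab). The sketch below decodes the escapes of a literal string body to make that visible. It is an illustration only: `PdfLiteralString` and `Decode` are hypothetical names, not part of Aspose.PDF, the input is assumed to be the text between the outer parentheses, and how the resulting bytes map to glyphs still depends on the encoding of the font (here /F2).

```csharp
using System.Collections.Generic;

// Hypothetical helper (not an Aspose API): decodes escapes in a PDF literal string body.
public static class PdfLiteralString
{
    public static byte[] Decode(string body)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c != '\\') { bytes.Add((byte)c); continue; }   // ordinary character
            if (++i >= body.Length) break;                      // trailing backslash: nothing to decode

            char e = body[i];
            if (e >= '0' && e <= '7')
            {
                // Octal escape: one to three octal digits give the byte value.
                int value = 0, digits = 0;
                while (digits < 3 && i < body.Length && body[i] >= '0' && body[i] <= '7')
                {
                    value = value * 8 + (body[i] - '0');
                    i++;
                    digits++;
                }
                i--;                                            // the for loop advances i again
                bytes.Add((byte)(value & 0xFF));                // high-order overflow is ignored
                continue;
            }

            switch (e)
            {
                case 'n': bytes.Add(0x0A); break;               // line feed
                case 'r': bytes.Add(0x0D); break;               // carriage return
                case 't': bytes.Add(0x09); break;               // horizontal tab
                case 'b': bytes.Add(0x08); break;               // backspace
                case 'f': bytes.Add(0x0C); break;               // form feed
                case '\r':                                      // backslash + end of line: continuation
                    if (i + 1 < body.Length && body[i + 1] == '\n') i++;
                    break;
                case '\n': break;
                default: bytes.Add((byte)e); break;             // \( \) \\ and unknown escapes
            }
        }
        return bytes.ToArray();
    }
}
```

Calling `PdfLiteralString.Decode(@"\011\011Acerta sociaal verzekeringsf")` gives a byte array that starts with 0x09, 0x09, 0x41: two tab bytes followed by the "A" of "Acerta".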

sergei.shibanov · April 15, 2025, 5:31pm

@louis.a
The page operators do indeed contain \011. According to the ASCII table, octal 011 corresponds to a horizontal tab (0x09).
Tabulation in PDF looks ridiculous, in my opinion: PDF was created as a format with precise positioning.
Such characters may even be ignored by the format.
I can’t give you a precise answer yet; tomorrow I’ll check with the development team and write to you.

louis.a · April 16, 2025, 6:19am

Hello,

On this I totally agree.
If you solve the problem by totally ignoring this character in the SVG output and not returning its position in the CharInfo collection, I am fine with it.

Regards

sergei.shibanov · April 16, 2025, 6:50am

@louis.a
I have created a conversion task; we will wait for a fix from the development team.