Saving as SVG messes up text extraction in this document

In this attached document
4411_NOK.pdf (114.5 KB)

If I try to do text extraction after saving the doc as SVG, some characters are substituted by others.

It reads: “En utilisant nos services ou en vous inscrivant, vous
accepteB ces Conditions ,./tilisation.”
It should be: “En utilisant nos services ou en vous inscrivant, vous
acceptez ces Conditions d’Utilisation.”

In the following sample, if I do not comment out the SVG output part, the text extraction is wrong.

public static void Run() // original
{
    string dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    Document doc = new Document(dataDir + "4411_NOK.pdf");

    { // Comment this and the extraction is correct
        SvgSaveOptions saveOptions = new SvgSaveOptions();
        saveOptions.CompressOutputToZipArchive = false;
        try
        {
            using var memStream = new MemoryStream();
            doc.Save(memStream, saveOptions);
        }
        catch
        {
        }
    }

    Page page = doc.Pages[1];

    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(page);

    absorber.IsMulticolumnParagraphsAllowed = true;

    foreach (PageMarkup markup in absorber.PageMarkups)
    {
        markup.IsMulticolumnParagraphsAllowed = true;

        foreach (MarkupSection section in markup.Sections)
        {
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                StringBuilder paragraphText = new StringBuilder();

                foreach (List<TextFragment> line in paragraph.Lines)
                {
                    foreach (TextFragment fragment in line)
                    {
                        paragraphText.Append(fragment.Text);
                    }
                    paragraphText.Append("\r\n");
                }
                paragraphText.Append("\r\n");

                Console.WriteLine(paragraphText.ToString());
            }
        }
    }
}

@louis.a

The issue you’re experiencing with text extraction after saving the PDF as SVG is likely due to how the SVG format handles fonts and text rendering. When converting a PDF to SVG, certain characters may not be represented correctly, resulting in substitutions in the extracted text.

Cause: The problem may stem from improper font embedding in the PDF, which affects how text is rendered in the SVG format.

Solution: Here are several steps to troubleshoot and resolve this issue:

  1. Check Font Embedding: Ensure that the fonts used in your PDF are properly embedded. If the fonts are not embedded, the SVG conversion may not render them correctly, leading to character substitutions.

  2. Use Different SVG Save Options: Experiment with various settings in the SvgSaveOptions. For instance, try enabling or disabling compression or adjusting the SvgExternalImageType to see if it affects the output.

  3. Text Extraction Options: When extracting text, utilize the TextExtractionOptions class to specify how text should be extracted. Setting the TextFormattingMode to Raw or Pure may improve the accuracy of the extracted text.

  4. Test with Other Documents: Save and extract text from other PDF documents to determine if the issue is specific to the document you’re working with or if it occurs across multiple files.

  5. Update Aspose.PDF: Ensure you are using the latest version of Aspose.PDF, as updates often include bug fixes and improvements that could resolve your issue.
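
Steps 1 and 3 could be checked with something like the sketch below. It is only an illustration: it assumes Aspose.PDF for .NET is referenced, and that `Document.FontUtilities.GetAllFonts()`, `Font.IsEmbedded`, and `TextExtractionOptions.TextFormattingMode.Raw` are available in your version as described in the Aspose API reference. The class name and the `pdfPath` parameter are hypothetical.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class FontAndRawExtractionCheck
{
    // pdfPath is a hypothetical parameter, e.g. the path to 4411_NOK.pdf.
    public static void Run(string pdfPath)
    {
        using (Document doc = new Document(pdfPath))
        {
            // Step 1: list every font used by the document and whether it is embedded.
            foreach (Aspose.Pdf.Text.Font font in doc.FontUtilities.GetAllFonts())
            {
                Console.WriteLine($"{font.FontName}: embedded = {font.IsEmbedded}");
            }

            // Step 3: extract text in Raw formatting mode instead of the default mode.
            TextAbsorber absorber = new TextAbsorber();
            absorber.ExtractionOptions =
                new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
            doc.Pages.Accept(absorber);
            Console.WriteLine(absorber.Text);
        }
    }
}
```

If a font is reported as not embedded, that points toward step 1; if only the formatting mode changes the result, that points toward step 3.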

Next Steps: Implement the suggested troubleshooting steps and test the results. Here’s a minimal code snippet to illustrate how you might set the SvgSaveOptions and extract text:

Document doc = new Document("4411_NOK.pdf");

// Save as SVG
SvgSaveOptions saveOptions = new SvgSaveOptions();
saveOptions.CompressOutputToZipArchive = false; // Adjust as needed
doc.Save("output.svg", saveOptions);

// Extract text
TextAbsorber absorber = new TextAbsorber();
doc.Pages.Accept(absorber);
string extractedText = absorber.Text;
Console.WriteLine(extractedText);

Please let me know if these steps help or if you require further assistance!

No, it doesn’t help.

The problem is that, whatever options I use, an SVG export should have no effect on a text extraction I do afterwards. These are two unrelated operations (or at least they should be).

@louis.a

Have you tried re-initializing the Document after saving as SVG? Does this resolve the issue you are facing? Please share if re-initializing doesn’t suit you. We will try to investigate and address the issue accordingly.
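
As a rough sketch of what re-initializing could look like, the example below uses one `Document` instance only for the SVG export and a second, freshly loaded instance for text extraction. It reuses only calls that already appear in this thread; the class name and the `pdfPath` parameter are hypothetical, and whether this avoids the character substitution for 4411_NOK.pdf has not been verified here.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ReloadAfterSvgExport
{
    // pdfPath is a hypothetical parameter, e.g. dataDir + "4411_NOK.pdf".
    public static void Run(string pdfPath)
    {
        // First instance: used only for the SVG export, then disposed.
        using (Document svgSource = new Document(pdfPath))
        using (MemoryStream svgStream = new MemoryStream())
        {
            SvgSaveOptions saveOptions = new SvgSaveOptions
            {
                CompressOutputToZipArchive = false
            };
            svgSource.Save(svgStream, saveOptions);
        }

        // Second instance: loaded from the file again, so any state changed
        // by the export on the first instance is not carried over.
        using (Document textSource = new Document(pdfPath))
        {
            TextAbsorber absorber = new TextAbsorber();
            textSource.Pages.Accept(absorber);
            Console.WriteLine(absorber.Text);
        }
    }
}
```

This is a workaround rather than a fix: it does not explain why the export affects later extraction on the same instance.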