In the attached document:
4411_NOK.pdf (114.5 KB)
If I extract text after saving the document as SVG, some characters are replaced by others.
It reads: “En utilisant nos services ou en vous inscrivant, vous
accepteB ces Conditions ,./tilisation.”
It should be: “En utilisant nos services ou en vous inscrivant, vous
acceptez ces Conditions d’Utilisation.”
In the following sample, if I do not comment out the SVG output part, the text extraction is wrong.
public static void Run() // original
{
    string dataDir = RunExamples.GetDataDir_AsposePdf_Text();
    Document doc = new Document(dataDir + "4411_NOK.pdf");
    { // Comment out this block and the extraction is correct
        SvgSaveOptions saveOptions = new SvgSaveOptions();
        saveOptions.CompressOutputToZipArchive = false;
        try
        {
            using var memStream = new MemoryStream();
            doc.Save(memStream, saveOptions);
        }
        catch
        {
        }
    }
    Page page = doc.Pages[1];
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(page);
    absorber.IsMulticolumnParagraphsAllowed = true;
    foreach (PageMarkup markup in absorber.PageMarkups)
    {
        markup.IsMulticolumnParagraphsAllowed = true;
        foreach (MarkupSection section in markup.Sections)
        {
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                StringBuilder paragraphText = new StringBuilder();
                foreach (List<TextFragment> line in paragraph.Lines)
                {
                    foreach (TextFragment fragment in line)
                    {
                        paragraphText.Append(fragment.Text);
                    }
                    paragraphText.Append("\r\n");
                }
                paragraphText.Append("\r\n");
                Console.WriteLine(paragraphText.ToString());
            }
        }
    }
}
              @louis.a
The issue you’re experiencing with text extraction after saving the PDF as SVG is likely due to how the SVG format handles fonts and text rendering. When converting a PDF to SVG, certain characters may not be represented correctly, resulting in substitutions in the extracted text.
Cause: The problem may stem from improper font embedding in the PDF, which affects how text is rendered in the SVG format.
Solution: Here are several steps to troubleshoot and resolve this issue:
- Check Font Embedding: Ensure that the fonts used in your PDF are properly embedded. If the fonts are not embedded, the SVG conversion may not render them correctly, leading to character substitutions.
- Use Different SVG Save Options: Experiment with various settings in the SvgSaveOptions. For instance, try enabling or disabling compression or adjusting the SvgExternalImageType to see if it affects the output.
- Text Extraction Options: When extracting text, use the TextExtractionOptions class to specify how text should be extracted. Setting the TextFormattingMode to Raw or Pure may improve the accuracy of the extracted text.
- Test with Other Documents: Save and extract text from other PDF documents to determine whether the issue is specific to this document or occurs across multiple files.
- Update Aspose.PDF: Ensure you are using the latest version of Aspose.PDF, as updates often include bug fixes and improvements that could resolve your issue.
Next Steps: Implement the suggested troubleshooting steps and test the results. Here’s a minimal code snippet to illustrate how you might set the SvgSaveOptions and extract text:
Document doc = new Document("4411_NOK.pdf");
// Save as SVG
SvgSaveOptions saveOptions = new SvgSaveOptions();
saveOptions.CompressOutputToZipArchive = false; // Adjust as needed
doc.Save("output.svg", saveOptions);
// Extract text
TextAbsorber absorber = new TextAbsorber();
doc.Pages.Accept(absorber);
string extractedText = absorber.Text;
Console.WriteLine(extractedText);
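The font-embedding check and the Raw extraction mode mentioned in the steps above could be sketched as follows. This is a minimal sketch, not a confirmed fix: the class name ExtractionChecks and the pdfPath parameter are hypothetical, and it assumes that doc.FontUtilities.GetAllFonts(), Font.IsEmbedded and the TextExtractionOptions constructor behave this way in the Aspose.PDF version in use.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper class, for illustration only.
public static class ExtractionChecks
{
    public static void Run(string pdfPath)
    {
        using Document doc = new Document(pdfPath);

        // List every font the document references and whether it is embedded.
        foreach (Aspose.Pdf.Text.Font font in doc.FontUtilities.GetAllFonts())
        {
            Console.WriteLine($"{font.FontName}: embedded = {font.IsEmbedded}");
        }

        // Extract the first page in Raw formatting mode (assumed to be the
        // mode referred to in the "Text Extraction Options" step).
        TextAbsorber absorber = new TextAbsorber(
            new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
        doc.Pages[1].Accept(absorber);
        Console.WriteLine(absorber.Text);
    }
}
```

Calling ExtractionChecks.Run with the path to 4411_NOK.pdf prints the font list followed by the raw text of page 1, which can then be compared with the expected sentence.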
Please let me know if these steps help or if you require further assistance!
No, it doesn’t help.
The problem is that, whatever options I use, an SVG export should have no effect on a text extraction I perform afterwards. These are two unrelated operations (or at least they should be).
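One way to check that expectation is to extract the same page before and after the SVG save on a single Document instance and compare the two results. The sketch below is only an illustration: SideEffectCheck, ExtractPageText and pdfPath are hypothetical names, and it uses TextAbsorber rather than the ParagraphAbsorber from the original sample. Note that extracting before the save may itself change what the library caches, so a matching result here would not prove that the original sample is unaffected.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper class, for illustration only.
public static class SideEffectCheck
{
    // Extracts the text of one page with a fresh TextAbsorber.
    private static string ExtractPageText(Document doc, int pageNumber)
    {
        TextAbsorber absorber = new TextAbsorber();
        doc.Pages[pageNumber].Accept(absorber);
        return absorber.Text;
    }

    public static void Run(string pdfPath)
    {
        using Document doc = new Document(pdfPath);

        string before = ExtractPageText(doc, 1);

        // Same SVG export as in the original sample, written to memory.
        SvgSaveOptions saveOptions = new SvgSaveOptions();
        saveOptions.CompressOutputToZipArchive = false;
        using (MemoryStream svgStream = new MemoryStream())
        {
            doc.Save(svgStream, saveOptions);
        }

        string after = ExtractPageText(doc, 1);

        Console.WriteLine(before == after
            ? "Extraction is unchanged by the SVG export."
            : "Extraction differs after the SVG export.");
    }
}
```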
              @louis.a
Have you tried re-initializing the Document after saving as SVG? Does this resolve the issue you are facing? Please share if re-initializing doesn’t suit you. We will try to investigate and address the issue accordingly.
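As a rough sketch of what re-initializing could look like, the file can be read once and a separate Document instance created for each operation, so that the SVG export never touches the instance used for extraction. The names ReinitializeWorkaround and pdfPath are hypothetical, and TextAbsorber stands in for the ParagraphAbsorber loop from the original sample; this is a possible workaround under those assumptions, not a confirmed fix.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper class, for illustration only.
public static class ReinitializeWorkaround
{
    public static void Run(string pdfPath)
    {
        // Read the file once so both Document instances share the same input.
        byte[] pdfBytes = File.ReadAllBytes(pdfPath);

        // First instance: used only for the SVG export, then disposed.
        using (MemoryStream exportInput = new MemoryStream(pdfBytes))
        using (Document exportDoc = new Document(exportInput))
        using (MemoryStream svgStream = new MemoryStream())
        {
            SvgSaveOptions saveOptions = new SvgSaveOptions();
            saveOptions.CompressOutputToZipArchive = false;
            exportDoc.Save(svgStream, saveOptions);
        }

        // Second, fresh instance: used only for text extraction.
        using MemoryStream extractInput = new MemoryStream(pdfBytes);
        using Document extractDoc = new Document(extractInput);

        TextAbsorber absorber = new TextAbsorber();
        extractDoc.Pages[1].Accept(absorber);
        Console.WriteLine(absorber.Text);
    }
}
```

Loading the document twice costs extra memory and parsing time, so this only makes sense as a stopgap while the side effect of the SVG export is being investigated.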