In this attached document
4411_NOK.pdf (114.5 KB)
If I try to do text extraction after saving the doc as SVG, some characters are substituted by others.
It reads : “En utilisant nos services ou en vous inscrivant, vous
accepteB ces Conditions ,./tilisation.”
It should be : “En utilisant nos services ou en vous inscrivant, vous
acceptez ces Conditions d’Utilisation.”
In the following sample, if I do not comment out the SVG output part, the text extraction is wrong.
public static void Run() // original
{
    string dataDir = RunExamples.GetDataDir_AsposePdf_Text();
    Document doc = new Document(dataDir + "4411_NOK.pdf");
    { // Comment this and the extraction is correct
        SvgSaveOptions saveOptions = new SvgSaveOptions();
        saveOptions.CompressOutputToZipArchive = false;
        try
        {
            using var memStream = new MemoryStream();
            doc.Save(memStream, saveOptions);
        }
        catch
        {
        }
    }
    Page page = doc.Pages[1];
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(page);
    absorber.IsMulticolumnParagraphsAllowed = true;
    foreach (PageMarkup markup in absorber.PageMarkups)
    {
        markup.IsMulticolumnParagraphsAllowed = true;
        foreach (MarkupSection section in markup.Sections)
        {
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                StringBuilder paragraphText = new StringBuilder();
                foreach (List<TextFragment> line in paragraph.Lines)
                {
                    foreach (TextFragment fragment in line)
                    {
                        paragraphText.Append(fragment.Text);
                    }
                    paragraphText.Append("\r\n");
                }
                paragraphText.Append("\r\n");
                Console.WriteLine(paragraphText.ToString());
            }
        }
    }
}
@louis.a
The issue you’re experiencing with text extraction after saving the PDF as SVG is likely due to how the SVG format handles fonts and text rendering. When converting a PDF to SVG, certain characters may not be represented correctly, resulting in substitutions in the extracted text.
Cause: The problem may stem from improper font embedding in the PDF, which affects how text is rendered in the SVG format.
Solution: Here are several steps to troubleshoot and resolve this issue:
- Check Font Embedding: Ensure that the fonts used in your PDF are properly embedded. If the fonts are not embedded, the SVG conversion may not render them correctly, leading to character substitutions.
- Use Different SVG Save Options: Experiment with various settings in the SvgSaveOptions. For instance, try enabling or disabling compression or adjusting the SvgExternalImageType to see if it affects the output.
- Text Extraction Options: When extracting text, utilize the TextExtractionOptions class to specify how text should be extracted. Setting the TextFormattingMode to Raw or Pure may improve the accuracy of the extracted text.
- Test with Other Documents: Save and extract text from other PDF documents to determine if the issue is specific to the document you’re working with or if it occurs across multiple files.
- Update Aspose.PDF: Ensure you are using the latest version of Aspose.PDF, as updates often include bug fixes and improvements that could resolve your issue.
Next Steps: Implement the suggested troubleshooting steps and test the results. Here’s a minimal code snippet to illustrate how you might set the SvgSaveOptions and extract text:
Document doc = new Document("4411_NOK.pdf");
// Save as SVG
SvgSaveOptions saveOptions = new SvgSaveOptions();
saveOptions.CompressOutputToZipArchive = false; // Adjust as needed
doc.Save("output.svg", saveOptions);
// Extract text
TextAbsorber absorber = new TextAbsorber();
doc.Pages.Accept(absorber);
string extractedText = absorber.Text;
Console.WriteLine(extractedText);
Please let me know if these steps help or if you require further assistance!
No, it doesn’t help.
The problem is that, whatever options I use, an SVG export should have no effect on a text extraction I do afterwards. Those are two unrelated operations (or at least they should be).
@louis.a
Have you tried re-initializing the Document after saving as SVG? Does this resolve the issue you are facing? Please share if re-initializing doesn’t suit you. We will try to investigate and address the issue accordingly.
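For reference, re-initializing could look roughly like the sketch below. It is only an illustration of the workaround, not a confirmed fix: it assumes the side effect is tied to the Document instance used for the SVG export, and it uses one instance for the export and a second, freshly loaded instance for the extraction. The class name ReinitializeAfterSvgExport and the pdfPath parameter are hypothetical; the Aspose.PDF calls are the same ones used in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ReinitializeAfterSvgExport // hypothetical name
{
    // pdfPath is a placeholder, e.g. the full path to 4411_NOK.pdf
    public static void Run(string pdfPath)
    {
        // First instance: used only for the SVG export, then disposed.
        using (Document svgSource = new Document(pdfPath))
        {
            SvgSaveOptions saveOptions = new SvgSaveOptions();
            saveOptions.CompressOutputToZipArchive = false;
            using MemoryStream memStream = new MemoryStream();
            svgSource.Save(memStream, saveOptions);
        }

        // Second instance: loaded again from disk, so it shares no state
        // with the instance that was saved as SVG (assumption of this sketch).
        using Document doc = new Document(pdfPath);
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(doc.Pages[1]);

        foreach (PageMarkup markup in absorber.PageMarkups)
        {
            foreach (MarkupSection section in markup.Sections)
            {
                foreach (MarkupParagraph paragraph in section.Paragraphs)
                {
                    StringBuilder paragraphText = new StringBuilder();
                    foreach (List<TextFragment> line in paragraph.Lines)
                    {
                        foreach (TextFragment fragment in line)
                        {
                            paragraphText.Append(fragment.Text);
                        }
                        paragraphText.Append("\r\n");
                    }
                    Console.WriteLine(paragraphText.ToString());
                }
            }
        }
    }
}
```

If the extracted text is correct with the second instance but still wrong when a single instance is reused, that would point to the SVG export modifying the in-memory document.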