cotiz_anonyme.pdf (2.4 KB)
When working with the attached document, if I extract text positions in a way similar to your sample code “HighlightCharacterInPDF”:
for (int charNum = 1; charNum <= segment.Characters.Count; charNum++)
{
    CharInfo characterInfo = segment.Characters[charNum];
    Aspose.Pdf.Rectangle rect = page.GetPageRect(true);
    Console.WriteLine("TextFragment = " + textFragment.Text + " Page URY = " + rect.URY +
        " TextFragment URY = " + textFragment.Rectangle.URY);
    gr.DrawRectangle(
        Pens.Black,
        (float)characterInfo.Rectangle.LLX,
        (float)characterInfo.Rectangle.LLY,
        (float)characterInfo.Rectangle.Width,
        (float)characterInfo.Rectangle.Height);
}
I get the positions of all characters, including the TAB (\011) character.
The problem is that Segment.Text doesn’t contain the TAB character or any placeholder for it. So there is a gap between the CharInfo collection and the characters in Text, and I can only guess which CharInfo corresponds to which character if I am aware that those tabs exist.
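A minimal sketch of how this gap could be measured, assuming that every CharInfo entry would otherwise map to exactly one char in Text. The names `CharInfoGapCheck` and `CountUnmatched` are hypothetical helpers for illustration and are not part of Aspose.PDF:

```csharp
using System;

public static class CharInfoGapCheck
{
    // Hypothetical helper: returns how many CharInfo entries have no
    // counterpart in the text, assuming one entry per char otherwise.
    // A null text is treated as empty.
    public static int CountUnmatched(int charInfoCount, string text)
    {
        int textLength = text == null ? 0 : text.Length;
        return charInfoCount - textLength;
    }
}
```

Inside the loop context above it could be called as `CharInfoGapCheck.CountUnmatched(segment.Characters.Count, segment.Text)`. A positive result would be the number of CharInfo entries (such as the tabs) that have no matching character in Text, but it does not say where in the sequence they sit.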
Also, if I try to convert this document to SVG, the output SVG contains NULL characters and can’t be opened in any viewer:
public class PDFToSVG
{
    public static void Run()
    {
        // ExStart:PDFToSVG
        // The path to the documents directory.
        string dataDir = RunExamples.GetDataDir_AsposePdf_DocumentConversion();
        // Load PDF document
        Document doc = new Document(dataDir + "cotiz_anonyme.pdf");
        // Instantiate an object of SvgSaveOptions
        SvgSaveOptions saveOptions = new SvgSaveOptions();
        // Do not compress SVG image to Zip archive
        saveOptions.CompressOutputToZipArchive = false;
        // Save the output in SVG files
        doc.Save(dataDir + "PDFToSVG_out.svg", saveOptions);
        // ExEnd:PDFToSVG
    }
}
PDFToSVG_out.zip (838.3 KB)
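The NULL characters in the output can be confirmed independently of any viewer. The following is a rough sketch that counts zero bytes in a file, assuming the SVG is written as UTF-8 (where a 0x00 byte can only be the NUL character, which XML 1.0 does not allow). `SvgNullCheck` and its methods are hypothetical names, not part of Aspose.PDF:

```csharp
using System;
using System.IO;

public static class SvgNullCheck
{
    // Counts 0x00 bytes in a buffer. Assumes UTF-8 content, where a
    // zero byte always encodes the NUL character (U+0000).
    public static int CountNulBytes(byte[] data)
    {
        int count = 0;
        foreach (byte b in data)
        {
            if (b == 0)
            {
                count++;
            }
        }
        return count;
    }

    // Reads the whole file into memory and counts its 0x00 bytes.
    public static int CountNulBytesInFile(string path)
    {
        return CountNulBytes(File.ReadAllBytes(path));
    }
}
```

For example, `SvgNullCheck.CountNulBytesInFile(dataDir + "PDFToSVG_out.svg")` returning a value greater than zero would indicate that the file is not well-formed XML.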