Aspose.PDF Comparison

hi

How can I extract true visual lines of text with Aspose.PDF when a line contains mixed font sizes, superscripts, or subscripts?

Detailed Query:
I am using Aspose.PDF for Java to extract text from a page, together with line information, for PDF comparison. Currently, I use TextFragmentAbsorber and group the TextFragment objects by their Y-coordinate. This approach fails when a single visual line contains mixed font sizes, superscripts, or subscripts, because those fragments have different Y values even though they belong to the same visual line.
Is there another supported way to reconstruct true visual text lines that is not affected by these font-size, superscript, or subscript shifts?

@sr2989

  • Are you processing text that flows horizontally, or do you also need to handle vertical or rotated text lines?
  • Do you need to preserve the exact visual positioning of superscripts/subscripts relative to baseline, or is grouping them with the main line sufficient?
  • Is the input PDF generated from a known source (e.g., Word, LaTeX) with predictable layout patterns?
  • Have you tried using TextAbsorber with TextExtractionOptions set to PreserveFormatting, and if so, did it meet your needs?
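
Independently of the answers to the questions above, one general way to make Y-based grouping tolerant of baseline shifts is to group fragments by vertical overlap of their bounding boxes instead of by exact Y value. The following is only a minimal sketch of that idea, not an Aspose.PDF feature: the `Box` class is a hypothetical stand-in for a fragment's text and rectangle (it assumes you can read each fragment's bounding rectangle), and the names `VisualLineGrouper`, `groupIntoLines`, and the 0.5 overlap threshold are illustrative assumptions. It also assumes horizontal, unrotated text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VisualLineGrouper {

    /** Hypothetical stand-in for one fragment: text plus bounding box (PDF units, Y grows upward). */
    static final class Box {
        final String text;
        final double llx, lly, urx, ury;

        Box(String text, double llx, double lly, double urx, double ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        double height() {
            return ury - lly;
        }
    }

    /** One visual line; its vertical band is fixed by the first (tallest) fragment placed in it. */
    private static final class Line {
        final List<Box> boxes = new ArrayList<>();
        final double bottom, top;

        Line(Box seed) {
            bottom = seed.lly;
            top = seed.ury;
            boxes.add(seed);
        }
    }

    /** Groups fragments into lines, top to bottom, each line read left to right. */
    static List<String> groupIntoLines(List<Box> fragments, double minOverlapRatio) {
        // Process tall fragments first so body text defines each line's band
        // before smaller superscripts and subscripts are attached to it.
        List<Box> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble(Box::height).reversed());

        List<Line> lines = new ArrayList<>();
        for (Box box : sorted) {
            Line best = null;
            double bestOverlap = 0.0;
            for (Line line : lines) {
                double overlap = Math.min(box.ury, line.top) - Math.max(box.lly, line.bottom);
                if (overlap > bestOverlap) {
                    bestOverlap = overlap;
                    best = line;
                }
            }
            // Attach only if enough of this fragment's height lies inside the line's band.
            if (best != null && box.height() > 0 && bestOverlap / box.height() >= minOverlapRatio) {
                best.boxes.add(box);
            } else {
                lines.add(new Line(box));
            }
        }

        lines.sort(Comparator.comparingDouble((Line l) -> l.top).reversed());
        List<String> result = new ArrayList<>();
        for (Line line : lines) {
            line.boxes.sort(Comparator.comparingDouble((Box b) -> b.llx));
            StringBuilder sb = new StringBuilder();
            for (Box b : line.boxes) {
                sb.append(b.text); // no gap-based spacing; that would need extra logic
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Box> fragments = List.of(
                new Box("E = mc", 10, 100, 50, 112),
                new Box("2", 50, 107, 54, 114),   // superscript, raised above the body text
                new Box("H", 10, 80, 18, 92),
                new Box("2", 18, 77, 22, 84),     // subscript, lowered below the body text
                new Box("O", 22, 80, 30, 92));
        groupIntoLines(fragments, 0.5).forEach(System.out::println);
    }
}
```

With the sample boxes in `main`, the raised and lowered "2" fragments are attached to the body text they overlap, so the sketch yields two lines, "E = mc2" and "H2O". The threshold would likely need tuning per document, and tightly spaced lines or rotated text would need additional handling.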

Hi, we tried using the text absorber with the text extraction options set to preserve formatting. However, the extracted plain text overlaps with the text in tables, so the comparison produces duplicate results, as in the example below. How can this be resolved?

@sr2989

If possible, would you please share your sample source and output PDFs, along with the complete sample code snippet, so that we can replicate the issue in our environment and address it accordingly?