I have attached two PDFs from a project: one that works correctly and one that does not. I stripped out any identifying information.
Broken.pdf (117.9 KB)
Works.pdf (229.6 KB)
Both PDFs are from the same project. When I run a text search on the files using TextFragmentAbsorber, I get different results. Works.pdf returns all of the file's text as expected: I can access all of the text, or search within a specific coordinate rectangle, and get the expected results. My current function searches for the drawing number in the bottom right, which for Works.pdf is “29-A-SECELE-01”. I can find that value using a rect of (2176.243200, 66.110400, 2395.159200, 103.240800).
If I run that same rect search on Broken.pdf, I get a garbage result: “63/1$”. The symbols don’t come through here, so I will post a screenshot.
SymbolCharacters.PNG (1.0 KB)
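For context, a rectangle-limited search of this kind is typically set up through the absorber's TextSearchOptions. The following is a minimal sketch, not the exact project code: the class name RectSearchSketch, the method name FindTextInRect, and the choice of page 1 are hypothetical, and the coordinates are the ones quoted above.

```csharp
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class RectSearchSketch   // hypothetical helper class
{
    // Returns the text of all fragments found inside the given rectangle on page 1.
    public static string FindTextInRect(string filePath)
    {
        using (Document pdfDocument = new Document(filePath))
        {
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();

            // Restrict the search to the drawing-number area (llx, lly, urx, ury).
            absorber.TextSearchOptions.LimitToPageBounds = true;
            absorber.TextSearchOptions.Rectangle =
                new Aspose.Pdf.Rectangle(2176.2432, 66.1104, 2395.1592, 103.2408);

            // Assumption: the drawing number is on the first page.
            pdfDocument.Pages[1].Accept(absorber);

            StringBuilder result = new StringBuilder();
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                result.Append(fragment.Text);
            }
            return result.ToString();
        }
    }
}
```

With Works.pdf, a call such as RectSearchSketch.FindTextInRect(path) would be expected to return the drawing number; with Broken.pdf the same call returns the symbols shown in the screenshot.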
I have also tried exporting all TextFragmentAbsorber text on the page, which yields a similar result. A few pieces of text come through in the middle, but most of it is garbage.
Here is a partial view of the text export of Broken.pdf, showing the text “Verify Scales” surrounded by other symbols:
VerifyScales.PNG (5.8 KB)
I have confirmed that the text in Broken.pdf is selectable and searchable in a PDF viewer. The only difference I have found between the two files is that, in the broken file, page.Rotate == Rotation.on90. I tried rotating the page before reading the text, both in an external PDF editor and in Aspose, with the same result.
I am using Aspose.PDF for .NET 23.10.0.0 on .NET Framework 4.8.1.
Here is the code I am using to export all text:
InitializeLicense();
// Open the document
using (Document pdfDocument = new Document(_filePath))
{
    string returnVal = string.Empty;
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    // Loop through all the pages
    foreach (Page page in pdfDocument.Pages)
    {
        page.Accept(absorber);
        // Log the page number and the text of each extracted fragment
        foreach (var fragment in absorber.TextFragments)
        {
            _logger.Log($"Page: {page.Number} \"{fragment.Text}\"");
        }
        // Clear the collected fragments before visiting the next page
        absorber.Reset();
    }
}
Is there something else I need to do to account for the page rotation? Or is there another reason TextFragmentAbsorber is not finding all the text in the file?