How to resolve conflict between TextAbsorber and TextFragmentAbsorber

jungsun.park · April 18, 2023, 7:53pm

Disclaimer: We don’t have access to the PDF files that cause this issue.

Hello Aspose,

We have an automated redaction tool for PDFs that builds a map of each character on a page. To do this, we extract text from a page using both TextAbsorber and TextFragmentAbsorber. We have a few reasons for extracting both ways; one is that TextFragment doesn’t carry information (e.g. coordinates) about the actual line on the PDF page that a word or character belongs to. This approach works well for most PDF files. However, for certain PDF files, the extracted text differs between the two absorb methods, and this causes an exception in our tool.

Here are the two ways we extract text:

1. Extract text using TextAbsorber

	private string GetPageText(Aspose.Pdf.Page page)
	{
		var absorber =
			new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
		page.Accept(absorber);
		return absorber.Text;
	}

2. Extract text using TextFragmentAbsorber

	private Queue<CharacterInfo> GetPageCharacterInformation(Aspose.Pdf.Page page)
	{
		var absorber = new TextFragmentAbsorber(new Regex("\\S+"))
		{
			ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving)
		};
		page.Accept(absorber);
		return new Queue<CharacterInfo>(GetCharacterInfoFromTextFragments(absorber.TextFragments));
	}

	private static IEnumerable<CharacterInfo> GetCharacterInfoFromTextFragments(TextFragmentCollection fragments)
	{
		return fragments
			.SelectMany(x => x.Segments)
			.SelectMany(SegmentToCharacterInfo);
	}

Essentially, the characters returned from GetPageText(page) don’t match the characters returned from GetPageCharacterInformation(page). We don’t have access to the PDF files that cause such conflicts, but we are aware that this issue happens frequently.

Is this a known issue at Aspose? Do you have any suggestions on how we can resolve it?

Thanks,
Jung

carlos.molina · April 18, 2023, 8:21pm

@jungsun.park,

This is not a known issue. It is probably a borderline line scenario if not a broken PDF.

I would suggest you try TextFragmentAbsorbers instead of just TextAbsorbers and see if it works any better.

If the issue persists, you will have to implement some sort of log to save the PDFs that are causing the issue so I can help you out.

I cannot write a code sample or open a ticket for the dev team without replicating this first.
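A minimal sketch of what such a log could look like, not an Aspose feature. It assumes the caller already has the TextAbsorber output and the concatenated TextFragmentAbsorber output as strings, and that comparing only non-whitespace characters is the right check (the "\\S+" regex in the original post never yields whitespace). The names ExtractionMismatchLog, FirstMismatch, LogIfMismatch and the mismatches.log file are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ExtractionMismatchLog
{
    // Returns the index (within the whitespace-stripped text) of the first
    // differing character, or -1 when both extractions agree.
    public static int FirstMismatch(string absorberText, string fragmentText)
    {
        string a = new string(absorberText.Where(c => !char.IsWhiteSpace(c)).ToArray());
        string b = new string(fragmentText.Where(c => !char.IsWhiteSpace(c)).ToArray());
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i]) return i;
        }
        // Same prefix: equal lengths mean a match, otherwise one side ran out.
        return a.Length == b.Length ? -1 : shared;
    }

    // On a mismatch, copies the offending PDF into logDirectory and appends
    // one line to mismatches.log. Returns true when a mismatch was recorded.
    public static bool LogIfMismatch(
        string pdfPath, int pageNumber,
        string absorberText, string fragmentText,
        string logDirectory)
    {
        int index = FirstMismatch(absorberText, fragmentText);
        if (index < 0) return false;

        Directory.CreateDirectory(logDirectory);
        string copyPath = Path.Combine(logDirectory, Path.GetFileName(pdfPath));
        File.Copy(pdfPath, copyPath, overwrite: true);
        File.AppendAllText(
            Path.Combine(logDirectory, "mismatches.log"),
            $"{DateTime.UtcNow:o}\t{pdfPath}\tpage {pageNumber}\t" +
            $"first mismatch at non-whitespace index {index}{Environment.NewLine}");
        return true;
    }
}
```

Calling LogIfMismatch once per page, before the character map is built, would leave a copy of each failing PDF plus the page number and mismatch position, which is the kind of sample a support ticket needs. Whether customer documents may be retained this way is a separate question for the tool's owners.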

Also, are you using the latest version 23.4?

jungsun.park · April 18, 2023, 8:29pm

@carlos.molina

Can you explain a bit more about a borderline line scenario? By any chance, do you have a sample PDF file that shows this?

For our specific needs, we need to use both TextFragmentAbsorbers and TextAbsorbers. The issue arises from a mismatch between them. I am not sure what you meant by “try TextFragmentAbsorbers instead of just TextAbsorbers”.

We are currently on version 21.1.0. Will upgrading to 23.4 help with this issue?

carlos.molina · April 18, 2023, 8:37pm

@jungsun.park,

I cannot explain it; it was a guess since, without the document, I cannot even run your code to try it out.

So if there is a mismatch between the results from the document you do not have, I would just keep the one that returns the most information.

What I am telling you is from a developer’s perspective, if I were in your shoes. From my perspective as a support member, there is not much I can do when the problem described cannot be replicated.

Since it cannot be replicated, I cannot tell you the new version will fix it, because we do not know what problem you are having.

But your version is over 2 years old, so I can tell you there are improvements in the new version. The Aspose.PDF API has a minor release every month and a major one every year, so you are far behind on updates.

You can request a temporary license and give it a test run before committing to a purchase. You can request a license here: Temporary License - Purchase - aspose.com

jungsun.park · April 18, 2023, 8:57pm

@carlos.molina

I thought you knew of a specific case when you said “borderline line scenario”, because you specifically mentioned it with “probably”. I understand you are guessing, and I am guessing as well. With what I have, I am trying to figure out whether there is a clear case where this conflict can occur, so that I can debug it and find a solution around it.

Understood. It really was a long shot, and you won’t be able to tell whether an upgrade will help with the situation. I was hoping you could search your internal system to find out whether any specific improvement has been made in the PDF text or text fragment area since 21.1.0.

carlos.molina · April 26, 2023, 6:42pm

@jungsun.park,

It is hard for me to search for a specific improvement that could fix the issue you are facing when the issue itself is not clear.

I can link you the release notes: Release Notes, but I cannot say in good faith that if you see some work on the Absorbers, those changes fix the problem you are facing.

That’s why I encourage you to implement the log so we can get one of the PDFs that fail. Then we can test it against the newest API and see if the issue is still present, and I can take the PDF and this information and present it to the dev team as a ticket.

jungsun.park · April 26, 2023, 7:06pm

Thanks @carlos.molina

After a bit more digging, I think we found a solution that might work. We just learned that if you call the TextFragmentAbsorber constructor without any regex parameter, the absorber itself contains the extracted text. With this, we don’t have to use a separate TextAbsorber. We haven’t deployed this change to our customers yet, but it looks promising.
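A rough sketch of that single-absorber approach, for anyone who lands here with the same mismatch. It assumes Aspose.PDF for .NET is referenced; the class and method names (SingleAbsorberExtraction, Extract) are made up for illustration, and it has not been run against the PDFs that triggered the original problem.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class SingleAbsorberExtraction
{
    // Reads the page text and the fragments from one absorber, so both
    // results come from the same extraction pass.
    public static (string Text, TextFragmentCollection Fragments) Extract(Page page)
    {
        // No regex argument: the absorber collects all text on the page.
        var absorber = new TextFragmentAbsorber
        {
            ExtractionOptions = new TextExtractionOptions(
                TextExtractionOptions.TextFormattingMode.MemorySaving)
        };
        page.Accept(absorber);
        return (absorber.Text, absorber.TextFragments);
    }
}
```

One thing to check before relying on this: without the "\\S+" regex the fragments may no longer be split one per word, so code that maps fragments or segments to characters may need adjusting.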

This can be marked resolved.