PDF text extraction with TextFragmentAbsorber

When my PDF contains an element that is laid out as a table, its two columns are merged into one when I extract the text as TextFragment objects.

PDF file:
测试术语.pdf (123.9 KB)
Extracted content after translation:
image.jpg (147.1 KB)

Note the following line: 1 10 mm THK. CLEAR TEMPERED GLASS, ALU FILLET
Here, "1" and "10MM THK. CLEAR TEMPERED GLASS, ALU FILLET" are in two separate columns, but they are extracted as a single TextFragment.
image.png (386.8 KB)

@dalazi

Can you please share a bit more details like how you are making a translated version of the PDF and how you are extracting text? Please share the sample code snippet with us so that we can test the scenario in our environment and address it accordingly.

Like this:
image.png (59.8 KB)

image.png (104.6 KB)

"1" and "10MM THK. CLEAR TEMPERED GLASS, ALU FILLET" come out as one column, not two.

@dalazi

Instead of using TextFragmentAbsorber, can you please try extracting the table with the TableAbsorber class and let us know whether you notice any improvement?
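
For reference, a minimal sketch of that approach follows. It assumes the table is on the first page and that `pathToPdf` (a placeholder) points to the attached file. Whether the two columns are detected as separate cells depends on how the PDF actually draws the table, so treat this as a starting point rather than a confirmed fix.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Placeholder path; replace with the location of the PDF under test.
string pathToPdf = "input.pdf";

var document = new Document(pathToPdf);

// TableAbsorber looks for table-like structures on the visited page.
var absorber = new TableAbsorber();
absorber.Visit(document.Pages[1]); // assumption: the table is on page 1 (pages are 1-based)

foreach (AbsorbedTable table in absorber.TableList)
{
    foreach (AbsorbedRow row in table.RowList)
    {
        foreach (AbsorbedCell cell in row.CellList)
        {
            // Each cell exposes the text fragments found inside it.
            foreach (TextFragment fragment in cell.TextFragments)
            {
                Console.Write(fragment.Text);
            }
            Console.Write(" | "); // visual separator between cells
        }
        Console.WriteLine();
    }
}
```

If the two values appear in separate cells in this output, the columns were recognized. If they still share one cell, the result matches what TextFragmentAbsorber returns.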

But this isn't really a table element; it is only laid out visually as a table.

The result is the same with TableAbsorber:
image.png (73.0 KB)

@dalazi

We are logging an investigation ticket in our issue management system for this case and will share the ticket ID with you shortly.

I want to insert text at the original location, but I find that the location is wrong.
image.png (261.4 KB)
image.png (66.1 KB)

Original PDF:
测试术语.pdf (123.9 KB)

@dalazi

We were able to reproduce these issues in our environment. Therefore, we have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56834 (Text Extraction), PDFNET-56835 (Text Insertion)

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

When there is a lot of text to update, an exception is thrown that I cannot catch, and the program is killed.

image.png (56.2 KB)
image.png (450.8 KB)

Code:
image.png (38.5 KB)

PDF:
1.pdf (2.5 MB)

Code:

using System;
using System.Diagnostics;

var stopWatch = new Stopwatch();
var document = new Aspose.Pdf.Document(pathToPdf);
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new();
stopWatch.Restart();
// Collect every text fragment in the document
absorber.Visit(document);
//absorber.ApplyForAllFragments(0);
//absorber.RemoveAllText(document);
foreach (var textFragment in absorber.TextFragments)
{
    // Print the fragment's text, then append one character to it
    Console.WriteLine(textFragment.Text);
    textFragment.Text = textFragment.Text + "i";
}
stopWatch.Stop();

@dalazi

Can you please share which version of the API you are using and your environment details, such as OS name and version, application type, etc.? Please also make sure to test with version 24.3 in case it helps.

I am using version 24.3.

@dalazi

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56860

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.