PDF text extraction with TextFragmentAbsorber

dalazi · March 16, 2024, 8:10am

When my PDF contains content laid out as a table, the two columns are merged into one when I extract TextFragments.

pdf file
测试术语.pdf (123.9 KB)
Extract the content after translation
image.jpg (147.1 KB)

Note the following: 1 10 mm THK. CLEAR TEMPERED GLASS, ALU FILLET
Here, "1" and "10MM THK. CLEAR TEMPERED GLASS, ALU FILLET" are in two separate columns, but they are extracted as a single TextFragment.
image.png (386.8 KB)

asad.ali · March 16, 2024, 2:57pm

@dalazi

Can you please share a few more details, such as how you are creating the translated version of the PDF and how you are extracting the text? Please share a sample code snippet with us so that we can test the scenario in our environment and address it accordingly.

dalazi · March 16, 2024, 3:21pm

like this
image.png (59.8 KB)

image.png (104.6 KB)

"1" and "10MM THK. CLEAR TEMPERED GLASS, ALU FILLET" are extracted as one column, not two.

asad.ali · March 17, 2024, 12:05am

@dalazi

Instead of using TextFragmentAbsorber, can you please try extracting the table with the TableAbsorber class and let us know if you notice any improvement?

dalazi · March 17, 2024, 3:54am

But this isn’t really a table element; it only looks like a table on the page.

The result is the same:
image.png (73.0 KB)

asad.ali · March 17, 2024, 3:42pm

@dalazi

We are logging an investigation ticket in our issue management system for this case and will share the ticket ID with you shortly.

dalazi · March 19, 2024, 3:45am

I want to insert text at its original location, but the resulting position is wrong.
image.png (261.4 KB)
image.png (66.1 KB)

Original PDF
测试术语.pdf (123.9 KB)

asad.ali · March 19, 2024, 8:57pm

@dalazi

We were able to notice these issues in our environment. Therefore, we have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56834 (Text Extraction), PDFNET-56835 (Text Insertion)

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

dalazi · March 20, 2024, 8:10am

When I have a lot of text to update, I can’t catch the exception and the program is killed

image.png (56.2 KB)
image.png (450.8 KB)

code
image.png (38.5 KB)

pdf
1.pdf (2.5 MB)

code:

using System;
using System.Diagnostics;

var stopWatch = new Stopwatch();
// pathToPdf is the path to the attached PDF (1.pdf)
var document = new Aspose.Pdf.Document(pathToPdf);
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new();
stopWatch.Restart();
// Collect every text fragment in the document
absorber.Visit(document);
//absorber.ApplyForAllFragments(0);
//absorber.RemoveAllText(document);
foreach (var textFragment in absorber.TextFragments)
{
    // Print the fragment text, then update it by appending a character
    Console.WriteLine(textFragment.Text);
    textFragment.Text = textFragment.Text + "i";
}
stopWatch.Stop();

asad.ali · March 20, 2024, 6:08pm

@dalazi

Can you please share which version of the API you are using and your environment details, such as OS name and version, application type, etc.? Please also make sure to test with version 24.3 in case it helps.

dalazi · March 21, 2024, 3:29am

Version 24.3

asad.ali · March 21, 2024, 5:30pm

@dalazi

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56860

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.