TextFragments are duplicated during text extraction

fabien.levalois · November 25, 2024, 3:57pm

Extracting text from this file
TEST_TEXTE.pdf (26.8 KB)
has an issue: some fragments of the text are duplicated, as you can see below in the extraction result
TEST_TEXTE-2.txt.zip (321 Bytes)
I am using the latest version of Aspose (24-10).
Below is the Java code used for the extraction
image.png (26.5 KB)
asposeExtractText.zip (779 Bytes)
Can you indicate how to solve this issue?
Regards,
Fabien

asad.ali · November 25, 2024, 9:28pm

@fabien.levalois

Instead of initializing TextFragmentAbsorber outside the loop, please initialize it inside the loop, i.e. create a new instance for every page (or you can extract text from all pages at once):

System.out.println("Page count: " + document.getPages().size());

for (Page aPage : document.getPages()) {
    TextFragmentAbsorber tfa = new TextFragmentAbsorber();
    // Extract text fragments from the page
    aPage.accept(tfa);
    for (TextFragment tf : tfa.getTextFragments()) {
        // Write extracted contents to the writer
        writer.append(tf.getText());
        writer.newLine();
        System.out.println(tf.getText()); // Print the extracted text to the console
    }
}

fabien.levalois · November 26, 2024, 5:08pm

Hi Asad,

I changed my Java code, as you can see in the image below
image.png (58.0 KB)

but nothing changed; I still get the duplicate fragments.

Regards
Fabien

asad.ali · November 26, 2024, 7:06pm

@fabien.levalois

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44535

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
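
Until PDFJAVA-44535 is resolved, one possible stopgap is to filter repeated fragments after extraction. The following is only a minimal sketch and not a fix provided by Aspose: the `Fragment` class and the `dedupe` helper are hypothetical, and the sketch assumes that a duplicated fragment repeats the same text at roughly the same position on the same page. The page number, coordinates and text would have to be copied from each extracted `TextFragment` before calling it. Whether the duplicates in TEST_TEXTE.pdf actually share their position has not been verified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FragmentDeduplicator {

    /** Hypothetical stand-in for the values read from one extracted fragment. */
    public static final class Fragment {
        public final int page;
        public final double x;
        public final double y;
        public final String text;

        public Fragment(int page, double x, double y, String text) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Keeps the first occurrence of each (page, rounded position, text)
     * combination and preserves the original order.
     * Assumption: a true duplicate has the same text at about the same spot.
     */
    public static List<Fragment> dedupe(List<Fragment> fragments) {
        Set<String> seen = new HashSet<>();
        List<Fragment> unique = new ArrayList<>();
        for (Fragment f : fragments) {
            // Round coordinates so tiny floating-point differences still match.
            String key = f.page + "|" + Math.round(f.x) + "|" + Math.round(f.y) + "|" + f.text;
            if (seen.add(key)) {
                unique.add(f);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Sample data only; real values would come from the extraction loop.
        List<Fragment> extracted = Arrays.asList(
                new Fragment(1, 72.0, 700.0, "Hello"),
                new Fragment(1, 72.0, 700.0, "Hello"),  // same page, position and text: dropped
                new Fragment(1, 72.0, 680.0, "Hello"),  // same text on another line: kept
                new Fragment(2, 72.0, 700.0, "World"));
        for (Fragment f : dedupe(extracted)) {
            System.out.println(f.page + ": " + f.text);
        }
    }
}
```

Keying on position as well as text is a deliberate choice: filtering on text alone would also drop lines that legitimately repeat in the document. In the sample data, three of the four fragments are kept.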