Aspose.PDF can't determine exactly where to break paragraphs or sentences

Ethan111 · January 8, 2025, 8:47am

Issue: Two paragraphs are clearly mixed together in the extracted text.
Version: 24.12
Log output: Chiu family and sta! outside the again was an extremely humbling
newly-opened shop, 1979 experience for both father and
son as they sought to build up their new dealership. Chiu Piano’s
"rst brands, such as the English-made Cramer, were very well
received and served as the perfect stepping stones to approach the
industry’s most established brands.

    @Test
    void extractText() throws Exception {
        Path filePath = Paths.get("src/test/resources/pdf/Piano_Pioneers_of_Singapore_Chiu_indesign_Extracted.pdf");
        // Close the input stream automatically once extraction is done.
        try (InputStream file = Files.newInputStream(filePath)) {
            Document doc = new Document(file);
            ParagraphAbsorberOptions absorberOptions = new ParagraphAbsorberOptions();
            absorberOptions.setSectionUnbreakingHorizontalOverride(0.0005);
            absorberOptions.setSectionUnbreakingVerticalOverride(0.0005);
            ParagraphAbsorber absorber = new ParagraphAbsorber(absorberOptions);
            absorber.visit(doc);

            // Walk pages, then sections, then paragraphs, logging the counts at each level.
            for (PageMarkup markup : absorber.getPageMarkups()) {
                log.info("111111 markup.getNumber:{} markup.getSections().size:{} markup.getParagraphs().size:{},markup.getTextFragments().size:{}", markup.getNumber(), markup.getSections().size(), markup.getParagraphs().size(), markup.getTextFragments().size());
                for (MarkupSection section : markup.getSections()) {
                    log.info("222222 getFragments.size:{} getParagraphs.size:{}", section.getFragments().size(), section.getParagraphs().size());
                    for (MarkupParagraph paragraph : section.getParagraphs()) {
                        log.info("333333 getLines.size:{} getLinesInternal.size:{} getFragments().size:{}", paragraph.getLines().size(), paragraph.getLinesInternal().size(), paragraph.getFragments().size());
                        log.info("paragraph.getText:{}", paragraph.getText());
                    }
                }
            }
        }
    }

image.jpg (297.2 KB)

Piano_Pioneers_of_Singapore_Chiu_indesign_Extracted.pdf (561.7 KB)

Professionalize.Discourse · January 8, 2025, 8:48am

@Ethan111

It seems you are experiencing issues with paragraph and sentence breaks when extracting text from a PDF using Aspose.PDF. This can often occur due to the complex formatting and layout of PDF documents, which may not always translate cleanly into plain text.

To address this issue, you can try adjusting the parameters of the ParagraphAbsorberOptions that you are using in your code. Specifically, the setSectionUnbreakingHorizontalOverride and setSectionUnbreakingVerticalOverride methods can be fine-tuned to better handle the spacing and layout of the text. Here’s a brief overview of how you might adjust these settings:

ParagraphAbsorberOptions absorberOptions = new ParagraphAbsorberOptions();
absorberOptions.setSectionUnbreakingHorizontalOverride(0.001); // Adjust this value
absorberOptions.setSectionUnbreakingVerticalOverride(0.001); // Adjust this value

By increasing these values slightly, you may be able to reduce the mixing of paragraphs and improve the clarity of the extracted text. Additionally, ensure that you are using the latest version of Aspose.PDF, as updates often include improvements and bug fixes related to text extraction.

If the problem persists, consider using the TextFragmentAbsorber class, which can provide more control over how text is extracted and may yield better results for your specific document structure.

For further assistance, you might want to explore the following resources:

The tutorial on extracting text using the TextDevice class, which provides a comprehensive guide on handling text extraction effectively.
The documentation on text segments in PDF files, which may offer insights into managing text layout and formatting during extraction.

If you continue to face challenges, please provide additional details about the specific formatting issues you are encountering, and I would be happy to help further.

Sources:
[1]: Extract Text Using Text Device | Aspose.PDF for .NET API Reference
[2]: Text Segments In PDF File | Aspose.PDF for .NET API Reference

Ethan111 · January 8, 2025, 8:55am

I tried values ranging from 5 down to 0.0000001, but none of them resolved the issue.

absorberOptions.setSectionUnbreakingHorizontalOverride(0.0000001);
absorberOptions.setSectionUnbreakingVerticalOverride(0.0000001);

Perhaps your SDK could determine where paragraphs break from the distance between neighboring pieces of text.
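
To illustrate the distance-based idea, here is a minimal sketch in plain Java. It is not Aspose.PDF code and does not reflect how the library works internally. The log output above appears to show two side-by-side text blocks merged line by line, so the sketch splits one visual row wherever the horizontal gap between neighboring fragments exceeds a threshold. The `RowGapSplitter` class, the `Fragment` type, and the `maxGap` parameter are all hypothetical names introduced for illustration; in a real program the coordinates would have to come from whatever position data the extraction library exposes.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGapSplitter {

    /** One piece of text on a row, with its horizontal extent in page units (hypothetical type). */
    public static final class Fragment {
        final String text;
        final double left;   // x coordinate of the left edge
        final double right;  // x coordinate of the right edge

        public Fragment(String text, double left, double right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Splits one row of fragments (already sorted left to right) into segments.
     * A new segment starts whenever the gap between the previous fragment's
     * right edge and the next fragment's left edge is larger than maxGap.
     */
    public static List<String> splitRow(List<Fragment> row, double maxGap) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Fragment previous = null;
        for (Fragment fragment : row) {
            if (previous != null) {
                double gap = fragment.left - previous.right;
                if (gap > maxGap) {
                    // Wide gap: treat the next fragment as belonging to another block.
                    segments.add(current.toString());
                    current.setLength(0);
                } else {
                    // Narrow gap: ordinary word spacing inside the same block.
                    current.append(' ');
                }
            }
            current.append(fragment.text);
            previous = fragment;
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

With made-up coordinates, a row holding "newly-opened shop," (x 50 to 130), "1979" (x 133 to 155), and "experience for both father and" (x 200 to 360) and a `maxGap` of 10 would come back as two segments, keeping the left-hand text apart from the right-hand text. A fixed `maxGap` is the simplest choice; scaling it by font size would likely be more robust across documents.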

asad.ali · January 8, 2025, 1:52pm

@Ethan111

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44629

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.