Conversion from PDF to text does not support superscript and subscript formats (Java API) (Version: 23.1)

eurodyn · July 14, 2025, 7:33am

Dear Aspose Support Team,

Conversion from PDF to text does not preserve superscript and subscript formatting when we use the following code snippet with version 23.1:

public String convertToText(Document pdfDocument, int page) throws Exception {
    log.debug("Entering convertToText()");
    // Create a TextFragmentAbsorber to collect the text fragments
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

    // Accept the absorber for all PDF pages (page == 0) or for one specific page
    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }

    // Concatenate the plain text of every extracted fragment
    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        extractedText.append(fragment.getText());
    }
    log.debug("extractedText: " + extractedText);
    log.debug("Exiting convertToText()");
    return removeGarbageText(extractedText.toString());
}
proper pdf sub_superscripts.pdf (18.0 KB)

NewTest1.pdf (40.7 KB)

Could you please suggest a solution that works with the uploaded PDFs?

Thanks.

Professionalize.Discourse · July 14, 2025, 7:34am

@eurodyn

Cause

The loss of superscript and subscript formatting during PDF-to-text conversion is likely due to a limitation of text extraction in the Aspose.PDF version you are using (23.1). TextFragmentAbsorber extracts the text but may not retain formatting details such as superscripts and subscripts.

Solution

To address this, you can manually check each TextFragment for superscript and subscript properties after extraction. You can use the isSuperscript() and isSubscript() methods on each TextFragment to identify these formats and handle them accordingly.

Here’s how you can modify your existing code to include this check:

Code

public String convertToText(Document pdfDocument, int page) throws Exception {
    log.debug("Entering convertToText()");
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }

    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        // Check for superscript and subscript
        if (fragment.isSuperscript()) {
            extractedText.append("<sup>").append(fragment.getText()).append("</sup>");
        } else if (fragment.isSubscript()) {
            extractedText.append("<sub>").append(fragment.getText()).append("</sub>");
        } else {
            extractedText.append(fragment.getText());
        }
    }
    log.debug("extractedText: " + extractedText.toString());
    log.debug("Exiting convertToText()");
    return removeGarbageText(extractedText.toString());
}
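
If the extracted text is displayed in a plain-text control that does not render HTML-style tags, the <sup> and <sub> markers produced above could be mapped to Unicode superscript and subscript characters in a post-processing step. The following is only a minimal sketch: ScriptTagConverter and toUnicode are hypothetical names, not part of Aspose.PDF, and Unicode covers only digits, a few operators, and a handful of letters, so any tagged run containing an unmapped character is left unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an Aspose.PDF class): maps sup/sub tags to Unicode. */
final class ScriptTagConverter {
    private static final Map<Character, Character> SUPER = new HashMap<>();
    private static final Map<Character, Character> SUB = new HashMap<>();

    static {
        String plain = "0123456789+-=()";
        // Unicode superscript and subscript forms, in the same order as 'plain'
        String sup = "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079\u207A\u207B\u207C\u207D\u207E";
        String sub = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089\u208A\u208B\u208C\u208D\u208E";
        for (int i = 0; i < plain.length(); i++) {
            SUPER.put(plain.charAt(i), sup.charAt(i));
            SUB.put(plain.charAt(i), sub.charAt(i));
        }
        SUPER.put('n', '\u207F'); // superscript n
        SUPER.put('i', '\u2071'); // superscript i
    }

    // Matches <sup>...</sup> or <sub>...</sub>; the closing tag must match the opening one
    private static final Pattern TAG =
            Pattern.compile("<(sup|sub)>(.*?)</\\1>", Pattern.DOTALL);

    static String toUnicode(String tagged) {
        Matcher m = TAG.matcher(tagged);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Map<Character, Character> table = m.group(1).equals("sup") ? SUPER : SUB;
            StringBuilder mapped = new StringBuilder();
            boolean complete = true;
            for (char c : m.group(2).toCharArray()) {
                Character replacement = table.get(c);
                if (replacement == null) {
                    complete = false; // no Unicode form for this character
                    break;
                }
                mapped.append(replacement);
            }
            // Keep the original tagged run if any character could not be mapped
            String result = complete ? mapped.toString() : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(result));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In convertToText() this could be applied to the final string, for example ScriptTagConverter.toUnicode(extractedText.toString()), before the result is passed on for display.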

Next Steps

Implement the modified code to check for superscript and subscript formats.
Test the output to ensure that the formatting is preserved as expected.
If you still encounter issues, consider upgrading to a newer version of Aspose.PDF, as improvements and bug fixes are regularly released.
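
If fragment-level superscript and subscript flags are not available or are not reported for a given PDF, one possible fallback is a geometric heuristic: a run that is noticeably smaller than the surrounding text and whose baseline is shifted up or down is treated as a superscript or subscript. The sketch below is library-independent and rests on assumptions: Run is a hypothetical stand-in for one extracted fragment, it presumes that each fragment's baseline position and font size can be obtained from the extraction library, and the 0.85 and 0.15 thresholds are illustrative values that would need tuning against real documents.

```java
import java.util.List;

/** Hypothetical heuristic (not an Aspose.PDF API) for tagging shifted, smaller runs. */
final class ScriptHeuristic {
    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    /** Stand-in for one extracted text run on a single line. */
    static final class Run {
        final String text;
        final double baselineY; // PDF user space: Y grows upward
        final double fontSize;

        Run(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Classifies a run relative to a reference run taken from the same line. */
    static Kind classify(Run run, Run reference) {
        boolean smaller = run.fontSize < 0.85 * reference.fontSize; // assumed ratio
        double shift = run.baselineY - reference.baselineY;
        double threshold = 0.15 * reference.fontSize;               // assumed tolerance
        if (smaller && shift > threshold) {
            return Kind.SUPERSCRIPT;
        }
        if (smaller && shift < -threshold) {
            return Kind.SUBSCRIPT;
        }
        return Kind.NORMAL;
    }

    /** Joins the runs of one line, wrapping detected scripts in sup/sub tags. */
    static String tagLine(List<Run> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Use the first run with the largest font size as the reference for the line
        Run reference = line.get(0);
        for (Run r : line) {
            if (r.fontSize > reference.fontSize) {
                reference = r;
            }
        }
        StringBuilder out = new StringBuilder();
        for (Run r : line) {
            switch (classify(r, reference)) {
                case SUPERSCRIPT:
                    out.append("<sup>").append(r.text).append("</sup>");
                    break;
                case SUBSCRIPT:
                    out.append("<sub>").append(r.text).append("</sub>");
                    break;
                default:
                    out.append(r.text);
            }
        }
        return out.toString();
    }
}
```

The heuristic works on one line at a time, so the extracted fragments would first have to be grouped into lines; how reliably that can be done depends on how the PDF was produced.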

Please implement these changes and let me know if you need further assistance!

eurodyn · July 18, 2025, 2:59pm

Hello Aspose Team,

The given solution does not work with the attached PDFs. Could you please review it and test it against the attached documents? Please provide a solution at your earliest convenience.

Thanks.

asad.ali · July 19, 2025, 6:11pm

@eurodyn

We are checking it and will get back to you shortly.

asad.ali · July 20, 2025, 5:14pm

@eurodyn

Would you kindly confirm how you are saving the extracted text and in which format? Or do you intend to show the extracted text in some other control in the application?

eurodyn · July 21, 2025, 7:45am

@asad.ali

Yes, we intend to show the extracted text in our application.
Please test with our PDFs and provide a proper solution.

Thanks.

asad.ali · July 21, 2025, 5:11pm

@eurodyn

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-45224

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

PS: Would you please share a few more details about how you intend to show the extracted text in your application? Is it inside a TextBox?

eurodyn · July 22, 2025, 7:29am

@asad.ali

The extracted text is shown directly inside a text box, and we know that the text box supports these formats (superscript and subscript).

Thanks.

asad.ali · July 22, 2025, 4:30pm

@eurodyn

Thanks for the details. We have updated the ticket information accordingly and will update you as soon as we have some updates regarding its resolution.

eurodyn · November 5, 2025, 1:09pm

Hello @asad.ali

Please test with our PDFs and provide a proper solution as soon as possible.

Thanks.

asad.ali · November 5, 2025, 7:36pm

@eurodyn

We are afraid that the ticket has not been resolved yet due to other pending issues in the queue. We have recorded your concerns and will consider them during the ticket investigation. We will share updates with you as soon as we make progress towards its resolution. We appreciate your patience.

We are sorry for the inconvenience.