Aspose.PDF fails to preserve superscript and subscript text when converting PDF to text

Hello,

I’m using TextAbsorber to extract text from a PDF, but it fails to preserve superscripts and subscripts. The PDF contains simple mathematical and chemical equations. Please advise how to get this working. I’m using the following Java code.

public String convertToText(Document pdfDocument, int page) throws Exception {
    // Create TextAbsorber object to extract text
    TextAbsorber textAbsorber = new TextAbsorber();
    if (page == 0) {
        // Accept the absorber for all PDF pages
        pdfDocument.getPages().accept(textAbsorber);
    } else {
        // Accept the absorber for a particular PDF page
        pdfDocument.getPages().get_Item(page).accept(textAbsorber);
    }
    // Get the extracted text
    String extractedText = textAbsorber.getText();
    return extractedText;
}

I’m uploading a sample PDF for reference. Page 6 contains some equations you can look at.
comp4.pdf (333.1 KB)

@Shivah6

Cause

The issue you’re facing with extracting superscript and subscript text from a PDF using the TextAbsorber class is a common challenge. The default text extraction methods may not handle these formatting styles correctly, especially in mathematical and chemical equations.

Solution

To extract superscript and subscript text, you can use TextFragmentAbsorber instead of TextAbsorber. It returns individual text fragments together with their text state, so the formatting of each fragment can be inspected.

Code

Here’s an updated version of your code that uses TextFragmentAbsorber to capture superscript and subscript text:

import com.aspose.pdf.*;

public String convertToText(Document pdfDocument, int page) throws Exception {
    // Create a TextFragmentAbsorber to extract text with formatting
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    
    // Accept the absorber for all PDF pages or a specific page
    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }
    
    // Get the extracted text with formatting
    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        extractedText.append(fragment.getText());
        // Check for superscript and subscript
        if (fragment.getTextState().getFontStyle().contains(FontStyles.Superscript)) {
            extractedText.append(" (superscript)");
        }
        if (fragment.getTextState().getFontStyle().contains(FontStyles.Subscript)) {
            extractedText.append(" (subscript)");
        }
    }
    
    return extractedText.toString();
}

Next Steps

  1. Replace your existing TextAbsorber implementation with the provided code.
  2. Test the updated method with your PDF document to ensure it captures the superscript and subscript text correctly.
  3. If you encounter any further issues, check the PDF structure or consider using additional libraries for more complex extraction needs.

Please let me know if you need any further assistance!

Hello again,

If I use the above code I’m getting
cannot find symbol
[ERROR] symbol: variable Superscript
[ERROR] location: class com.aspose.pdf.FontStyles

When I check the Aspose reference documentation for FontStyles, there is no superscript-related member.

Please advise how to fix this. I need Java code for it.

Thanks

Aspose team, please advise on this issue as soon as possible.
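
As the compiler error above suggests, FontStyles offers no superscript or subscript flag to test. One possible workaround is to infer the script level from geometry: a run that is noticeably smaller than the surrounding text and shifted above or below its baseline is likely a superscript or subscript. The following is a minimal sketch of that heuristic in plain Java. The `ScriptClassifier` class, the `Run` type, and both thresholds are illustrative assumptions and not part of Aspose.PDF; the font size and baseline of each run would have to be read from whatever per-segment data the library exposes.

```java
import java.util.List;

// Illustrative sketch only: none of these types come from Aspose.PDF.
final class ScriptClassifier {

    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    // Hypothetical value type: a run of text with its font size and baseline
    // position in PDF units (y grows upward).
    static final class Run {
        final String text;
        final double fontSize;
        final double baselineY;

        Run(String text, double fontSize, double baselineY) {
            this.text = text;
            this.fontSize = fontSize;
            this.baselineY = baselineY;
        }
    }

    // Assumed thresholds; tune them for the documents at hand.
    private static final double SIZE_RATIO = 0.85;
    private static final double SHIFT_RATIO = 0.10;

    // Compares one run against a reference run of normal body text.
    static Kind classify(Run run, Run reference) {
        boolean smaller = run.fontSize < reference.fontSize * SIZE_RATIO;
        double shift = run.baselineY - reference.baselineY;
        double tolerance = reference.fontSize * SHIFT_RATIO;
        if (smaller && shift > tolerance) return Kind.SUPERSCRIPT;
        if (smaller && shift < -tolerance) return Kind.SUBSCRIPT;
        return Kind.NORMAL;
    }

    // Marks superscripts with '^' and subscripts with '_' within one line of text.
    // The first run with the largest font size is taken as the reference.
    static String annotate(List<Run> line) {
        if (line.isEmpty()) return "";
        Run reference = line.get(0);
        for (Run run : line) {
            if (run.fontSize > reference.fontSize) reference = run;
        }
        StringBuilder out = new StringBuilder();
        for (Run run : line) {
            Kind kind = classify(run, reference);
            if (kind == Kind.SUPERSCRIPT) out.append('^');
            else if (kind == Kind.SUBSCRIPT) out.append('_');
            out.append(run.text);
        }
        return out.toString();
    }
}
```

With the runs H (size 12, baseline 100), 2 (size 8, baseline 97) and O (size 12, baseline 100), `annotate` returns `H_2O`. The heuristic can misfire on footnote markers or mixed font sizes, so the thresholds are a starting point rather than fixed values.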

@Shivah6

Can you please share some more details, such as what you want to achieve after getting the superscript and subscript values? Do you intend to update them, or are you saving them in a specific format after extracting them?

Hello @asad.ali

We display the text converted from the PDF in a text box, but superscripts and subscripts are not displayed properly after conversion. Example: H₂O, x².
After conversion, the 2 in H2O is no longer a subscript and the 2 in x2 is no longer a superscript.
I just want to achieve this with Java code. Please guide me.

Thanks.
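
For a plain-text target, one way to keep the distinction visible is to map the affected characters to their Unicode superscript and subscript forms once they have been identified. The sketch below shows only that mapping step for digits and the plus and minus signs. The `UnicodeScripts` class and its method names are hypothetical helpers, not Aspose.PDF API, and the sketch assumes the caller already knows which characters are raised or lowered.

```java
// Illustrative helper, not part of Aspose.PDF: maps digits and +/- to the
// Unicode superscript or subscript code points; other characters pass through.
final class UnicodeScripts {

    // Superscript digits 0-9 (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079)
    private static final String SUPER_DIGITS =
            "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079";
    // Subscript digits 0-9 (U+2080..U+2089)
    private static final String SUB_DIGITS =
            "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089";

    static String toSuperscript(String s) {
        return map(s, SUPER_DIGITS, '\u207A', '\u207B');
    }

    static String toSubscript(String s) {
        return map(s, SUB_DIGITS, '\u208A', '\u208B');
    }

    private static String map(String s, String digits, char plus, char minus) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                out.append(digits.charAt(c - '0'));
            } else if (c == '+') {
                out.append(plus);
            } else if (c == '-') {
                out.append(minus);
            } else {
                out.append(c); // not mapped in this sketch; keep unchanged
            }
        }
        return out.toString();
    }
}
```

For example, `"H" + UnicodeScripts.toSubscript("2") + "O"` produces H₂O and `"x" + UnicodeScripts.toSuperscript("2")` produces x². Whether these characters render correctly depends on the font used by the text box.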

@Shivah6

We will investigate the scenario in detail, but first we need the complete picture. Can you please share where you are filling this information into a text box? Are you filling it into another PDF and expecting to see the subscript or superscript correctly formatted inside the text box? Do these text boxes support this kind of text formatting?

Hello @asad.ali

I save the PDF document and then use this method to extract its content as text. I just need Aspose Java code that preserves superscripts and subscripts. The text box does support this kind of formatting.

@Shivah6

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-45156

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.