Dear Aspose Support Team,
Superscript and subscript formatting is lost when converting PDF to text with the following code snippet (version 23.1):
public String convertToText(Document pdfDocument, int page) throws Exception {
    log.debug("Entering convertToText()");
    // Create a TextFragmentAbsorber to extract text with formatting
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    // Accept the absorber for all PDF pages (page == 0) or for a specific page
    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }
    // Concatenate the text of every extracted fragment
    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        extractedText.append(fragment.getText());
    }
    log.debug("extractedText: " + extractedText);
    log.debug("Exiting convertToText()");
    return removeGarbageText(extractedText.toString());
}
proper pdf sub_superscripts.pdf (18.0 KB)
NewTest1.pdf (40.7 KB)
Could you please suggest a solution that works with the uploaded PDFs?
Thanks.
@eurodyn
Cause
The issue you’re facing with the conversion from PDF to text not supporting superscript and subscript formats is likely due to the limitations in the text extraction capabilities of the Aspose.PDF library in the version you are using (23.1). The TextFragmentAbsorber extracts text but may not fully retain formatting details like superscripts and subscripts.
Solution
To address this, you can manually check each TextFragment for superscript and subscript properties after extraction. You can use the isSuperscript() and isSubscript() methods on each TextFragment to identify these formats and handle them accordingly.
Here’s how you can modify your existing code to include this check:
Code
public String convertToText(Document pdfDocument, int page) throws Exception {
    log.debug("Entering convertToText()");
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }
    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        // Check for superscript and subscript
        if (fragment.isSuperscript()) {
            extractedText.append("<sup>").append(fragment.getText()).append("</sup>");
        } else if (fragment.isSubscript()) {
            extractedText.append("<sub>").append(fragment.getText()).append("</sub>");
        } else {
            extractedText.append(fragment.getText());
        }
    }
    log.debug("extractedText: " + extractedText.toString());
    log.debug("Exiting convertToText()");
    return removeGarbageText(extractedText.toString());
}
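If the superscript and subscript flags are not available or not populated for your documents, one possible fallback is a geometric heuristic: treat a fragment as a superscript or subscript when its font is noticeably smaller than the body text on the same line and its baseline is shifted up or down. The following is only a minimal, library-independent sketch of that idea. The Span class, the markUp method, and the 0.9 and 0.1 thresholds are hypothetical and chosen for illustration; you would need to fill Span from each fragment's position and font size yourself, which has not been verified against version 23.1 or the attached PDFs.

```java
import java.util.Arrays;
import java.util.List;

public class ScriptHeuristic {

    /** Hypothetical, library-independent view of one extracted text fragment. */
    static final class Span {
        final String text;
        final double baselineY; // PDF coordinates: a larger y is higher on the page
        final double fontSize;

        Span(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Wraps spans of a single line that look like super/subscripts in sup/sub tags. */
    static String markUp(List<Span> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Assumption: the span with the largest font (first one on ties) is body text.
        Span body = line.get(0);
        for (Span s : line) {
            if (s.fontSize > body.fontSize) {
                body = s;
            }
        }
        double threshold = 0.1 * body.fontSize; // minimum baseline shift that counts
        StringBuilder out = new StringBuilder();
        for (Span s : line) {
            boolean smaller = s.fontSize < 0.9 * body.fontSize;
            double shift = s.baselineY - body.baselineY;
            if (smaller && shift > threshold) {
                out.append("<sup>").append(s.text).append("</sup>");
            } else if (smaller && shift < -threshold) {
                out.append("<sub>").append(s.text).append("</sub>");
            } else {
                out.append(s.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration only
        List<Span> line = Arrays.asList(
                new Span("E = mc", 700.0, 12.0),
                new Span("2", 704.5, 8.0),
                new Span(" and H", 700.0, 12.0),
                new Span("2", 697.5, 8.0),
                new Span("O", 700.0, 12.0));
        System.out.println(markUp(line)); // prints E = mc<sup>2</sup> and H<sub>2</sub>O
    }
}
```

The heuristic works one line at a time, so fragments would first have to be grouped by line, and the thresholds will likely need tuning per document.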
Next Steps
- Implement the modified code to check for superscript and subscript formats.
- Test the output to ensure that the formatting is preserved as expected.
- If you still encounter issues, consider upgrading to a newer version of Aspose.PDF, as improvements and bug fixes are regularly released.
Please implement these changes and let me know if you need further assistance!
Hello Aspose Team,
The given solution does not work with the attached PDFs. Could you please review it and test it with the attached documents? Please provide a solution as soon as possible.
Thanks.
@eurodyn
We are checking it and will get back to you shortly.
@eurodyn
Would you kindly confirm how you are saving the extracted text and in which format? Or do you intend to show the extracted text in some other control in the application?
@asad.ali
Yes, we intend to show the extracted text in our application.
Please try with our PDFs and provide a proper solution.
Thanks.
@eurodyn
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFJAVA-45224
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
PS: Would you please share a few more details about how you intend to show the extracted text in your application? Is it inside a TextBox?
@asad.ali
The extracted text is shown directly inside a textbox, and we know that the textbox supports these formats (superscript and subscript).
Thanks.
@eurodyn
Thanks for the details. We have updated the ticket information accordingly and will update you as soon as we have some updates regarding its resolution.