Superscript and subscript formatting is lost when converting PDF to text with the following code snippet (version 23.1):
public String convertToText(Document pdfDocument, int page) throws Exception {
    log.debug("Entering convertToText()");
    // Create a TextFragmentAbsorber to extract text with formatting
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    // Accept the absorber for all PDF pages (page == 0) or a specific page
    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }
    // Concatenate the text of every extracted fragment
    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        extractedText.append(fragment.getText());
    }
    log.debug("extractedText: " + extractedText);
    log.debug("Exiting convertToText()");
    return removeGarbageText(extractedText.toString());
}
The loss of superscript and subscript formatting during PDF-to-text conversion is likely due to limitations in the text extraction capabilities of the Aspose.PDF version you are using (23.1). The TextFragmentAbsorber extracts the text but may not fully retain formatting details such as superscripts and subscripts.
Solution
To address this, you can manually check each TextFragment for superscript and subscript properties after extraction. You can use the isSuperscript() and isSubscript() methods on each TextFragment to identify these formats and handle them accordingly.
Here’s how you can modify your existing code to include this check:
Code
public String convertToText(Document pdfDocument, int page) throws Exception {
    log.debug("Entering convertToText()");
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    if (page == 0) {
        pdfDocument.getPages().accept(textFragmentAbsorber);
    } else {
        pdfDocument.getPages().get_Item(page).accept(textFragmentAbsorber);
    }
    StringBuilder extractedText = new StringBuilder();
    for (TextFragment fragment : textFragmentAbsorber.getTextFragments()) {
        // Check for superscript and subscript
        if (fragment.isSuperscript()) {
            extractedText.append("<sup>").append(fragment.getText()).append("</sup>");
        } else if (fragment.isSubscript()) {
            extractedText.append("<sub>").append(fragment.getText()).append("</sub>");
        } else {
            extractedText.append(fragment.getText());
        }
    }
    log.debug("extractedText: " + extractedText.toString());
    log.debug("Exiting convertToText()");
    return removeGarbageText(extractedText.toString());
}
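If those flags are unavailable or come back false for extracted text, one possible fallback is a geometric heuristic: treat a run as superscript or subscript when its font is clearly smaller than the body text on the same line and its baseline is shifted up or down. The sketch below is library-independent and only illustrates the idea. The ScriptHeuristic class, the Run type, and both thresholds are hypothetical, and the font size and baseline values would have to be read from the library's fragment or segment objects, which has not been verified against 23.1.

```java
import java.util.List;

/** Hypothetical helper: classifies text runs as superscript/subscript by geometry. */
final class ScriptHeuristic {

    /** Stand-in for an extracted run: text, font size, baseline Y (PDF units, Y grows upward). */
    static final class Run {
        final String text;
        final double fontSize;
        final double baselineY;

        Run(String text, double fontSize, double baselineY) {
            this.text = text;
            this.fontSize = fontSize;
            this.baselineY = baselineY;
        }
    }

    // Assumed thresholds; tune them against real documents.
    private static final double SIZE_RATIO = 0.8;   // a script run is smaller than 80% of body size
    private static final double SHIFT_RATIO = 0.15; // minimum baseline shift, as a fraction of body size

    /** Wraps the runs of ONE text line in <sup>/<sub> tags based on size and baseline shift. */
    static String markUp(List<Run> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Use the first run with the largest font on the line as the body-text reference.
        Run body = line.get(0);
        for (Run r : line) {
            if (r.fontSize > body.fontSize) {
                body = r;
            }
        }
        double minShift = body.fontSize * SHIFT_RATIO;
        StringBuilder out = new StringBuilder();
        for (Run r : line) {
            boolean smaller = r.fontSize < body.fontSize * SIZE_RATIO;
            double shift = r.baselineY - body.baselineY;
            if (smaller && shift > minShift) {
                out.append("<sup>").append(r.text).append("</sup>");
            } else if (smaller && shift < -minShift) {
                out.append("<sub>").append(r.text).append("</sub>");
            } else {
                out.append(r.text);
            }
        }
        return out.toString();
    }
}
```

The heuristic only works if the extractor returns the script characters as separate runs with their own size and position. If a whole line comes back as a single fragment, the comparison would need to be made on smaller units within it.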
Next Steps
Implement the modified code to check for superscript and subscript formats.
Test the output to ensure that the formatting is preserved as expected.
If you still encounter issues, consider upgrading to a newer version of Aspose.PDF, as improvements and bug fixes are regularly released.
Please implement these changes and let me know if you need further assistance!
The given solution is not working with the attached PDFs. Will you please review the solution and try with the attached documents? Please provide some solution at the earliest.
Would you kindly confirm how you are saving the extracted text and in which format? Or do you intend to show the extracted text in some other control in the application?
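The output format matters because it determines how superscripts and subscripts can be represented at all: <sup>/<sub> tags are only useful if the text is rendered as HTML, whereas a plain text file or a simple text control would need something like Unicode superscript and subscript characters. A rough sketch of the second option, assuming the extraction step already emits the tags (the ScriptMarkup class and toUnicode method are hypothetical names, not part of any library):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns <sup>/<sub> markup into Unicode script characters. */
final class ScriptMarkup {

    // Characters that have dedicated Unicode superscript/subscript forms (digits and a few signs).
    private static final String NORMAL = "0123456789+-=()";
    private static final String SUPER =
        "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079\u207A\u207B\u207C\u207D\u207E";
    private static final String SUB =
        "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089\u208A\u208B\u208C\u208D\u208E";

    private static final Pattern TAG = Pattern.compile("<(sup|sub)>(.*?)</\\1>");

    /** Replaces each tagged span with its Unicode equivalent, character by character. */
    static String toUnicode(String marked) {
        Matcher m = TAG.matcher(marked);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String table = m.group(1).equals("sup") ? SUPER : SUB;
            StringBuilder mapped = new StringBuilder();
            for (char c : m.group(2).toCharArray()) {
                int i = NORMAL.indexOf(c);
                // Characters without a script form (most letters) are kept unchanged,
                // so their raised or lowered position is lost in plain text.
                mapped.append(i >= 0 ? table.charAt(i) : c);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(mapped.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this mapping, "H<sub>2</sub>O" becomes "H₂O". Because most letters have no script form in this table, documents with symbolic exponents would still need the HTML-style tags or another explicit marker.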
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFJAVA-45224
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
PS: Would you please share a few more details about how you intend to show the extracted text in your application? Is it inside a TextBox?
Thanks for the details. We have updated the ticket information accordingly and will update you as soon as we have some updates regarding its resolution.
We are afraid that the ticket has not been resolved yet due to other pending issues in the queue. We have recorded your concerns and will consider them during the ticket investigation. We will share updates with you as soon as we make progress towards its resolution. Please be patient and allow us some time.