Hello,
I have Java code that opens a PDF and extracts the text while preserving the newline characters (converting them to XML-compliant versions). The code executes and returns the correct data, but something still seems to hold a handle on the PDF afterward. I've tried the document.dispose() and document.close() methods, but neither releases the file. Is there a way to fix this? We're using aspose-pdf-18.9.1.jar.
Thank you!
// Imports assumed: java.util.*, com.aspose.pdf.*, com.aspose.pdf.facades.PdfAnnotationEditor
String filePath = "C:/Temp/file.pdf";
Document document = new com.aspose.pdf.Document(filePath);
Map<String, Integer> combineSegment = new LinkedHashMap<String, Integer>();
ParagraphAbsorber absorber = new ParagraphAbsorber();
//absorber.visit(document);
// The editor is bound to the same file by path, separately from 'document'
PdfAnnotationEditor editor = new PdfAnnotationEditor();
editor.bindPdf(filePath);
PageCollection pgColl = editor.getDocument().getPages();
for (int i = 0; i < pgColl.size(); i++) {
    // Pages are accessed with a 1-based index
    absorber.visit(editor.getDocument().getPages().get_Item(i + 1));
}
List<PageMarkup> pageMarkups = absorber.getPageMarkups();
// Paragraph counter
Integer count = 0;
for (PageMarkup markup : pageMarkups) {
    List<MarkupSection> sections = markup.getSections();
    for (MarkupSection section : sections) {
        List<MarkupParagraph> paragraphs = section.getParagraphs();
        for (MarkupParagraph paragraph : paragraphs) {
            count = count + 1;
            List<List<TextFragment>> lines = paragraph.getLines();
            for (List<TextFragment> line : lines) {
                for (TextFragment fragment : line) {
                    // Iterate over the paragraph so its lines can be combined later.
                    // The map holds the text fragment position as key and the paragraph number as value.
                    combineSegment.put(fragment.getBaselinePosition().toString(), count);
                }
            }
            lines.clear();
        }
    }
}
StringBuilder line = new StringBuilder();
StringBuilder output = new StringBuilder();
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
// Set the extraction mode to Raw; this returns a new line as a space.
textFragmentAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
// Accept the absorber
document.getPages().accept(textFragmentAbsorber);
// Get the extracted text fragments
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
for (TextFragment fragment : textFragmentCollection) {
    Position baselinePosition = fragment.getBaselinePosition();
    // If the key is not present, the fragment is a new line
    if (!combineSegment.containsKey(baselinePosition.toString())) {
        // If the line buffer is not empty, we are still inside a paragraph and these new lines should not be printed.
        if (line.toString().isEmpty()) {
            // This version of the API returns a space for a new line.
            if (fragment.getText().trim().isEmpty()) {
                output.append(" ");
            } else {
                output.append(fragment.getText());
            }
        }
    } else {
        // If the key is present, remove the position from the map and get the paragraph number being processed
        Integer remove = combineSegment.remove(baselinePosition.toString());
        // If more text remains for the same paragraph, keep appending
        boolean contains = combineSegment.values().contains(remove);
        if (contains) {
            line.append(fragment.getText());
            continue;
        } else {
            // All text for this paragraph has been collected: emit it and reset the buffer for the next paragraph
            line.append(fragment.getText());
            output.append(line.toString());
            line = new StringBuilder();
        }
    }
}
String extractedText = output.toString();
document.close();
document.dispose();
return extractedText;
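
One thing that stands out in the snippet above is that the file is opened twice: once by new Document(filePath) and once by editor.bindPdf(filePath), yet only document is closed at the end. The sketch below illustrates that pattern in plain Java so it runs without Aspose. Handle, OPEN, extractLeaky and extractClean are hypothetical stand-ins, not Aspose classes, and whether this is the actual cause with aspose-pdf-18.9.1.jar is an assumption that would need checking.

```java
import java.util.ArrayList;
import java.util.List;

class CloseAllHandles {
    // Names of handles that are currently open (stands in for OS-level file locks).
    static final List<String> OPEN = new ArrayList<>();

    // Hypothetical stand-in for any object that opens the PDF; NOT an Aspose class.
    static final class Handle implements AutoCloseable {
        private final String name;

        Handle(String name) {
            this.name = name;
            OPEN.add(name);
        }

        @Override
        public void close() {
            OPEN.remove(name);
        }
    }

    // Mirrors the shape of the posted code: two openers, only one of them closed.
    static String extractLeaky() {
        Handle document = new Handle("document");
        Handle editor = new Handle("editor");
        String text = "extracted text";
        document.close();
        return text;
    }

    // try-with-resources closes every opener, in reverse order, even if an exception is thrown.
    static String extractClean() {
        try (Handle document = new Handle("document");
             Handle editor = new Handle("editor")) {
            return "extracted text";
        }
    }

    public static void main(String[] args) {
        OPEN.clear();
        extractLeaky();
        System.out.println("after leaky: " + OPEN);

        OPEN.clear();
        extractClean();
        System.out.println("after clean: " + OPEN);
    }
}
```

If the same applies here, the equivalent change might be calling editor.close() next to document.close() (ideally in a finally block), or visiting document.getPages() directly so that the editor is not needed at all; neither has been verified against 18.9.1.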