Hi, I’m facing an issue with a PDF document for which the TextAbsorber.Text property in the .NET library returns repeated or garbled text. As a workaround, I converted the PDF to HTML, loaded the HTML back as a PDF document, and then extracted clean text with TextAbsorber. This approach worked in the .NET library. However, the same logic runs into problems in the Java library. Below is the code I used:
private static String getConvertedPageText(Page page) {
    System.out.println("Due to repeated words - Page converting PDF -> HTML -> PDF");

    // Copy the page into a one-page document
    Document onePageDocument = new Document();
    onePageDocument.getPages().add(page);

    // Save as a single self-contained HTML file
    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
    saveOptions.setRasterImagesSavingMode(RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground);

    // Write straight into the in-memory stream; an unflushed BufferedOutputStream
    // wrapper could leave bytes behind when toByteArray() is called
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    onePageDocument.save(outStream, saveOptions); // This line takes more than 10 minutes
    byte[] byteArray = outStream.toByteArray();

    // Load the HTML back as a PDF document
    InputStream inputStream = new ByteArrayInputStream(byteArray);
    HtmlLoadOptions loadOptions = new HtmlLoadOptions();
    loadOptions.setHtmlMediaType(HtmlMediaType.Print);
    loadOptions.setPageLayoutOption(HtmlPageLayoutOption.ScaleToPageWidth);
    Document newDocument = new Document(inputStream, loadOptions); // This line also takes a long time and never completes
    System.out.println("Page Count: " + newDocument.getPages().size());

    // Extract the text page by page
    StringBuilder pageData = new StringBuilder();
    for (Page tempPage : newDocument.getPages()) {
        TextAbsorber ta = new TextAbsorber();
        tempPage.accept(ta);
        pageData.append(ta.getText());
        pageData.append(System.lineSeparator());
    }
    return pageData.toString();
}
The two lines marked with comments are the problem: saving to HTML takes more than 10 minutes, and loading the HTML back never completes. I would appreciate any guidance on how to optimize this process in the Java library.
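To narrow down which stage is slow, each step could be run under a time limit and its duration logged. The following is only a minimal sketch in plain Java and is not part of the Aspose API: `StepTimer` and `runWithTimeout` are hypothetical names, and the lambda in `main` is a placeholder for the save and load calls in the code above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StepTimer {

    /**
     * Runs one step on a worker thread and logs how long the caller waited.
     * Throws TimeoutException if the step does not finish within timeoutMillis.
     * (Hypothetical helper for diagnosis only.)
     */
    public static <T> T runWithTimeout(String label, Callable<T> step, long timeoutMillis)
            throws Exception {
        // Daemon thread, so a step that ignores interruption cannot keep the JVM alive
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "step-" + label);
            t.setDaemon(true);
            return t;
        });
        long start = System.nanoTime();
        Future<T> future = executor.submit(step);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption; the step may ignore it
            throw e;
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + elapsedMs + " ms");
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder step; in the real code this would wrap the save or load call
        String result = runWithTimeout("placeholder step", () -> "done", 5_000);
        System.out.println(result);
    }
}
```

A step that times out is only asked to stop via interruption, so the underlying conversion may keep running in the background. The helper is meant for measuring and reporting where the time goes, not for fixing the slowdown itself.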