How to save Document to ByteArrayOutputStream and LoadDocument from ByteArray

ravi.narsini · September 14, 2023, 5:20pm

Hi, I’m facing an issue with a PDF document whose extracted text is repeated or garbled when I read the TextAbsorber.Text property in the .NET library. As a workaround, I converted the PDF to HTML, loaded the HTML back as a PDF document, and then extracted clean text with TextAbsorber. This approach works in the .NET library. However, I ran into difficulties when implementing the same logic in the Java library. Below is the code I used:

private static String getConvertedPageText(Page page) {
    System.out.println("Due to repeated words - Page converting PDF -> HTML -> PDF");
    Document onePageDocument = new Document();
    onePageDocument.getPages().add(page);

    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.setPartsEmbeddingMode(HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml);
    saveOptions.setRasterImagesSavingMode(RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground);
    
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    BufferedOutputStream bos = new BufferedOutputStream(outStream);
    onePageDocument.save(bos, saveOptions); // This line takes more than 10 minutes
    bos.flush(); // push any bytes still held in the buffer into outStream before reading it
    byte[] byteArray = outStream.toByteArray();

    InputStream inputStream = new ByteArrayInputStream(byteArray, 0, byteArray.length);
    HtmlLoadOptions loadOptions = new HtmlLoadOptions();
    loadOptions.setHtmlMediaType(HtmlMediaType.Print);
    loadOptions.setPageLayoutOption(HtmlPageLayoutOption.ScaleToPageWidth);

    Document newDocument = new Document(inputStream, loadOptions); // This line also takes a long time and never completes

    System.out.println("Page Count: " + newDocument.getPages().size());

    StringBuilder pageData = new StringBuilder();
    for (Page tempPage : newDocument.getPages()) {
        TextAbsorber ta = new TextAbsorber();
        tempPage.accept(ta);
        pageData.append(ta.getText());
        pageData.append(System.lineSeparator());
    }
    return pageData.toString();
}

The two lines marked in the code take an excessive amount of time to execute. I would appreciate any insights or guidance on how to optimize this process in the Java library.
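A side note on the stream handling in the Java snippet: bytes written through a BufferedOutputStream only reach the underlying ByteArrayOutputStream once the buffer is flushed or fills up. Whether the library's save call flushes the stream itself is not confirmed in this thread, so the following is only a JDK-only sketch of the general pattern. The class and method names (StreamRoundTrip, writeBuffered) are hypothetical and exist only for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamRoundTrip {

    // Writes the payload through a BufferedOutputStream and returns whatever
    // has reached the underlying ByteArrayOutputStream at that point.
    public static byte[] writeBuffered(byte[] payload, boolean flush) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BufferedOutputStream bos = new BufferedOutputStream(out);
        bos.write(payload);
        if (flush) {
            bos.flush(); // without this, small payloads stay in the buffer
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8);

        byte[] unflushed = writeBuffered(html, false);
        byte[] flushed = writeBuffered(html, true);
        System.out.println("without flush: " + unflushed.length + " bytes");
        System.out.println("with flush:    " + flushed.length + " bytes");

        // Loading back from the byte array only needs a ByteArrayInputStream.
        try (InputStream in = new ByteArrayInputStream(flushed)) {
            byte[] buffer = new byte[flushed.length];
            int read = in.read(buffer);
            System.out.println("read back:     " + read + " bytes");
        }
    }
}
```

Writing directly to the ByteArrayOutputStream avoids the question altogether, since it is already an in-memory buffer and gains nothing from the extra wrapper.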

asad.ali · September 14, 2023, 9:18pm

@ravi.narsini

Please try the latest version of the API. You can also increase the Java heap size with the JVM options -Xms (initial size) and -Xmx (maximum size); note that these flags are case-sensitive. If the issue persists, please share your sample document so that we can test the scenario in our environment and address it accordingly.

ravi.narsini · September 15, 2023, 12:38pm

Thank you for your feedback, Asad.
Please find attached the PDF document I am currently working on, along with sample .NET code that is functioning correctly.
f1.pdf (1.2 MB)


foreach (Page page in document.Pages)
{
    Document onePageDocument = new Document();
    onePageDocument.Pages.Add(page);

    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    saveOptions.RasterImagesSavingMode = RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    Stream st = new MemoryStream();
    onePageDocument.Save(st, saveOptions);

    st.Seek(0, SeekOrigin.Begin);
    HtmlLoadOptions loadOptions = new HtmlLoadOptions();
    loadOptions.HtmlMediaType = HtmlMediaType.Print;
    loadOptions.PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth;
    Document newDocument = new Document(st, loadOptions);
    foreach (Page singlePage in newDocument.Pages)
    {
        TextAbsorber textAbsorber = new TextAbsorber();
        singlePage.Accept(textAbsorber);
        string[] stringSeparators = new string[] { "\r\n" };
        string[] lines = textAbsorber.Text.Split(stringSeparators, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < lines.Length; i++)
        {
            file.WriteLine($"{pageNumber}.{i}:{lines[i]}");
        }
    }
    pageNumber++;
}

Thanks,
Ravi Narsini
+91 9949422930

asad.ali · September 15, 2023, 7:25pm

@ravi.narsini

We tested the same Java code that you shared with version 23.8 of the API in our environment. The code executed in 20 seconds. Can you please make sure that you are using the latest version? If the issue still persists, please share the JDK version you are using as well as the Java heap size. Attached is the output for your reference.
outputtext.zip (8.8 KB)
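If it helps when reporting back, the JDK version and the effective maximum heap can be printed from inside the same JVM that runs the conversion. This is a minimal sketch that uses only standard JDK calls; the class name JvmInfo and its helper methods are hypothetical.

```java
public class JvmInfo {

    // Version string of the running JVM, taken from the standard system property.
    public static String javaVersion() {
        return System.getProperty("java.version");
    }

    // Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when it is set).
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + javaVersion());
        System.out.println("Max heap:     " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same JVM options as the application (for example the same -Xmx value) shows the settings that are actually in effect, rather than the ones assumed from a launch script.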

ravi.narsini · September 16, 2023, 6:29am

That's great! I was using 22.9. Let me check with 23.8 then. Thanks for the update @asad.ali