PDF to HTML conversion is too slow and often hangs without completing

booway · July 11, 2017, 8:38am

Can PDF to HTML conversion run multi-threaded? The conversion consumes too many resources and is very, very slow. It often never finishes and simply hangs.
The code is:

Document doc = new Document("1.pdf");
String targetPath = getParam(ParamConstrant.TARGETPATH, String.class);
doc.save(targetPath, getHtmlSaveOptions(targetPath));

protected HtmlSaveOptions getHtmlSaveOptions(final String targetPath)
{
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    // This controls whether images are embedded into the HTML
    newOptions.PartsEmbeddingMode = DopConfig.getBoolean(ParamConstrant.COMPRESSIMAGE, false) ? HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml : HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;
    newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.setSplitIntoPages(false);
    newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy()
    {
        @Override
        public void invoke(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
        {
            byte[] resultHtmlAsBytes = new byte[(int) htmlSavingInfo.ContentStream.getLength()];
            htmlSavingInfo.ContentStream.read(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);
            FileOutputStream fos = null;
            try
            {
                LOG.info("Start writing HTML file [" + targetPath + "]...");
                // Consider the encoding
                fos = new FileOutputStream(targetPath);
                fos.write(resultHtmlAsBytes);
                fos.flush();
                LOG.info("Finished writing HTML file [" + targetPath + "]...");
            } catch (Exception e)
            {
                e.printStackTrace();
            } finally
            {
                try
                {
                    if (null != fos)
                    {
                        fos.close();
                    }
                } catch (Exception e)
                {
                    LOG.error(e.getMessage());
                }
            }
        }
    };
    return newOptions;
}
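
The saving strategy above reads the whole HTML result into one byte array before writing it, so a large page is held in memory twice. A minimal sketch of copying a stream straight to disk with only the standard library follows. It assumes the content is available as a java.io.InputStream, which this thread does not confirm for ContentStream, and HtmlStreamWriter and writeTo are hypothetical names used for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library mentioned in this thread.
class HtmlStreamWriter {

    /** Copies the stream to targetPath and returns the number of bytes written. */
    static long writeTo(InputStream content, Path targetPath) throws IOException {
        // try-with-resources closes the stream even if the copy fails
        try (InputStream in = content) {
            return Files.copy(in, targetPath, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the real callback this would be the generated HTML
        byte[] html = "<html><body>demo</body></html>".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("converted", ".html");
        long written = writeTo(new ByteArrayInputStream(html), target);
        System.out.println("Wrote " + written + " bytes to " + target);
        Files.delete(target);
    }
}
```

This only shortens the file-writing step. It would not by itself explain or fix memory that stays allocated after the conversion has finished.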

asad.ali · July 11, 2017, 1:49pm

@booway

Thanks for contacting support.

Would you please share the input PDF document that you are using for the conversion process? We will test the scenario in our environment and address it accordingly.

Best Regards,
Asad Ali

booway · July 28, 2017, 7:38am

The document is about 21 MB and cannot be uploaded to the forum. If you provide an email address, I'll send it to you.

asad.ali · July 28, 2017, 2:10pm

@booway

Thanks for contacting support.

If you have a larger document, you can upload it to a public file sharing service (e.g. Dropbox, Google Drive) and share the link here. We will test the scenario in our environment and address it accordingly.

booway · July 31, 2017, 12:54am

The sample PDF file is not important. According to our tests here, the cause is that the memory consumed during conversion is not reclaimed by the GC afterwards. You can easily reproduce this: take a number of larger PDF files (more than 20 MB each) and convert them to HTML one after another without interruption, and before long you will see the memory consumption. The memory is not released even after a long time (in our test it still had not been released 6 hours after the conversion completed). My personal guess is that there is a memory leak inside the component.
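
A rough way to put numbers on this observation is to sample the used heap after each conversion, as in the sketch below. HeapProbe and its methods are hypothetical names, and the workload in main is only a stand-in that allocates scratch buffers; the real PDF to HTML call would go in its place. Note that System.gc() is only a hint to the JVM, so the figures are approximate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical measurement harness, not part of any library in this thread.
class HeapProbe {

    /** Requests a GC (a hint only) and returns the heap bytes still in use. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the task repeatedly; samples[0] is the baseline, samples[i] follows run i. */
    static long[] measure(Runnable task, int iterations) {
        long[] samples = new long[iterations + 1];
        samples[0] = usedHeapAfterGc();
        for (int i = 1; i <= iterations; i++) {
            task.run();
            samples[i] = usedHeapAfterGc();
        }
        return samples;
    }

    public static void main(String[] args) {
        // Stand-in workload: allocates 16 MB that becomes garbage on return.
        // Replace the body with the real conversion to test the reported behaviour.
        Runnable standIn = () -> {
            List<byte[]> scratch = new ArrayList<>();
            for (int i = 0; i < 16; i++) {
                scratch.add(new byte[1 << 20]);
            }
        };
        long[] samples = measure(standIn, 5);
        for (int i = 0; i < samples.length; i++) {
            System.out.printf("after run %d: %d KB used%n", i, samples[i] / 1024);
        }
    }
}
```

If the samples keep climbing from run to run and never fall back towards the baseline, that is consistent with objects being retained between conversions. A heap dump would still be needed to say what is holding them.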

asad.ali · July 31, 2017, 11:20am

@booway

Thanks for contacting support.

I have tested the scenario with one of my sample PDFs (25 MB in size) and noticed that the code kept running until it resulted in an OutOfMemoryError. Hence I have logged this issue as PDFJAVA-36955 in our issue tracking system. We will investigate the issue further and keep you updated on the status of its resolution. Please be patient and allow us a little time.

We are sorry for the inconvenience.