We use Aspose.PDF to convert PDF files to HTML files, and the conversion process is very slow

booway · July 27, 2017, 3:30am

粘贴图片11(07-27-11-39-46).png (103.6 KB)
粘贴图片(07-27-11-39-46).png (74.8 KB)

We use Aspose.PDF to convert PDF files to HTML. The conversion is very slow and consumes a great deal of memory. Very large PDF files often occupy a very large amount of memory, and that memory is not released when the conversion ends, so we frequently run out of memory. The attached pictures show our memory test: converting a single 20 MB PDF file took four hours.

asad.ali · July 27, 2017, 11:41am

@booway

Thanks for contacting support.

Would you please provide your sample PDF document along with the code snippet and environment details (i.e., API version, application type, operating system, target framework, etc.), so that we can test the scenario in our environment and address it accordingly?

booway · July 28, 2017, 7:35am

package com.booway.aspose.convert;

import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;

import com.aspose.pdf.Document;
import com.aspose.pdf.HtmlSaveOptions;
import com.aspose.pdf.LettersPositioningMethods;
import com.booway.aspose.analysis.constrant.ParamConstrant;
import com.booway.aspose.convert.flow.FlowStrategy;
import com.booway.aspose.convert.flow.pdf.PdfHtmlTitleFlow;
import com.booway.common.utils.StreamUtil;
import com.booway.dop.config.DopConfig;

/**
 * Converts a PDF file to HTML.
 *
 * @author JIE
 */
public class PdfConvert extends BaseConvertImpl
{
    @Override
    public ConvertType getConvertType()
    {
        return ConvertType.PDF;
    }

    @Override
    public List<Class<? extends FlowStrategy>> getFlowStrategy()
    {
        List<Class<? extends FlowStrategy>> flowStrategys = new ArrayList<Class<? extends FlowStrategy>>();
        flowStrategys.add(PdfHtmlTitleFlow.class);
        return flowStrategys;
    }

    @Override
    public void doAnalysis() throws Exception
    {
        // Simply load the source file as a PDF Document object
        Document doc = new Document(getParam(ParamConstrant.SOURCEPATH, String.class));
        addParam(ParamConstrant.DOCUMENT, doc);
    }

    @Override
    public void doConvert() throws Exception
    {
        Document doc = getParam(ParamConstrant.DOCUMENT, Document.class);
        if (null != doc)
        {
            doc.save(getParam(ParamConstrant.TARGETPATH, String.class), getHtmlSaveOptions(getParam(ParamConstrant.TARGETPATH, String.class)));
        }
    }

    /**
     * Builds the HtmlSaveOptions.
     *
     * @return the configured save options
     */
    protected HtmlSaveOptions getHtmlSaveOptions(final String targetPath)
    {
        HtmlSaveOptions newOptions = new HtmlSaveOptions();
        newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
        // This setting controls whether images are embedded into the HTML
        newOptions.PartsEmbeddingMode = DopConfig.getBoolean(ParamConstrant.COMPRESSIMAGE, false) ? HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml : HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;
        newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
        newOptions.setSplitIntoPages(false);
        newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy()
        {
            @Override
            public void invoke(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
            {
                // Reads the whole generated HTML into one in-memory byte array
                byte[] resultHtmlAsBytes = new byte[(int) htmlSavingInfo.ContentStream.getLength()];
                htmlSavingInfo.ContentStream.read(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);
                FileOutputStream fos = null;
                try
                {
                    LOG.info("Start writing Html[" + targetPath + "] file...");
                    // TODO: consider the encoding
                    fos = new FileOutputStream(targetPath);
                    fos.write(resultHtmlAsBytes);
                    fos.flush();
                    LOG.info("Html[" + targetPath + "] file written...");
                } catch (Exception e)
                {
                    LOG.error("", e);
                } finally
                {
                    StreamUtil.closeStream(fos);
                }
            }
        };
        return newOptions;
    }
}
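
Editorial note: the saving callback in the code above allocates one byte array as large as the entire generated HTML and assumes a single read fills it. A minimal sketch of a chunked alternative follows. It assumes the content can be exposed as a standard java.io.InputStream, which the thread does not confirm; `ChunkedHtmlWriter` and `writeTo` are hypothetical names, not part of Aspose.PDF, and this would not address memory held inside the library itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not an Aspose.PDF class). Java 7 compatible.
final class ChunkedHtmlWriter {
    private static final int BUFFER_SIZE = 8192;

    private ChunkedHtmlWriter() {
    }

    /**
     * Copies everything from 'in' to 'target' through a fixed-size buffer,
     * so heap use stays constant regardless of how large the HTML is.
     * Returns the number of bytes written.
     */
    static long writeTo(InputStream in, Path target) throws IOException {
        long total = 0;
        byte[] buffer = new byte[BUFFER_SIZE];
        // try-with-resources closes the output stream even if the copy fails
        try (OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

The loop also handles short reads, which a single read call into a pre-sized array does not guarantee.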

booway · July 28, 2017, 7:36am

The Aspose.PDF version is aspose.pdf-11.4.0

booway · July 28, 2017, 7:40am

The document is about 21 MB and cannot be uploaded to the forum. If you provide an email address, I can send it to you, or you could provide FTP space that accepts large attachments.

asad.ali · July 28, 2017, 2:13pm

@booway

Thanks for contacting support.

If you have a larger document, you can upload it to a public file-sharing service (e.g., Dropbox, Google Drive) and share the link here. We will test the scenario in our environment and address it accordingly.

booway · July 31, 2017, 12:33am

I'm from China, and I can't use Google's services here. Can you access Baidu (http://pan.baidu.com/)? I can upload the files there. The biggest problem is that when a large PDF file is converted to HTML, the memory stays occupied and is never released; usage keeps growing as conversions continue until memory overflows.

booway · July 31, 2017, 2:56am

粘贴图片.png (182.1 KB)

When converting PDF files to HTML, Aspose.PDF generates a large number of temporary files that are never deleted.
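
Editorial note: as a stopgap for leftover temporary files, an application could purge them itself once each conversion has finished. The sketch below is only an illustration under stated assumptions: the thread does not say where the temporary files are written or how they are named, so the directory and the glob pattern are placeholders to be supplied by the caller, and `TempFileCleaner` is a hypothetical helper, not an Aspose.PDF API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not an Aspose.PDF class). Java 7 compatible.
final class TempFileCleaner {
    private TempFileCleaner() {
    }

    /**
     * Deletes regular files in 'dir' whose names match 'glob'
     * and returns how many were removed. Subdirectories are not touched.
     */
    static int deleteMatching(Path dir, String glob) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    continue;
                }
                try {
                    if (Files.deleteIfExists(entry)) {
                        deleted++;
                    }
                } catch (IOException e) {
                    // File is probably still in use (common on Windows); skip it
                }
            }
        }
        return deleted;
    }
}
```

This should only run when no conversion is in progress, since deleting files that a running conversion still needs could break it.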

asad.ali · July 31, 2017, 11:17am

@booway

Thanks for writing back.

Would you please share the environment details, i.e., your application type, development environment, JDK version, etc., so that we can also observe the performance issue in the specified environment?

I am able to access this website. Please upload your files there and share the link, so that we can test the scenario with your specific document as well.

booway · August 1, 2017, 5:55am

I uploaded the files we tested to Baidu cloud
Url: http://pan.baidu.com/s/1hrFM4uw
Pwd: 9jv4
We convert the files one after another (sequentially, one file at a time, with no concurrent conversion; earlier tests with concurrent conversion exhausted the memory even faster). Throughout the conversion process we found that memory is never released: usage climbs steadily and never drops, and eventually this causes a memory overflow.
Test environment:
OS: Windows Server 64-bit
Tomcat: Tomcat 7
JDK: 1.7
Tomcat arguments: set CATALINA_OPTS=-server -Xms1024m -Xmx10240m -XX:PermSize=128M -XX:MaxPermSize=2048M
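
Editorial note: one way to put numbers on the "memory is never released" observation is to compare used heap before and after each conversion, after requesting a garbage collection. The sketch below uses only the standard java.lang.management API; `HeapProbe` and its methods are hypothetical names, and the conversion is passed in as a plain Runnable. The JVM treats the GC call as a request, so the figures are indicative rather than exact.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper for rough leak checks. Java 7 compatible.
final class HeapProbe {
    private HeapProbe() {
    }

    /** Requests a GC, then returns the bytes of heap currently in use. */
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        bean.gc(); // equivalent to System.gc(); only a request to the JVM
        return bean.getHeapMemoryUsage().getUsed();
    }

    /**
     * Runs 'task' and returns the change in used heap (bytes) across it.
     * A value that stays large and positive over repeated runs suggests
     * that objects are being retained after the task completes.
     */
    static long retainedAfter(Runnable task) {
        long before = usedHeapBytes();
        task.run();
        long after = usedHeapBytes();
        return after - before;
    }
}
```

Logging this value after each file in a sequential batch would show whether the growth is steady per file or tied to particular documents.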

asad.ali · August 1, 2017, 12:56pm

@booway

Thanks for sharing environment details.

We are setting up an environment to test the scenario and will get back to you shortly. Meanwhile, would you please check the link you shared? It returned a 404 Not Found error when I tried to open it.

booway · August 2, 2017, 1:56am

asad.ali · August 2, 2017, 11:57am

@booway

Thanks for sharing sample documents.

We are testing the scenario in our environment and will get back to you with our findings as soon as possible. Please be patient.

asad.ali · August 2, 2017, 12:46pm

@booway

Thanks for your patience.

I have tested the scenario in the following environment: Eclipse Neon.2 Release (4.6.2), Apache Tomcat Server 7.0, JRE 1.8, with Aspose.Pdf for Java 17.6. The code ran for more than an hour and ended in an OutOfMemoryError. CPU usage was 100% throughout the conversion process, and memory consumption was 70%-80%.

Therefore, I have logged an issue as PDFJAVA-36958 in our issue tracking system. We will investigate this issue further and keep you posted on the status of its correction. Please be patient and spare us a little time.

We are sorry for the inconvenience.

booway · August 3, 2017, 12:10am

Thank you very much.