Aspose PDF for Java - issues in converting to HTML

venkylam7 · October 24, 2017, 10:21am

Hi,

I am trying to convert PDF to HTML using the Aspose.PDF library. I referred to the sample on GitHub (aspose-pdf’s gists · GitHub) as well as this documentation (https://docs.aspose.com/display/pdfjava/Convert+PDF+to+HTML+format#ConvertPDFtoHTMLformat-PDFtoHTML-Allresourceembeddedinsingleresultantstream).

However, I have a problem with the latest version of Aspose.Pdf that I can find, 16.11.0. Where the sample defines the CustomHTMLSavingStrategy, it contains this line:
htmlSavingInfo.ContentStream.read(resultHtmlAsBytes, 0, resultHtmlAsBytes.length);

However, in the current version ContentStream does not return an InputStream; it returns an internal Aspose class that has no read method. Is there a sample or documentation that is consistent with the latest version?

Regards
Venkat

imran.rafique · October 24, 2017, 9:08pm

@venkylam7,

The version 16.11.0 of Aspose.Pdf for Java API is not the latest version. Please download and try the latest version 17.9 of Aspose.Pdf for Java API. With the latest version 17.9, ContentStream instance returns an input stream.
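
When ContentStream is a java.io.InputStream, as the sample line quoted above assumes, keep in mind that a single read(...) call is not guaranteed to fill the whole buffer. Below is a minimal sketch of draining a stream completely using only the JDK. The class and method names (StreamDrain, readAll, readAllAsUtf8) are illustrative and not part of Aspose.PDF, and the UTF-8 encoding is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative helper; not part of the Aspose.PDF API.
final class StreamDrain {
    private StreamDrain() {}

    /** Reads every remaining byte from the stream, looping until end of stream. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // read() may return fewer bytes than requested, so loop until it returns -1.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Decodes the drained bytes as UTF-8 text (the encoding is an assumption). */
    static String readAllAsUtf8(InputStream in) throws IOException {
        return new String(readAll(in), StandardCharsets.UTF_8);
    }
}
```

Inside a custom saving strategy, a call such as StreamDrain.readAllAsUtf8(htmlSavingInfo.ContentStream) would then replace the single read(...) call, provided ContentStream really is an InputStream in the version in use.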

venkylam7 · December 1, 2017, 10:51am

We couldn’t find a Maven dependency for Aspose.Pdf 17.9; the latest version we see in the Maven repositories is Aspose.Pdf 16.11.0. We use Maven dependencies in our application, and our goal is to convert a PDF to HTML. Can you please let us know how to get a Maven dependency for 17.9?

imran.rafique · December 1, 2017, 2:42pm

@venkylam7,

Please download the latest version 17.11 of Aspose.Pdf for Java API from the following Maven repository:

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>http://maven.aspose.com/artifactory/simple/ext-release-local/</url>
</repository> 
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>17.11</version>
</dependency>

venkylam7 · December 5, 2017, 12:21pm

We are trying to convert PDF to HTML using a ByteArrayOutputStream, without creating an HTML file. Please find the code below. With this approach, however, the generated HTML references an external CSS file.

We wanted to generate HTML with embedded CSS, so we added the following line to the snippet, but it failed to work:
// Enable option to embed all resources inside the HTML
saveOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;

Can you please suggest how to generate HTML with embedded CSS, without depending on a static HTML file in a specific location?

External CSS reference in the generated HTML:

<link rel="stylesheet" type="text/css" href="output_files/style.css">

Java Code snippet:

public String convertPDFTOHTML(String observationValue) throws Exception {
	byte[] inputBytes = null;
	try {
		inputBytes = Base64.getDecoder().decode(observationValue);
	} catch (IllegalArgumentException e) {
		inputBytes = observationValue.getBytes();
	}
	InputStream stream = new ByteArrayInputStream(inputBytes);
	ByteArrayOutputStream objBAOS = new ByteArrayOutputStream();
	com.aspose.pdf.Document doc = new com.aspose.pdf.Document(stream);

	HtmlSaveOptions saveOptions = new HtmlSaveOptions();
	saveOptions.setSplitIntoPages(false);
	saveOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy() {
		@Override
		public void invoke(HtmlPageMarkupSavingInfo arg0) {
		}
	};

	saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy() {
		@Override
		public String invoke(ResourceSavingInfo arg0) {
			// TODO Auto-generated method stub
			return null;
		}
	};

	saveOptions.CustomStrategyOfCssUrlCreation = new HtmlSaveOptions.CssUrlMakingStrategy() {
		@Override
		public String invoke(CssUrlRequestInfo arg0) {
			return null;
		}
	};
	saveOptions.CustomCssSavingStrategy = new HtmlSaveOptions.CssSavingStrategy() {
		@Override
		public void invoke(CssSavingInfo arg0) {

		}
	};

	saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
	saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
	doc.save(objBAOS, saveOptions);

imran.rafique · December 5, 2017, 8:25pm

@venkylam7,

The above line of code is not available in your code example. Kindly send us your source PDF document. We will investigate and share our findings with you. Please refer to this code example: Save output HTML to a single stream with embedded resources

venkylam7 · December 6, 2017, 5:30am

Hi Imran!! Thanks for your quick response!!

Please find the attached Java code, which generates HTML with external CSS. When we added the PartsEmbeddingMode property, it gave a runtime error, so we commented it out in the code and attached the file for reference. Our goal is to generate HTML with embedded CSS without storing the HTML file physically.

Runtime error:

class com.aspose.pdf.internal.ms.System.z8: If selected mode of embedding into HTML(PartsEmbeddingModes.EmbedAllIntoHtml), custom CSS saving stategy not allowed and must be null! Please set HtmlSaveOptions.CustomCssSavingStrategy=null!’
com.aspose.pdf.HtmlSaveOptions.m1(Unknown Source)
com.aspose.pdf.z93.m1(Unknown Source)
com.aspose.pdf.ADocument.m1(Unknown Source)
com.aspose.pdf.Document.m1(Unknown Source)
com.aspose.pdf.ADocument.save(Unknown Source)
com.aspose.pdf.Document.save(Unknown Source)
com.globalhealth.referralnet.agent.rtf.Impl.ex.convertPDFTOHTML(ex.java:82)
com.globalhealth.referralnet.agent.rtf.Impl.ex.main(ex.java:20)
at com.aspose.pdf.HtmlSaveOptions.m1(Unknown Source)
at com.aspose.pdf.z93.m1(Unknown Source)
at com.aspose.pdf.ADocument.m1(Unknown Source)
at com.aspose.pdf.Document.m1(Unknown Source)
at com.aspose.pdf.ADocument.save(Unknown Source)
at com.aspose.pdf.Document.save(Unknown Source)
at com.globalhealth.referralnet.agent.rtf.Impl.ex.convertPDFTOHTML(ex.java:82)
at com.globalhealth.referralnet.agent.rtf.Impl.ex.main(ex.java:20)

image.png (162.4 KB)
pdftohtml.zip (148.4 KB)

imran.rafique · December 6, 2017, 2:25pm

@venkylam7,

You have sent a picture of the HTML document. Kindly send a ZIP of the HTML document instead, or paste the HTML string in your reply.

venkylam7 · December 7, 2017, 9:09am

pdftohtml.zip (149.7 KB)

Hi, please find the attached ZIP, which contains the code, the output, and the runtime error we get when enabling the PartsEmbeddingMode property.

imran.rafique · December 7, 2017, 12:29pm

@venkylam7,

As mentioned in our earlier post, we also require your source PDF document. We recommend that clients share all details of the scenario so that we can replicate the error in our environment. We await your response.

venkylam7 · January 5, 2018, 12:21pm

We are converting a PDF to HTML using Aspose PDF in Java. The conversion succeeds when the PDF has no images, but if the PDF contains images it throws an exception. Please find our code and the exception below. Can you please help us fix this exception, which says "java.lang.ClassNotFoundException: javax.imageio.event.IIOWriteProgressListener not found by com.aspose.pdf"?

Exception:
java.lang.NoClassDefFoundError: javax/imageio/event/IIOWriteProgressListener
at com.aspose.pdf.internal.p822.z1.m9(Unknown Source)
at com.aspose.pdf.internal.p822.z1.m1(Unknown Source)
at com.aspose.pdf.internal.p822.z1.m5(Unknown Source)
at com.aspose.pdf.internal.p822.z1.m2(Unknown Source)
at com.aspose.pdf.internal.p781.z30.m1(Unknown Source)
at com.aspose.pdf.internal.p781.z30.m1(Unknown Source)
at com.aspose.pdf.internal.p121.z7.(Unknown Source)
at com.aspose.pdf.internal.p120.z14.m1(Unknown Source)
at com.aspose.pdf.internal.p52.z1.m1(Unknown Source)
at com.aspose.pdf.internal.p38.z21.m2(Unknown Source)
at com.aspose.pdf.internal.p38.z21.m4(Unknown Source)
at com.aspose.pdf.internal.p38.z7.m2(Unknown Source)
at com.aspose.pdf.internal.p38.z7.m2(Unknown Source)
at com.aspose.pdf.ApsUsingConverter.m1(Unknown Source)
at com.aspose.pdf.z93.m1(Unknown Source)
at com.aspose.pdf.ADocument.save(Unknown Source)
at com.aspose.pdf.Document.save(Unknown Source)
at com.globalhealth.referralnet.agent.rtf.Impl.RTFTransformationServiceImpl.convertPDFTOHTML(RTFTransformationServiceImpl.java:461)
at com.globalhealth.referralnet.agent.servlet.HL7MessageViewer.convertCDAtoHtml(HL7MessageViewer.java:304)[240:com.globalhealth.referralnet.agent.rna-admin-servlets:2.1.0.SNAPSHOT]
at com.globalhealth.referralnet.agent.servlet.HL7MessageViewer.prepareMessageOBXData(HL7MessageViewer.java:216)[240:com.globalhealth.referralnet.agent.rna-admin-servlets:2.1.0.SNAPSHOT]
at com.globalhealth.referralnet.agent.servlet.HL7MessageViewer.prepareMessageJSON(HL7MessageViewer.java:123)[240:com.globalhealth.referralnet.agent.rna-admin-servlets:2.1.0.SNAPSHOT]
at com.globalhealth.referralnet.agent.servlet.HL7MessageViewer.doPost(HL7MessageViewer.java:84)[240:com.globalhealth.referralnet.agent.rna-admin-servlets:2.1.0.SNAPSHOT]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)[61:javax.servlet-api:3.1.0]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)[61:javax.servlet-api:3.1.0]
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)[87:org.eclipse.jetty.servlet:9.2.15.v20160210]
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)[87:org.eclipse.jetty.servlet:9.2.15.v20160210]
at org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:71)[112:org.ops4j.pax.web.pax-web-jetty:4.2.6]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)[85:org.eclipse.jetty.security:9.2.15.v20160210]
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:276)[112:org.ops4j.pax.web.pax-web-jetty:4.2.6]
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)[87:org.eclipse.jetty.servlet:9.2.15.v20160210]
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:80)[112:org.ops4j.pax.web.pax-web-jetty:4.2.6]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.server.Server.handle(Server.java:499)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)[86:org.eclipse.jetty.server:9.2.15.v20160210]
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)[78:org.eclipse.jetty.io:9.2.15.v20160210]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)[89:org.eclipse.jetty.util:9.2.15.v20160210]
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)[89:org.eclipse.jetty.util:9.2.15.v20160210]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_121]
Caused by: java.lang.ClassNotFoundException: javax.imageio.event.IIOWriteProgressListener not found by com.aspose.pdf [183]
at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1574)[org.apache.felix.framework-5.4.0.jar:]
at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:79)[org.apache.felix.framework-5.4.0.jar:]
at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:2018)[org.apache.felix.framework-5.4.0.jar:]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)[:1.8.0_121]
… 45 more
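
The "not found by com.aspose.pdf [183]" message in the trace above comes from the Felix bundle class loader. This suggests the class is not visible to that bundle, not that it is missing from the JRE. A minimal sketch for checking visibility through a given class loader follows; the helper name ClassVisibility is hypothetical, and the code uses only the JDK.

```java
// Illustrative diagnostic helper; not part of Aspose.PDF or Felix.
final class ClassVisibility {
    private ClassVisibility() {}

    /** Returns true if the named class can be loaded through the given class loader. */
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            // initialize=false: only resolve the class, do not run static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Calling ClassVisibility.isVisible("javax.imageio.event.IIOWriteProgressListener", com.aspose.pdf.Document.class.getClassLoader()) from inside the container would show what the Aspose bundle itself can see. If it returns false there, the usual OSGi remedies are to have the bundle import the javax.imageio packages or to add them to the framework's boot delegation (org.osgi.framework.bootdelegation). Whether either applies here is an assumption that depends on how the Aspose JAR was deployed.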

Code Snippet:

public String convertPDFTOHTML(InputStream stream) throws Exception {
	// Output file path
	UUID id = UUID.randomUUID();
	String outHtmlFileName = "PDFtoHTML_" + id + ".html";
	String outHtmlFile = System.getProperty("karaf.data") + File.separator + outHtmlFileName;
	String encodedString = null;
	InputStream htmlInputStream = null;
	ByteArrayOutputStream baos = new ByteArrayOutputStream();
	try {
		Logger.getLogger(this.getClass()).info("Output file path for pdf to html " + outHtmlFile);

		com.aspose.pdf.Document doc = new com.aspose.pdf.Document(stream);
		HtmlSaveOptions newOptions = new HtmlSaveOptions();
		// Enable option to embed all resources inside the HTML
		newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
		// This is just optimization for IE and can be omitted
		newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
		newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
		newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
		// Save the output file
		doc.save(outHtmlFile, newOptions);

		File outHTMLFile = new File(outHtmlFile);
		htmlInputStream = new FileInputStream(outHTMLFile);
		byte[] buffer = new byte[htmlInputStream.available()];
		int length = 0;
		while ((length = htmlInputStream.read(buffer)) > 0) {
			baos.write(buffer, 0, length);
		}
		byte[] bytes = baos.toByteArray();
		byte[] encoded = Base64.getEncoder().encode(bytes);
		encodedString = new String(encoded);
	} catch (Exception e) {
		Logger.getLogger(RNAUserLogger.class).error("Exception while converting the pdf to html ", e);
		throw e;
	} finally {

		if (baos != null) {
			baos.close();
		}
		if (stream != null) {
			stream.close();
		}
		if (htmlInputStream != null) {
			htmlInputStream.close();
		}
		deleteCreatedPDFtoHTMLFile(outHtmlFile);
	}
	return encodedString;
}

imran.rafique · January 5, 2018, 11:47pm

@venkylam7,

We require the source PDF document to replicate the same error in our environment. Your response is awaited.