Java: HTML to PDF parsing errors

Inception · June 7, 2010, 3:46am

I am trying to convert entire HTML pages to PDF. I keep getting parsing errors. My code looks something like this.

Pdf pdf = new Pdf();
Section section = pdf.getSections().add();

Text HTMLText1 = new Text(HTMLTextFromFile);
HTMLText1.setIsHtmlTagSupported(true);
section.getParagraphs().add(HTMLText1);

String outFilePDF = "d:/pdftest/SamplePDF_HTMLTest.pdf";
pdf.save(outFilePDF);

My HTML pages start with an xhtml1-transitional doctype:

...

The error I am getting is:

[Fatal Error] :1:16: A DOCTYPE is not allowed in content.

org.xml.sax.SAXParseException: A DOCTYPE is not allowed in content.

at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)

at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)

at aspose.pdf.xml.h.a(SourceFile:428)

at aspose.pdf.xml.h.a(SourceFile:2388)

at aspose.pdf.xml.ao.a(SourceFile:441)

at aspose.pdf.xml.n.a(SourceFile:759)

at aspose.pdf.xml.P.a(SourceFile:105)

at aspose.pdf.xml.w.a(SourceFile:112)

at aspose.pdf.Pdf.save(SourceFile:1142)

When I remove the doctype I get the following error:

[Fatal Error] :1:1793: The entity name must immediately follow the '&' in the entity reference.

org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.

at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)

at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)

at aspose.pdf.xml.h.a(SourceFile:428)

at aspose.pdf.xml.h.a(SourceFile:2388)

at aspose.pdf.xml.ao.a(SourceFile:441)

at aspose.pdf.xml.n.a(SourceFile:759)

at aspose.pdf.xml.P.a(SourceFile:105)

at aspose.pdf.xml.w.a(SourceFile:112)

at aspose.pdf.Pdf.save(SourceFile:1142)

A little help is appreciated. What am I doing wrong?

codewarior · June 8, 2010, 12:09am

Hello Ron,

Thanks for using our products.

I have tested the scenario using Aspose.Pdf for Java 2.6.0 and JDK 1.6.0_20 on Windows XP with the following HTML contents, and I am unable to reproduce the problem. I used the following code snippet to read the contents of the HTML file and perform the conversion.

[HTML]

<html xmlns="http://www.w3.org/1999/xhtml">

sample text

[Java]

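import java.io.BufferedReader;
import java.io.FileReader;

//Read the whole HTML file into memory
StringBuilder sb = new StringBuilder(1024);

BufferedReader reader = new BufferedReader(new FileReader("d:/pdftest/html_contents_.html"));
char[] chars = new char[1024];

int numRead = 0;
while ((numRead = reader.read(chars)) > -1)
{
    //Append only the characters actually read; appending the whole buffer
    //would add stale or NUL characters after a short read
    sb.append(chars, 0, numRead);
}
reader.close();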

//Instantiate Pdf object by calling its empty constructor
Pdf pdf1 = new Pdf();
//Create a new section in the Pdf object
Section sec1 = pdf1.getSections().add();

//Create a new text paragraph and pass the text to its constructor as argument
Text text1 = new Text(sec1, sb.toString());
text1.setIsHtmlTagSupported(true);
sec1.getParagraphs().add(text1);
pdf1.save("d:/pdftest/HTMLSample_File_to_PDF.pdf");

HTML to PDF conversion is in beta, and I am afraid the tag is currently not supported. Moreover, in order to convert an HTML file into PDF format, the HTML must be well formed (every starting tag needs a matching ending tag).

In case you still face any problem or have any further query, please feel free to contact us. We apologize for the inconvenience.
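
To check the "proper format" requirement before calling pdf.save(), one option is to run the markup through the same kind of JAXP parser that appears in the stack traces above. The following is only a minimal sketch: the class name WellFormedCheck and the method firstError are made up for illustration and are not part of Aspose.Pdf, and it assumes the markup carries no DOCTYPE (with one, the parser may try to fetch the external DTD).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class WellFormedCheck {

    // Returns null when the markup parses as XML, otherwise "line:column: message".
    static String firstError(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]".
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems.
            return e.toString();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p>Tom &amp; Jerry</p>",   // well formed
            "<p>Tom & Jerry</p>",       // bare ampersand
            "<p><b>unclosed</p>"        // mismatched tags
        };
        for (String s : samples) {
            String error = firstError(s);
            System.out.println(s + " -> " + (error == null ? "OK" : error));
        }
    }
}
```

The line and column it reports should make it easier to find the offending character in the HTML file than the trace from pdf.save() does.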

PS, I am using a console application.
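
For the two errors quoted in the question (the DOCTYPE and the '&' entity reference), a possible workaround is to clean the HTML string before handing it to the Text constructor. This is a rough sketch under stated assumptions, not an Aspose feature: HtmlPreClean, stripDoctype and escapeBareAmpersands are hypothetical names, the DOCTYPE pattern only handles declarations without an internal subset, and the ampersand pattern will also rewrite ampersands inside scripts or CDATA sections.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HtmlPreClean {

    // A simple DOCTYPE declaration, e.g. the xhtml1-transitional one.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);

    // An '&' that does not start a named, decimal or hexadecimal reference.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Removes the first DOCTYPE declaration, if any.
    static String stripDoctype(String html) {
        return DOCTYPE.matcher(html).replaceFirst("");
    }

    // Replaces every bare '&' with "&amp;" and leaves existing references alone.
    static String escapeBareAmpersands(String html) {
        return BARE_AMP.matcher(html).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<p><a href=\"page?a=1&b=2\">Tom & Jerry &amp; co.</a></p>";
        System.out.println(escapeBareAmpersands(stripDoctype(html)));
    }
}
```

The cleaned string can then be passed to new Text(...) in place of the raw file contents. Named entities such as &nbsp; are left untouched here and may still be rejected once the DOCTYPE that declares them is gone.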