In memory PDF to HTML conversion

guo.maleo · September 19, 2017, 4:09pm

Hi,
We are evaluating the Aspose products, and we have a problem with PDF to HTML conversion.

We have millions of doc/docx/pdf files in a database, and we run a batch job that reads those files from the database, converts them to HTML, and writes the HTML back to the database.

We distinguish the file type and, based on it, use either Aspose.Words or Aspose.PDF. Because we don’t store the files, all conversions happen in memory. Aspose.Words works great for converting Word files to HTML, but with Aspose.PDF, after processing several hundred records we get the following JVM error and the batch job is terminated:

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j sun.font.T2KFontScaler.getGlyphImageNative(Lsun/font/Font2D;JJI)J+0
j sun.font.T2KFontScaler.getGlyphImage(JI)J+26
j sun.font.FileFont.getGlyphImage(JI)J+6
j sun.font.FileFontStrike.getGlyphImagePtr(I)J+115
j sun.font.FileFontStrike.getGlyphMetrics(IZ)Ljava/awt/geom/Point2D$Float;+29
j sun.font.FileFontStrike.getGlyphMetrics(I)Ljava/awt/geom/Point2D$Float;+3
v ~StubRoutines::call_stub
j sun.font.SunLayoutEngine.nativeLayout(Lsun/font/Font2D;Lsun/font/FontStrike;[FII[CIIIIIIILjava/awt/geom/Point2D$Float;Lsun/font/GlyphLayout$GVData;JJ)V+0
j sun.font.SunLayoutEngine.layout(Lsun/font/FontStrikeDesc;[FIILsun/font/TextRecord;ILjava/awt/geom/Point2D$Float;Lsun/font/GlyphLayout$GVData;)V+98
j sun.font.GlyphLayout$EngineRecord.layout()V+95
j sun.font.GlyphLayout.layout(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[CIIILsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector;+541
j sun.font.ExtendedTextSourceLabel.createGV()Lsun/font/StandardGlyphVector;+63
j sun.font.ExtendedTextSourceLabel.getGV()Lsun/font/StandardGlyphVector;+9
j sun.font.ExtendedTextSourceLabel.createLogicalBounds()Ljava/awt/geom/Rectangle2D;+1
j sun.font.ExtendedTextSourceLabel.getAdvance()F+9
j java.awt.font.TextLine.init()V+609
j java.awt.font.TextLine.<init>(Ljava/awt/font/FontRenderContext;[Lsun/font/TextLineComponent;[F[CII[I[BZ)V+79
j java.awt.font.TextMeasurer.makeTextLineOnRange(II)Ljava/awt/font/TextLine;+78
j java.awt.font.TextMeasurer.getLayout(II)Ljava/awt/font/TextLayout;+26
j com.aspose.pdf.internal.p784.z18.m17()Lcom/aspose/pdf/internal/p784/z16;+411
j com.aspose.pdf.internal.p784.z18.m15()Lcom/aspose/pdf/internal/p784/z16;+215
j com.aspose.pdf.internal.p779.z42.m1(Ljava/lang/String;Lcom/aspose/pdf/internal/p779/z15;Lcom/aspose/pdf/internal/p779/z191;FF[IZZZ)Lcom/aspose/pdf/internal/p779/z185;+735
j com.aspose.pdf.internal.p779.z42.m1(Ljava/lang/String;Lcom/aspose/pdf/internal/p779/z15;Lcom/aspose/pdf/internal/p779/z191;FF[I)Lcom/aspose/pdf/internal/p779/z185;+13
j com.aspose.pdf.internal.p779.z42.m1(Ljava/lang/String;Lcom/aspose/pdf/internal/p779/z15;Lcom/aspose/pdf/internal/p779/z40;Lcom/aspose/pdf/internal/p779/z191;)Lcom/aspose/pdf/internal/p779/z185;+10
j com.aspose.pdf.internal.p229.z35.m1(Ljava/lang/String;Lcom/aspose/pdf/internal/p779/z15;)Lcom/aspose/pdf/internal/p779/z185;+73
j com.aspose.pdf.internal.p229.z35.m1(Ljava/lang/String;Lcom/aspose/pdf/internal/p232/z5;)Lcom/aspose/pdf/internal/p779/z185;+15
J com.aspose.pdf.internal.p180.z9.m4(Lcom/aspose/pdf/internal/p182/z10;Lcom/aspose/pdf/internal/p182/z10;)Z
J com.aspose.pdf.internal.p180.z9.m1(Lcom/aspose/pdf/internal/p182/z10;Lcom/aspose/pdf/internal/p182/z10;)Z
J com.aspose.pdf.internal.p180.z10.m2(Lcom/aspose/pdf/internal/p182/z28;)Z
j com.aspose.pdf.internal.p180.z10.m1(Lcom/aspose/pdf/internal/p182/z28;)V+14
j com.aspose.pdf.internal.p180.z10.m1(Lcom/aspose/pdf/internal/p182/z28;F)V+15
j com.aspose.pdf.internal.p180.z5.m1(Lcom/aspose/pdf/internal/p182/z28;)V+7
j com.aspose.pdf.internal.p177.z1.m1(Lcom/aspose/pdf/internal/p176/z14;FFIZLcom/aspose/pdf/internal/p177/z1;)Lcom/aspose/pdf/internal/p182/z4;+426
j com.aspose.pdf.internal.p177.z1.m1(Lcom/aspose/pdf/internal/p176/z14;FFLcom/aspose/pdf/internal/p177/z1;)Lcom/aspose/pdf/internal/p182/z4;+19
j com.aspose.pdf.internal.p177.z1.m1(Lcom/aspose/pdf/internal/p176/z14;Lcom/aspose/pdf/internal/p779/z185;)V+11
j com.aspose.pdf.internal.p176.z3.m1(Lcom/aspose/pdf/internal/ms/System/Collections/Generic/z16;Lcom/aspose/pdf/internal/p179/z3;Lcom/aspose/pdf/internal/p176/z21;)V+195
j com.aspose.pdf.internal.p156.z9.m1(Ljava/lang/String;Lcom/aspose/pdf/internal/p157/z2;Lcom/aspose/pdf/internal/ms/System/Collections/Generic/IGenericList;Lcom/aspose/pdf/internal/p156/z16;Lcom/aspose/pdf/internal/foundation/rendering/z46;)V+128
j com.aspose.pdf.z94.m1(Lcom/aspose/pdf/ApsUsingConverter$z1;Ljava/lang/String;Lcom/aspose/pdf/internal/ms/System/IO/Stream;ZLcom/aspose/pdf/HtmlSaveOptions;)V+100
j com.aspose.pdf.z94.m1(Lcom/aspose/pdf/IDocument;Ljava/lang/String;Lcom/aspose/pdf/internal/ms/System/IO/Stream;Lcom/aspose/pdf/HtmlSaveOptions;)V+207
j com.aspose.pdf.ADocument.save(Ljava/lang/String;Lcom/aspose/pdf/SaveOptions;)V+117
j com.aspose.pdf.Document.save(Ljava/lang/String;Lcom/aspose/pdf/SaveOptions;)V+3
j com.rhi.spotlight.ResumeConverterMap.convertPdf(Lcom/aspose/pdf/Document;)Ljava/util/Map;+130
j com.rhi.spotlight.ResumeConverterMap.convertResume([BLjava/lang/String;)Ljava/util/Map;+302
j com.rhi.spotlight.ResumeConverterMap.handleRequest(Lcom/wccgroup/edr/api/maps/LookupRequest;)V+157
j com.rhi.spotlight.ThreadedBaseMap$1.run()V+8
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub

It looks like it has something to do with the fonts, though I am not 100% sure. We actually don’t care about the fonts in the PDF. So my questions are:

Is there a way to fix the issue above?

To optimize storage, we don’t want to keep the fonts from the PDF. How can we do that if we only want in-memory conversion? Below is the code I use to convert the PDF to HTML and extract the plain text from the PDF:

public static Map<String, String> convertPdf(com.aspose.pdf.Document pdf) {
    Map<String, String> result = new HashMap<String, String>();
    String text = "";
    String html = "";

    // Extract the plain text
    TextAbsorber textAbsorber = new TextAbsorber();
    textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.getExtractionOptions().setScaleFactor((double) 0.5);
    pdf.getPages().accept(textAbsorber);
    text = textAbsorber.getText();

    // Convert to HTML, capturing the page markup in memory
    ByteArrayOutputStream htmlStream = new ByteArrayOutputStream();

    com.aspose.pdf.HtmlSaveOptions pdf2HtmlOptions = new com.aspose.pdf.HtmlSaveOptions(HtmlDocumentType.Html5);
    pdf2HtmlOptions.RasterImagesSavingMode = com.aspose.pdf.HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    pdf2HtmlOptions.PartsEmbeddingMode = com.aspose.pdf.HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;
    pdf2HtmlOptions.FontSavingMode = com.aspose.pdf.HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
    pdf2HtmlOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    pdf2HtmlOptions.setSplitIntoPages(false);
    pdf2HtmlOptions.CustomHtmlSavingStrategy = new InMemoryHtmlPageMarkupSavingStrategy(htmlStream);

    // Discard external resources (fonts, images) instead of writing them to disk
    pdf2HtmlOptions.CustomResourceSavingStrategy = new ResourceSavingStrategy() {
        @Override
        public String invoke(ResourceSavingInfo savingInfo) {
            return null;
        }
    };

    // The file name is a placeholder; the markup goes to htmlStream
    pdf.save("dummy.html", pdf2HtmlOptions);

    result.put("text", text);
    result.put("html", htmlStream.toString());
    try {
        htmlStream.close();
        pdf.dispose();
        com.aspose.pdf.MemoryCleaner.clearAllTempFiles();
    } catch (IOException e) {
        // Ignored: closing a ByteArrayOutputStream has no effect
    }
    return result;
}

imran.rafique · September 20, 2017, 4:02am

@guo.maleo,

It is difficult to say anything about the JVM crash without investigating it. Please try the latest JVM and Aspose.Pdf for Java 17.8. If this does not help, kindly create a small application project that reproduces the error in your environment and send us a Zip of the project. We will investigate and share our findings with you.

You can remove embedded fonts from the PDF; please refer to this help topic: Optimize PDF file size.

guo.maleo · September 21, 2017, 7:45pm

We also have a small program running, but the issue doesn’t occur in it.

Another option is to ignore the fonts. During the conversion we care only about the layout and the content; we can even ignore the images in the PDF. I have been searching the internet but haven’t found much useful information.

I like your product, and it does a great job with the doc/docx/rtf formats, but I still have concerns about PDF before I make the final decision.
Basically, we need to convert PDF to HTML in memory, ignoring the fonts and even the images.
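
The stack trace in this thread shows the conversion running on a ThreadPoolExecutor worker, while the small standalone program does not crash. One way to test whether concurrency plays a role (only a hypothesis; nothing in this thread confirms it) is to route every PDF conversion through a single dedicated thread while the rest of the batch stays parallel. The sketch below uses only the JDK; `SerializedConverter` is a hypothetical helper name, and the conversion itself is passed in as a function.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper: runs every conversion on one dedicated thread,
// no matter which worker thread asks for it.
final class SerializedConverter<I, O> implements AutoCloseable {
    private final ExecutorService single = Executors.newSingleThreadExecutor();
    private final Function<I, O> delegate;

    SerializedConverter(Function<I, O> delegate) {
        this.delegate = delegate;
    }

    // Blocks the caller until the dedicated thread has finished the conversion.
    O convert(I input) {
        try {
            return single.submit(() -> delegate.apply(input)).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for conversion", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("conversion failed", e.getCause());
        }
    }

    @Override
    public void close() {
        single.shutdown();
    }
}
```

In the batch job, the function passed to the constructor would wrap the existing convertPdf call. If the crash persists even with serialized conversions, concurrency is probably not the cause.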

imran.rafique · September 21, 2017, 10:21pm

@guo.maleo,
After loading an input PDF into a Document object, you can remove embedded fonts, graphic objects, and images. The OptimizationOptions class offers the UnembedFonts property to remove all embedded fonts. To remove images, iterate through the image resources of each page; the XImageCollection class offers a Delete method that takes the index of the image. Please refer to these help topics: Optimize PDF Document, Delete Images from a PDF file, and Remove Graphics objects using operator classes.

Remove images:
[C#]

// Open document
Document pdfDocument = new Document(dataDir + "DeleteImages.pdf");
// Delete from the highest index down so earlier deletions do not shift later indices
foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
    for (int i = page.Resources.Images.Count; i >= 1; i--)
        page.Resources.Images.Delete(i);
pdfDocument.Save(dataDir + "Output.pdf");
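
A note on loops that delete by index: if the collection renumbers its remaining items after each deletion, as ordinary lists do, the direction of the loop matters. The sketch below uses a plain java.util.ArrayList as a stand-in for the image collection (an assumption; it is 0-based, unlike the 1-based collection above) to show the difference between deleting upward and deleting downward.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexDeletion {
    // Counting up: each removal shifts the remaining items down by one,
    // so the loop steps over every other element and leaves some behind.
    static List<String> deleteCountingUp(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        for (int i = 0; i < copy.size(); i++) {
            copy.remove(i);
        }
        return copy;
    }

    // Counting down: indices below i are unaffected by removing item i,
    // so every element is visited and removed.
    static List<String> deleteCountingDown(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        for (int i = copy.size() - 1; i >= 0; i--) {
            copy.remove(i);
        }
        return copy;
    }
}
```

With four items, counting up leaves the second and fourth in place, while counting down empties the list.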

guo.maleo · September 25, 2017, 5:17pm

Thanks a lot for your quick response.

I tried the approach you mentioned, but it looks like font files are still generated if I enable the following option:
pdf2HtmlOptions.PartsEmbeddingMode = com.aspose.pdf.HtmlSaveOptions.PartsEmbeddingModes.EmbedCssOnly;

Ideally, we just need to convert the PDF to HTML without any fonts, whether embedded in the HTML or not, but the CSS is important to us because we don’t want to break the layout.

I have also been testing the performance of the tool. It is fast with Word (doc/docx/rtf) but still not fast enough with PDF.

I am considering extracting the text from PDF, and converting the text to html:

    TextAbsorber textAbsorber = new TextAbsorber();
    textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.getExtractionOptions().setScaleFactor((double) 0.5);
    pdf.getPages().accept(textAbsorber);
    text = textAbsorber.getText();

Do you think that’s a good idea? We just need the text with the layout and style intact; we don’t need the fonts or images at all.
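
For the "extract text, then wrap it as HTML" idea, a minimal sketch using only the JDK could look like the following. It assumes the extracted text is already a String; `PlainTextToHtml` is a hypothetical helper, not part of any Aspose API. It escapes HTML special characters and wraps the text in a pre element so line breaks and spacing survive, but it cannot recover bold, font size, or other styling, since plain text does not carry them.

```java
// Hypothetical helper for turning extracted plain text into a minimal HTML page.
final class PlainTextToHtml {
    // Escape the characters that have special meaning in HTML.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap the escaped text in <pre> so whitespace and line breaks are preserved.
    static String toHtml(String text) {
        return "<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head><body>\n<pre>"
                + escape(text)
                + "</pre>\n</body></html>";
    }
}
```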

guo.maleo · September 25, 2017, 5:36pm

Also, the documents are in different languages such as Spanish, English, French, and German, and I noticed that some special characters come out unrecognizable.

imran.rafique · September 25, 2017, 8:44pm

@guo.maleo,
Kindly create use cases and share the complete details of each use case, including source PDF and code. We will investigate and share our findings with you.

With that approach you retrieve only the plain text, without HTML tags. If you do not need formatting information such as headings, then it is fine.

guo.maleo · September 25, 2017, 9:51pm

Yep, we tried extracting the formatted text from the PDF and then converting it to HTML. It has great performance, but we lose font decoration such as weight/bold, and from time to time we get question marks in the HTML.

The other approach is converting the PDF directly to HTML in memory, but even after I optimized the PDF and its resources, it still outputs fonts.

Also, is it ok to talk to you via email? I can include my supervisor into the conversation.

imran.rafique · September 26, 2017, 12:18pm

@guo.maleo,
Please note that we do not provide support by email because it becomes difficult to keep track of emails in an organized way. You can ask your supervisor to communicate with us through the forum threads. We require your source PDF and code. We will investigate your scenario in our environment and share our findings with you.

guo.maleo · September 26, 2017, 3:30pm

Resume.pdf (259.2 KB)

Sounds good. I have uploaded a sample document to you, and the converted result. converted.zip (3.3 KB)

public static Map<String, String> convertPdf(com.aspose.pdf.Document pdf) {
    Map<String, String> result = new HashMap<String, String>();
    String text = "";
    String html = "";
    result.put("text", text);
    result.put("html", html);
    
    //pdf.optimize();
    pdf.optimizeResources();
    
    TextAbsorber textAbsorber = new TextAbsorber();
    textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
    //textAbsorber.getExtractionOptions().setScaleFactor((double) 0.5);
    pdf.getPages().accept(textAbsorber);
    text = textAbsorber.getText();
    result.put("text", text);
    
    ByteArrayOutputStream htmlStream = new ByteArrayOutputStream();
   
    try {
        com.aspose.words.Document textDoc = new com.aspose.words.Document(new ByteArrayInputStream(text.getBytes()));
        textDoc.save(htmlStream, com.aspose.words.SaveFormat.HTML);
        html = htmlStream.toString("UTF-8");
        htmlStream.close();
        result.put("html", html);
    } catch (Exception e) {
        e.printStackTrace();
    }
    
    return result;
}

This is the code I have been using. This time I didn’t see the question marks, but I think we could get better conversion quality if we could convert PDF to HTML directly in memory without outputting fonts.
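
One possible contributor to stray question marks in multilingual text, separate from anything the PDF library does, is converting between String and bytes without naming a charset: `text.getBytes()` and `htmlStream.toString()` use the platform default, which on some JVMs is not UTF-8. This is only a possibility, not a diagnosis of the output above. The JDK-only sketch below (`CharsetRoundTrip` is a hypothetical name) shows that characters a charset cannot represent are replaced with `?` when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {
    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced with '?' on encoding.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String sample = "Se\u00f1or \u2022 item"; // n-tilde and a bullet character
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8));
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));
    }
}
```

UTF-8 preserves both characters, ISO-8859-1 keeps the n-tilde but turns the bullet into a question mark, and US-ASCII loses both. Passing `StandardCharsets.UTF_8` explicitly on both the encoding and decoding side avoids depending on the platform default.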

imran.rafique · September 27, 2017, 1:40am

@guo.maleo,
For the Aspose.Pdf for Java API, we have logged an enhancement under the ticket ID PDFJAVA-37115 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any updates.

guo.maleo · September 27, 2017, 4:54pm

@imran.rafique
Thanks so much for your quick response. We have also been investigating the conversion on our side. We found that those characters should be bullet characters. It looks like when we extract the PDF to text and then convert it to HTML, we are not able to handle those bullet characters correctly.

Even though we can convert the PDF really quickly that way, converting the PDF directly to HTML without outputting fonts and without crashing the JVM would be a better solution, because it preserves font decoration such as size, style, and weight.

guo.maleo · September 27, 2017, 5:02pm

Also, we have another document, which doesn’t look good after the conversion. aspose-evaluation.zip (123.7 KB)

Is there a way to handle this case?

imran.rafique · September 28, 2017, 1:02am

@guo.maleo,
You are extracting text from the PDF, and the Aspose.Pdf API returns an empty character in place of the bullet character. This is the correct behavior: if you manually copy and paste text from a PDF into Notepad, you will find an empty character there, shown as a question mark. Furthermore, we will notify you once the linked ticket ID PDFJAVA-37115 is resolved.

You are converting a Word document to HTML with the Aspose.Words API, so we have posted your query in the Aspose.Words forum. One of our colleagues will assist you there soon.

Forum thread:

guo.maleo · September 28, 2017, 3:44pm

Thanks very much. It looks like we have decided to use your product, so we will have more time to resolve the conversion issues.

imran.rafique · September 28, 2017, 10:27pm

@guo.maleo,
It is nice to hear this from you. We recommend that our clients post each problem proactively, with complete details, in the Aspose site forums.

guo.maleo · September 29, 2017, 6:28pm

Is there any performance difference between Windows and Linux systems? We observed that the batch job doesn’t perform well on Linux.

imran.rafique · September 29, 2017, 11:22pm

@guo.maleo,
Performance depends on various factors, including the complexity of the Visio drawing, the drawing size, and system memory. We recommend that our clients remove unused masters, reduce the number of group shapes, split a multi-page drawing into separate drawings, and remove unused themes, data graphics, and styles. This is because the Aspose.Diagram API loads each drawing into main memory and then performs manipulation tasks.

However, if the JDK, batch job, and Aspose.Diagram for Java API versions are the same, and you think that the performance is degraded because of the environment, then please share the complete details of the scenario and the Visio drawings. We will investigate and share our findings with you.