Reconstructing PDF -> HTML -> PDF

Hi,

I have a PDF file that I need to convert to HTML to translate the content. The extraction renders HTML and a number of jpeg files. I modify the HTML and replace the old HTML and try to reconstruct. However, the new PDF document renders with the html content and the jpegs on separate pages.
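
For the modify-and-replace step described above, one possible way to swap in translated text without touching the image references is to rewrite only the text that sits between tags. The following is a minimal sketch using only the JDK. The class name HtmlTextTranslator and the glossary map are hypothetical, and the regex does not handle scripts, styles, entities, or attribute values that contain angle brackets, so a real HTML parser would likely be safer for production use.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces text nodes using a lookup map and leaves tags as-is.
final class HtmlTextTranslator {
    // Text between a closing '>' and the next '<' (the '<' is not consumed).
    private static final Pattern TEXT_NODE = Pattern.compile(">([^<>]+)(?=<)");

    static String translate(String html, Map<String, String> glossary) {
        Matcher m = TEXT_NODE.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String text = m.group(1);
            String key = text.trim();
            String translated = glossary.get(key);
            // Copy everything up to the text node unchanged (tags, attributes, images).
            out.append(html, last, m.start(1));
            // Keep surrounding whitespace; swap only the trimmed text if it is known.
            out.append(translated != null ? text.replace(key, translated) : text);
            last = m.end(1);
        }
        out.append(html, last, html.length());
        return out.toString();
    }
}
```

Text that has no entry in the glossary is left unchanged, so the output can be checked incrementally.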

@anubha16

Are you using this app for PDF to HTML conversion or our stand-alone API? Could you please share more details on this scenario along with the problematic files?

I am using the Java APIs. Attached are the pdf, docx (SDAsposePDFWord.zip) and html (SDAsposeHTML.zip) files.

SDAsposePDFWord.zip (298.6 KB)

  1. First I convert PDF to Word
  2. Then convert Word to HTML

public static void convertPDFToWord() {
    try {
        // Load source PDF file
        com.aspose.pdf.Document doc = new com.aspose.pdf.Document("SD_Aspose.pdf");
        // Save as DOCX (fully qualified to avoid a clash with com.aspose.words.SaveFormat)
        doc.save("SD_Aspose.docx", com.aspose.pdf.SaveFormat.DocX);
    } catch (Exception ex) {
        System.out.println(ex);
    }
}

public static void convertWordHTML() {
    try {
        // Load the DOCX with Aspose.Words (fully qualified, since com.aspose.pdf also has a Document class)
        com.aspose.words.Document doc = new com.aspose.words.Document("SD_Aspose.docx");
        String dataDir = "SDAspose/";
        String outHtmlFile = "SD_Aspose.html";
        // Save the output file
        doc.save(dataDir + outHtmlFile, com.aspose.words.SaveFormat.HTML);
    } catch (Exception ex) {
        System.out.println(ex);
    }
}
SDAsposeHTML.zip (90.6 KB)
SDAsposePDFWord.zip (299 KB)

I have to convert to docx first because I need to translate the text and I am converting to html because I need to display it in a browser.

I used the online tool "Convert Files Online - Word, PDF, HTML, JPG And Many More" to convert the PDF and got the files attached below, which look good. So, what am I doing wrong when using the Java SDK?

SD_Aspose.docx.zip (17.6 KB)

SD_Aspose.zip (7.6 KB)

@anubha16

Please note that Aspose.Words mimics the behavior of MS Word. If you convert a Word document to HTML using MS Word, you will get the same output.

We have converted the Word document to HTML using the latest version of Aspose.Words and could not reproduce the issue you reported. The output generated by Aspose.Words looks better than the output generated by MS Word.

We are checking this scenario and will get back to you soon.

Is there some configuration that I need to set when I convert the docx to html when using the Java API?

Attached is a screenshot of what I see when I open the HTML file.

Screen Shot 2021-02-18 at 8.55.03 AM.png (126.9 KB)

@anubha16

As per our understanding of your scenario, you are doing:

  • Convert PDF to Word file using Aspose.PDF
  • Convert Word file to HTML using Aspose.Words
  • Rendering the obtained HTML in browser

At the last step, you are facing an issue where images are rendering below the text content (as you shared in the image).

Please note that Aspose.PDF for Java alone can convert PDF to HTML directly. Please try the code snippet below, which converts the source PDF into a single HTML file with all resources embedded in it. You can render the resulting HTML in a browser without any issue:

Document doc = new Document(dataDir + "SD_Aspose.pdf");

HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
// this is just an optimization for IE and can be omitted
newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.RemoveEmptyAreasOnTopAndBottom = true;

doc.save(dataDir + "sample.html", newOptions);

sample.zip (282.9 KB)

Please feel free to let us know if we missed something or did not understand your requirements correctly. We will share our feedback with you accordingly.

Yes, that is correct.

As per our understanding of your scenario, you are doing:

Convert PDF to Word file using Aspose.PDF
Convert Word file to HTML using Aspose.Words
Rendering the obtained HTML in browser

The reason I am doing PDF to Word first is because I am going to replace the content with translated text. I tried PDF to HTML but that gave multiple spans and if I replaced with longer content the text bled into other cells.
Therefore, for my use case I have to do PDF -> DOCX -> HTML
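
To put a number on the fragmentation described above, one option is to count the span elements and the absolutely positioned elements in each HTML output before deciding which pipeline to translate. This is a rough sketch using only the JDK. The class and method names are hypothetical, and it assumes the fixed-layout output positions text with inline position:absolute styles, which may not hold for every export setting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rough indicators of how fragmented an HTML export is.
final class SpanFragmentation {
    private static final Pattern SPAN_OPEN =
            Pattern.compile("<span\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern ABSOLUTE_POSITION =
            Pattern.compile("position\\s*:\\s*absolute", Pattern.CASE_INSENSITIVE);

    // Counts non-overlapping matches of the pattern in the given text.
    private static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Number of opening <span> tags.
    static int spanCount(String html) {
        return count(SPAN_OPEN, html);
    }

    // Number of "position: absolute" declarations (inline styles or CSS rules).
    static int absolutelyPositionedCount(String html) {
        return count(ABSOLUTE_POSITION, html);
    }
}
```

A high span count relative to the amount of visible text suggests that longer translated strings are likely to overflow their boxes, which would match the bleeding into other cells.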

@anubha16

Could you please share how you are replacing the content with translated text? Are you using Aspose.Words for the purpose? Please share the code snippet for it which you are using. Also, please share the edited/updated word document which you are obtaining after replacing the content. We will convert it into HTML using Aspose.Words and try to observe the issue in our environment to address it accordingly.

We will not use Aspose.Words for this but a translation tool.

Here is the code to convert to docx. I also tried it without saveOptions, using the code above. SD_Aspose.zip (20.9 KB)

public static void convertPDFToWord() {
    try {
        // Load source PDF file
        com.aspose.pdf.Document doc = new com.aspose.pdf.Document("SD_Aspose.pdf");

        // Instantiate DocSaveOptions instance
        DocSaveOptions saveOptions = new DocSaveOptions();

        // Set output format
        saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);

        // Set the recognition mode as EnhancedFlow
        saveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);

        // Set the horizontal proximity as 2.5
        saveOptions.setRelativeHorizontalProximity(2.5f);

        // Enable bullets recognition during conversion process
        saveOptions.setRecognizeBullets(true);

        // Save resultant DOCX file
        doc.save("SD_Aspose.docx", saveOptions);
    } catch (Exception ex) {
        System.out.println(ex);
    }
}

The zip contains the docx file.

@anubha16

We have checked the .docx file shared by you and noticed that formatting was incorrect. Furthermore, we have used the below code snippet to convert PDF into DOCX and obtained the attached output Word file:

Document doc = new Document(dataDir + "SD_Aspose.pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
saveOption.setRecognizeBullets(true);
doc.save(dataDir + "Sample_21.1.docx", saveOption);

Sample_21.1.zip (59.3 KB)

Would you please check and process it? Please let us know if you face any issue while replacing content inside it. We will then assist you further.

Thank you, the docx looks better but when I try to convert to html I still don’t get a good result. Please see attached files.

SDApsoselatest.zip (387.4 KB)

Below is the code to convert to html

public static void convertWordHTML() {
    try {
        String dataDir = "./samples/";
        // Load the DOCX with Aspose.Words (fully qualified, since com.aspose.pdf also has a Document class)
        com.aspose.words.Document doc = new com.aspose.words.Document(dataDir + "SD_Aspose.docx");
        String outHtmlFile = "SD_Aspose1.html";
        // Save the output file
        doc.save(dataDir + outHtmlFile, com.aspose.words.SaveFormat.HTML);
    } catch (Exception ex) {
        System.out.println(ex);
    }
}

Also, could the issue I am facing be because I am using an evaluation copy of Aspose Words/PDF?

@anubha16

Please check the attached output HTML which we were able to obtain using licensed version of Aspose.Words for Java.

outputhtml.zip (701.9 KB)

It was the same as the source .docx file. It seems you are getting quite different results because you are using the evaluation version. Please obtain a free 30-day temporary license and test the scenario again with the complete code snippet below, and let us know if you still notice any issue. We will then assist you further:

// using Aspose.PDF
Document doc = new Document(dataDir + "SD_Aspose.pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
saveOption.setRecognizeBullets(true);
doc.save(dataDir + "Sample_21.1.docx", saveOption);
// using Aspose.Words
com.aspose.words.Document document = new com.aspose.words.Document(dataDir + "Sample_21.1.docx");
// Save the output file
document.save(dataDir + "SD_Aspose1.html", com.aspose.words.SaveFormat.HTML);
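
As a quick way to tell whether a given HTML output was produced in evaluation mode, the file can be scanned for the evaluation notice. This is only a sketch using the JDK. The class name is hypothetical, and the marker text "Evaluation Only" is an assumption about what the evaluation build writes into the output, so please adjust it to whatever notice actually appears in your files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper: detects an evaluation notice in generated HTML.
final class EvaluationMarkerCheck {
    // ASSUMPTION: evaluation-mode output contains this phrase; change it if your notice differs.
    static final String DEFAULT_MARKER = "Evaluation Only";

    // Case-insensitive search for the marker in the HTML text.
    static boolean containsMarker(String html, String marker) {
        return html.toLowerCase(Locale.ROOT).contains(marker.toLowerCase(Locale.ROOT));
    }

    // Reads the file as UTF-8 and searches it for the marker.
    static boolean fileContainsMarker(Path htmlFile, String marker) throws IOException {
        String html = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        return containsMarker(html, marker);
    }
}
```

If the notice is split across several elements in the markup, a plain substring search like this will miss it, so a negative result is not conclusive.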

Thank you for your continued help and the samples. However, the html is still different to what I get when I use Convert Files Online - Word, PDF, HTML, JPG And Many More for pdf → docx → html. I am attaching the output. If you see the html output it creates just 2 png files and has the table structure as part of the html.

In your html output, there are 10 jpeg files, which seem to contain the tables as images rather than as table markup in the html.
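
The difference described above can be checked mechanically by counting the table elements and the referenced images per file type in each HTML output. This is a minimal sketch using only the JDK. The class name HtmlStructureReport is hypothetical, and the regex-based scan is an approximation that assumes reasonably well-formed img tags.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: summarizes tables and image references in an HTML string.
final class HtmlStructureReport {
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern TABLE_OPEN =
            Pattern.compile("<table\\b", Pattern.CASE_INSENSITIVE);

    // Number of opening <table> tags, i.e. tables kept as real markup.
    static int tableCount(String html) {
        Matcher m = TABLE_OPEN.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Image references grouped by lower-cased file extension.
    static Map<String, Integer> imagesByExtension(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1);
            String key;
            if (src.startsWith("data:")) {
                key = "(embedded)"; // data URI, no separate file
            } else {
                int dot = src.lastIndexOf('.');
                key = dot >= 0 ? src.substring(dot + 1).toLowerCase(Locale.ROOT) : "(none)";
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

An output with zero tables and many jpeg references would be consistent with the tables having been rasterized, while an output with tables in the markup and only a couple of png files would match the online result.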

Below are my steps.

  1. Use Convert Files Online - Word, PDF, HTML, JPG And Many More for pdf → docx
  2. Download docx file
  3. Use Convert Files Online - Word, PDF, HTML, JPG And Many More for docx → html

It seems there is a difference when you use Convert Files Online - Word, PDF, HTML, JPG And Many More vs Convert PDF | Online and Free for pdf conversion.

SD_AsposeOnline0219.zip (201.8 KB)

@anubha16

Please note that the online utilities use the .NET versions of the APIs, and yes, there is a difference between the results of the Aspose.Words App and the Aspose.PDF App because they use different APIs behind the scenes.

Furthermore, would you please let us know if this is the expected result which you actually require by using a Java program at your end?

Yes, we need our Java code to produce the same results as the Aspose.Words online app.

Also, to convert the html back to pdf I am using Aspose.Words for Java, because Aspose.HTML for Java gives me an error.