I have a PDF file that I need to convert to HTML to translate the content. The extraction renders HTML and a number of jpeg files. I modify the HTML and replace the old HTML and try to reconstruct. However, the new PDF document renders with the html content and the jpegs on separate pages.
Are you using this app for PDF to HTML conversion, or our stand-alone API? Could you please share more details on this scenario, along with the problematic files?
Please note that Aspose.Words mimics the behavior of MS Word. If you convert Word document to HTML using MS Word, you will get the same output.
We have converted the Word document to HTML using the latest version of Aspose.Words and have not found the shared issue. The output generated by Aspose.Words looks better than output generated by MS Word.
We are checking this scenario and will get back to you soon.
As per our understanding of your scenario, you are doing:
Convert PDF to Word file using Aspose.PDF
Convert Word file to HTML using Aspose.Words
Rendering the obtained HTML in browser
At the last step, you are facing an issue where images are rendering below the text content (as you shared in the image).
Please note that Aspose.PDF for Java alone can be used to convert PDF to HTML directly. Please try the code snippet below, which converts the source PDF into a single HTML file with all resources embedded in it. You can render the resulting HTML in a browser without any issue:
Document doc = new Document(dataDir + "SD_Aspose.pdf");
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
// this is just an optimization for IE and can be omitted
newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.RemoveEmptyAreasOnTopAndBottom = true;
doc.save(dataDir + "sample.html", newOptions);
Please feel free to let us know if we missed something or did not understand your requirements correctly. We will share our feedback with you accordingly.
The reason I am converting PDF to Word first is that I am going to replace the content with translated text. I tried PDF to HTML directly, but that produced multiple spans, and when I replaced the text with longer content it bled into other cells.
Therefore, for my use case I have to do PDF -> DOCX -> HTML
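For illustration, the replacement step on the HTML could look roughly like the sketch below. It is only a sketch under assumptions: `HtmlTextTranslator` and the `translate` callback are hypothetical names standing in for whatever translation tool is used, not part of any Aspose API, and the regex treats everything between `>` and `<` as text, so script or style content would need to be excluded by a real HTML parser.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any Aspose API.
class HtmlTextTranslator {
    // A run of characters that sits between a closing '>' and the next '<'.
    private static final Pattern TEXT_NODE = Pattern.compile("(?<=>)[^<>]+(?=<)");

    // Replaces each non-blank text run with translate(text), leaving tags untouched.
    static String translateTextNodes(String html, UnaryOperator<String> translate) {
        Matcher m = TEXT_NODE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = m.group();
            // Keep whitespace-only runs as they are so the layout is not disturbed.
            String replacement = text.trim().isEmpty() ? text : translate.apply(text);
            // quoteReplacement prevents '$' or '\' in the translation from being
            // interpreted as regex group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the markup between text runs is copied through unchanged, table cells stay intact, although a much longer translation can still change how a cell wraps.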
Could you please share how you are replacing the content with translated text? Are you using Aspose.Words for the purpose? Please share the code snippet for it which you are using. Also, please share the edited/updated word document which you are obtaining after replacing the content. We will convert it into HTML using Aspose.Words and try to observe the issue in our environment to address it accordingly.
We will not use Aspose.Words for that, but a translation tool.
Here is the code I use to convert to DOCX. I also tried it without saveOptions, and with the code above. SD_Aspose.zip (20.9 KB)
public static void convertPDFToWord() {
try {
// Load source PDF file
com.aspose.pdf.Document doc = new com.aspose.pdf.Document("SD_Aspose.pdf");
// Instantiate DocSaveOptions instance
DocSaveOptions saveOptions = new DocSaveOptions();
// Set output format
saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
// Set the recognition mode as EnhancedFlow
saveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);
// Set the horizontal proximity as 2.5
saveOptions.setRelativeHorizontalProximity(2.5f);
// Enable bullets recognition during conversion process
saveOptions.setRecognizeBullets(true);
// Save resultant DOCX file
doc.save("SD_Aspose.docx", saveOptions);
} catch (Exception ex) {
System.out.println(ex);
}
}
We have checked the .docx file shared by you and noticed that formatting was incorrect. Furthermore, we have used the below code snippet to convert PDF into DOCX and obtained the attached output Word file:
Document doc = new Document(dataDir + "SD_Aspose.pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
saveOption.setRecognizeBullets(true);
doc.save(dataDir + "Sample_21.1.docx", saveOption);
Would you please check it and process it? Please let us know if you face any issue while replacing content inside it. We will then assist you further.
It was the same as the source .docx file. It seems you are getting quite different results because you are using the evaluation version. Please obtain a free 30-day temporary license and use it to test the scenario again with the complete code snippet below, and let us know if you still notice any issue. We will then assist you further:
// using Aspose.PDF
Document doc = new Document(dataDir + "SD_Aspose.pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
saveOption.setRecognizeBullets(true);
doc.save(dataDir + "Sample_21.1.docx", saveOption);
// using Aspose.Words
com.aspose.words.Document document = new com.aspose.words.Document(dataDir + "Sample_21.1.docx");
// Save the output file
document.save(dataDir + "SD_Aspose1.html", com.aspose.words.SaveFormat.HTML);
Thank you for your continued help and the samples. However, the HTML is still different from what I get when I use Convert Files Online - Word, PDF, HTML, JPG And Many More for pdf → docx → html. I am attaching the output. If you look at that HTML output, it creates just 2 png files and has the table structure as part of the HTML.
In your html output, there are 10 jpeg files which seem to contain the tables not in the html.
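To compare the two outputs in numbers rather than by eye, a quick check along these lines could count the image references and table tags in each HTML file. This is a rough sketch: `HtmlOutputStats` is a hypothetical helper name, and the regexes assume quoted `src` attributes rather than handling every valid HTML form.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of any Aspose API.
class HtmlOutputStats {
    // Captures the quoted src value of each <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern TABLE_TAG =
            Pattern.compile("<table\\b", Pattern.CASE_INSENSITIVE);

    // Counts <img> references grouped by lower-cased file extension.
    // Sources without a '.' (for example most embedded data: URIs) go under "(none)".
    static Map<String, Integer> countImagesByExtension(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1);
            int dot = src.lastIndexOf('.');
            String ext = dot >= 0
                    ? src.substring(dot + 1).toLowerCase(Locale.ROOT)
                    : "(none)";
            counts.merge(ext, 1, Integer::sum);
        }
        return counts;
    }

    // Counts opening <table> tags, i.e. tables kept as real HTML structure.
    static int countTables(String html) {
        Matcher m = TABLE_TAG.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

Running both methods on each HTML file (read, for example, with `Files.readAllBytes`) shows whether the tables ended up as `<table>` markup or were rasterized into the image files.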
Please note that the online utilities use the .NET versions of the APIs, and yes, there is a difference between the results of the Aspose.Words App and the Aspose.PDF App because they use different APIs in the code behind.
Furthermore, would you please let us know if this is the expected result which you actually require by using a Java program at your end?