Reconstructing PDF -> HTML -> PDF

asad.ali · February 19, 2021, 8:19pm

@anubha16

We are looking into the details of this scenario from Aspose.Words for Java perspective and will get back to you soon.

anubha16 · February 19, 2021, 8:20pm

Thank you so much.

anubha16 · February 19, 2021, 8:58pm

Also, we can work with pdf -> html conversion if there is a way to create the table html.

Currently when we do pdf to html conversion we get just spans and no table html formatting.
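
One quick way to confirm what a converted HTML file actually contains is to count its opening tags. The following is only a minimal sketch using the Java standard library; the class name `HtmlTagCount` and the method `countTags` are hypothetical helpers and not part of any Aspose API. It treats the HTML as plain text, so it is a rough check rather than a real parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any Aspose library.
class HtmlTagCount {
    // Counts opening tags with the given names in an HTML string (case-insensitive).
    static Map<String, Integer> countTags(String html, String... names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            // Require whitespace, '>' or '/' after the name so "<tr" does not match "<track".
            Pattern p = Pattern.compile("<" + Pattern.quote(name) + "(\\s|>|/)",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(html);
            int n = 0;
            while (m.find()) {
                n++;
            }
            counts.put(name, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample markup made up for illustration: spans only, no table structure.
        String html = "<div><span>Lot</span><span>42</span></div>";
        System.out.println(countTags(html, "table", "tr", "td", "span", "img"));
    }
}
```

A result with zero `table`, `tr` and `td` counts but many `span` or `img` counts matches the symptom described above.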

anubha16 · February 19, 2021, 10:31pm

Also, can we use the Java SDKs in AWS Lambda?

asad.ali · February 20, 2021, 6:57pm

@anubha16

Thanks for your patience.

We have made an initial investigation and found that the output HTML at the final step of your procedure, i.e. converting Word to HTML with Aspose.Words for Java, is not correct because the Word file generated by Aspose.PDF was not in the correct format. When we converted the Word file generated by Aspose.PDF into HTML using Aspose.Words for Java, the output HTML had multiple images because the tables inside the Word file were actually images. As a result, there were no proper table tags in the HTML.

An issue has been logged as PDFJAVA-40198 in our issue management system for further investigation of the above-mentioned case.

We also tried to convert the PDF into Word (DOCX) using Aspose.Words for .NET (because the .NET API is implemented behind the online Aspose.Words App). The output file was correctly formatted (as you also mentioned in your previous posts). We then converted this file into HTML with Aspose.Words for Java, and the output HTML contained valid table tags like <tr> and <td>.

Since you are using Java, we cannot suggest that you use Aspose.Words for .NET to convert PDF into DOCX, and Aspose.Words for Java does not support PDF to DOCX conversion at the moment. A feature request is already logged as WORDSJAVA-2366 in our issue tracking system and has been associated with this forum thread, so you will get a notification as soon as it is resolved.

Conclusion

Aspose.PDF is not generating the DOCX correctly. The tables are images in the output, which is why Aspose.Words is not able to generate correct HTML from it (PDFJAVA-40198).

We have also addressed this issue and logged it under the ticket ID PDFJAVA-40199 in our issue management system. We will further look into the details of all logged tickets and keep you posted with the status of their correction. Please be patient and spare us some time.

We apologize for the inconvenience.

You can certainly use the Aspose Java APIs as long as a supported JDK version is installed on the system.

anubha16 · February 20, 2021, 9:37pm

Thank you for all your help and investigation.

Is there an approximate timeline for when WORDSJAVA-2366 feature request would be available?

tahir.manzoor · February 21, 2021, 9:52am

@anubha16

Unfortunately, there is no ETA available for this feature at the moment. We will inform you via this forum thread once there is an update available on it.

anubha16 · February 22, 2021, 5:03pm

What is the difference between Aspose Words pdf -> docx vs Aspose PDF pdf -> docx?

For a sample document, when I use the online converter, Aspose Words removes some of the text while Aspose PDF doesn’t. Basically, for that same document I am getting better results using Aspose PDF than Aspose Words.

anubha16 · February 22, 2021, 8:23pm

Please see attached zip which contains GERSD1Aspose.pdf which I converted to GERSD1Aspose.docx using Convert Files Online - Word, PDF, HTML, JPG And Many More.

As you can also see in the attached screenshot, the data in the top part of the document is missing in the docx file. What could be the reason for that?

GERSD1Aspose.zip (420.5 KB)

Thank you again for all your help.

asad.ali · February 22, 2021, 9:58pm

@anubha16

We are checking the scenario from Aspose.Words perspective and will get back to you soon.

tahir.manzoor · February 23, 2021, 6:10am

@anubha16

You are using Aspose.Words for Java, and PDF to Word conversion is not supported by it. You are testing this case using Aspose.Words for .NET. We have logged this issue as WORDSNET-21875 in our issue tracking system, and you will be notified via this forum thread once it is resolved.

Aspose.PDF provides functionality to convert PDF to an MS Word document. However, the Word document generated by Aspose.PDF has an issue, which was logged by @asad.ali.

Please note that Aspose.Words mimics the behavior of MS Word. This means that when you convert a Word document to HTML using Aspose.Words and using MS Word, the output should be the same.

anubha16 · February 23, 2021, 5:17pm

The issue with GERSD1Aspose.zip above concerns a different file, which is missing data when we use the Aspose Words online converter.

We can use .NET instead of Java, but we want to ensure that Aspose Words pdf -> docx does not remove data. We are encountering that with the GERSD1Aspose.zip that I sent.

anubha16 · February 23, 2021, 11:12pm

Thank you again for your help. I am also having problems with the attached zip. FRASD6Aspose.zip (412.8 KB)

I am doing FRASD6Aspose.pdf -> FRASD6Aspose.docx -> FRASD6Aspose (html and images).
The image files that the html references get created as files with names like images\Aspose.Words.bc43573a-9630-46ae-8d25-70d459a25c16.041.png
However, in the html they are referenced with a forward slash, for example:

<img src="images/Aspose.Words.bc43573a-9630-46ae-8d25-70d459a25c16.008.png"

So when I try to render the html in the browser it looks like Screen Shot 2021-02-23 at 3.03.56 PM.png (107.7 KB)

I tried creating an images folder and copying the images there, and then the html worked in the browser.
I then tried replacing the original content in the html file, created a new zip file, and tried to use Aspose Words to create a new pdf file, but got errors.

Error 500: Invalid document model. Operation can not be completed.

Is this scenario valid for Aspose? This is one of my use cases.
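
The manual workaround above (creating an images folder and moving the files into it) can be automated. The following is a minimal sketch using only the Java standard library. It assumes the archive was extracted on a system where a backslash is a legal file-name character (for example macOS or Linux), so that each image ended up as a single file literally named `images\...png`. The class name `BackslashNameFixer` and the method `fix` are hypothetical and not part of any Aspose API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of any Aspose library.
class BackslashNameFixer {
    // Moves every regular file directly under root whose name contains a literal
    // backslash (e.g. "images\a.png") to the matching nested path ("images/a.png").
    // Returns the new locations of the moved files.
    static List<Path> fix(Path root) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        List<Path> candidates = new ArrayList<>();
        try (Stream<Path> entries = Files.list(base)) {
            entries.filter(Files::isRegularFile)
                   .filter(p -> p.getFileName().toString().contains("\\"))
                   .forEach(candidates::add);
        }
        List<Path> moved = new ArrayList<>();
        for (Path source : candidates) {
            String relative = source.getFileName().toString().replace('\\', '/');
            Path target = base.resolve(relative).normalize();
            if (!target.startsWith(base)) {
                continue; // skip names such as "..\x" that would leave the folder
            }
            Files.createDirectories(target.getParent());
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            moved.add(target);
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        // Pass the extraction folder as the first argument; defaults to the current folder.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : fix(root)) {
            System.out.println("moved to " + p);
        }
    }
}
```

After running it on the extracted folder, the relative `images/...` references in the html should resolve in the browser. It does not address the Error 500 that occurs when converting back to pdf.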

tahir.manzoor · February 24, 2021, 10:19am

@anubha16

We will inform you via this forum thread once this issue is resolved.

We have tested the scenario using the latest version of Aspose.Words for .NET 21.2 with the following code example. We did not encounter the issue you shared, so please use Aspose.Words for .NET 21.2.

// Requires: using Aspose.Words;
// Step 1: PDF -> DOCX
Document doc = new Document(@"FRASD6Aspose.pdf");
doc.Save(@"FRASD6Aspose.docx");

// Step 2: DOCX -> HTML
Document doc2 = new Document(@"FRASD6Aspose.docx");
doc2.Save(@"FRASD6Aspose.html");

anubha16 · February 24, 2021, 6:08pm

Thanks again for your help.

Regarding the "Error 500: Invalid document model. Operation can not be completed." issue with the file FRASD6Aspose.pdf: what happens if you make changes to the generated html and try to reconstruct it back to pdf? Do you face any issues using Aspose.Words for .NET 21.2?

Also, does the online tool for Aspose Words not use .NET 21.2 on the backend? Since I am a Java developer I will need to look into setting up .NET, but I wanted to get an idea of what actually works with .NET 21.2.

anubha16 · February 26, 2021, 10:31pm

Hello support team and thank you again for your assistance.

We found that for our use case, Aspose PDF (pdf to xlsx) -> Aspose Cells (xlsx to html) might work, as the structure of the data is what matters most to us.

However, I am getting an issue with the html where a sentence within a tag is split into individual characters. Attached are the zipped files (pdf, xlsx, html). It seems to happen where there is a checkbox in the original pdf, but I also got this issue in another document where no checkbox was present.

Please see SDAspose12cellsswidthscalable_files/sheet001.htm. The text is extracted character by character. Is there any way of avoiding that? I really need it to be one segment, not split over multiple characters.

SDAspose12cells.xlsx

A35 - Si le numéro de lot est indisponible ou inconnu, cocher une case ci-dessous

sheet001.htm (please see line number 498,499)

Le notificateu refus e de comuniquer le num ero de lot

Below is my code.

// Imports assumed from the class names used below
import com.aspose.cells.HtmlSaveOptions;
import com.aspose.cells.SaveFormat;
import com.aspose.cells.Workbook;
import com.aspose.pdf.Document;
import com.aspose.pdf.ExcelSaveOptions;

public static void convert() {
    // Load PDF document
    String dataDir = "./cells/SDASPOSE12/";
    Document pdfDocument = new Document(dataDir + "SDAspose12cells.pdf");

    // Instantiate ExcelSaveOptions object
    ExcelSaveOptions excelsave = new ExcelSaveOptions();
    excelsave.setFormat(ExcelSaveOptions.ExcelFormat.XLSX);

    // Save the output to XLSX format
    pdfDocument.save(dataDir + "SDAspose12cells.xlsx", excelsave);

    // Save xlsx to html
    HtmlSaveOptions save = new HtmlSaveOptions(SaveFormat.HTML);
    save.setWidthScalable(true);

    //save.setExportGridLines(true);
    try {
        Workbook book = new Workbook(dataDir + "SDAspose12cells.xlsx");
        book.save(dataDir + "SDAspose12cellsswidthscalable.html", save);
    } catch (Exception ex) {
        System.out.println(ex);
    }
}

Also, if I modify sheet001.htm and then want to recreate the pdf, how do I do that? I tried creating a new zip and using the online Aspose HTML converter, but that did not work.

amjad.sahi · February 27, 2021, 1:40pm

@anubha16,

Aspose.Cells follows MS Excel standards and specifications, so you may try to save the generated XLSX file (by Aspose.PDF) to HTML using MS Excel manually and evaluate if you notice the same thing/issue.

We could not find your attachments. Please zip the files in an archive and attach it. Also provide some screenshots to highlight the issue.

anubha16 · February 28, 2021, 10:35pm

SDAspose12.zip (207.2 KB)

Sorry, forgot to attach the zip

anubha16 · February 28, 2021, 10:46pm

Screen Shot 2021-02-28 at 2.43.52 PM.png (372.8 KB)

I have highlighted the text on the left and then the inspect element.
Why is there a font class for each letter in the sentence?

Is there a way of avoiding this?
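
Until the conversion itself produces whole segments, one possible post-processing step is to merge the adjacent per-character elements in the generated html. The following is a minimal sketch using only the Java standard library. It assumes markup of the form `<font class="fontN">x</font>` repeated back to back with the same class and plain-text content; that exact markup, the class name `FontRunMerger` and the method `merge` are assumptions for illustration, so the pattern may need adjusting to the real sheet001.htm.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any Aspose library.
class FontRunMerger {
    // Two directly adjacent <font> elements with the same class and text-only content.
    private static final Pattern ADJACENT = Pattern.compile(
            "<font class=\"([^\"]*)\">([^<]*)</font><font class=\"\\1\">([^<]*)</font>");

    // Repeatedly joins adjacent same-class runs until nothing changes.
    static String merge(String html) {
        String previous;
        do {
            previous = html;
            html = ADJACENT.matcher(html)
                    .replaceAll("<font class=\"$1\">$2$3</font>");
        } while (!html.equals(previous));
        return html;
    }

    public static void main(String[] args) {
        // Sample input made up for illustration.
        String html = "<td><font class=\"font5\">l</font><font class=\"font5\">o</font>"
                + "<font class=\"font5\">t</font></td>";
        System.out.println(merge(html));
    }
}
```

Elements with different classes, or with anything between them (including whitespace), are left untouched, so the visual formatting should stay the same.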

asad.ali · March 1, 2021, 9:37am

@anubha16

We are testing the scenario at our end and will share our feedback with you shortly.