Converting from PDF to DOCX doesn't produce the same result

AxiomaBo · February 17, 2022, 3:17pm

I need to convert a PDF to a DOCX using Aspose-PDF 22.1.
The results are visually similar but not identical.
OpenOffice opens the DOCX file, but every piece of content has a light rectangle over it.
Word 2016 does not open the file and reports that it is corrupted.
What’s wrong?
Tks

asad.ali · February 17, 2022, 8:51pm

@AxiomaBo

Could you please share the sample PDF document for our reference so that we can test the scenario in our environment and address it accordingly?

AxiomaBo · February 18, 2022, 8:33am

152431_SuperPreventivo_20220217160247.pdf (12.2 KB)
152431_SuperPreventivo_20220217160111.docx (64.3 KB)
Attached the original pdf and the converted docx.
Tks

asad.ali · February 18, 2022, 6:39pm

@AxiomaBo

We used the code snippet below with Aspose.PDF for .NET 22.2 to convert your PDF to DOCX and did not notice any issue in the output DOCX file. Can you please make sure that you are using this code snippet and the latest version of the API?

Document pdfDocument = new Document(dataDir + @"152431_SuperPreventivo_20220217160247.pdf");

DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.EnhancedFlow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;
pdfDocument.Save(dataDir + @"152431_SuperPreventivo_20220217160247.docx", saveOptions);

152431_SuperPreventivo_20220217160247.docx (185.6 KB)

AxiomaBo · February 19, 2022, 9:02am

Sorry, but I use the Java version (22.1), not the .NET one.
I didn’t mention it before.
The problem is still there.
Tks

asad.ali · February 19, 2022, 7:53pm

@AxiomaBo

Please check the attached output that we obtained using Aspose.PDF for Java 22.1 in our environment. The code snippet we used is below:

Document doc = new Document(dataDir + "152431_SuperPreventivo_20220217160247.pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
//saveOption.setAddReturnToLineEnd(false);

//saveOption.setCloseResponse(false);
saveOption.setRecognizeBullets(true);
doc.save(dataDir + "152431_SuperPreventivo_20220217160247.docx", saveOption);

152431_SuperPreventivo_20220217160247.docx (64.5 KB)

Can you please make sure that all Windows fonts are installed properly on your system? Also, are you using the API with a valid license? If the issue still persists, please share your complete environment details, such as OS name and version, application type, and JDK version.

AxiomaBo · February 21, 2022, 9:12am

Unfortunately it doesn’t work.
I opened your DOCX file and it looks correct, but in my environment the conversion doesn’t work.
public byte[] exportPrintConverted(JasperPrint xpPrint) throws JRException {
    byte[] result = JasperExportManager.exportReportToPdf(xpPrint);
    com.aspose.pdf.Document doc = new com.aspose.pdf.Document(result);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setRecognizeBullets(true);
    doc.save(output, saveOption);
    doc.close();
    return output.toByteArray();
}
This is my code: I create the PDF on the fly and then try to convert the byte stream, but I get the same error as before.
I’m using Windows 10 Pro build 19044.1526, updated to the latest patch set.
My JDK is Amazon Corretto 1.8.0_275.
At present I’m not using a license because I’m in a proof-of-concept phase.
Tks
Tullio

asad.ali · February 21, 2022, 5:46pm

@AxiomaBo

You don’t need to purchase a license while evaluating the API. You can use a 30-day free temporary license to test the API at its full capacity and without any restrictions. Please use a valid license and make sure that all Windows fonts are installed on the system. Please share the complete exception details if the issue still persists.

AxiomaBo · February 22, 2022, 9:47am

Unfortunately nothing changed.
I used a licence and the result was the same.
I don’t know how to verify whether all Windows fonts are installed. However, the error says (more or less, because I’m translating it back from Italian): It is not possible to open xxxxxx because a problem was found in the content.
Details: The file is corrupted and cannot be opened.
I use Word 2016.
Tks
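
On the question of how to verify installed fonts: one rough check from the Java side is to list the font families the JVM itself can see and compare them with the fonts the PDF is expected to use. The sketch below is only an illustration. The class name `InstalledFontCheck` and the font names in `main` are placeholders, not taken from this thread, and Aspose.PDF may locate fonts differently from the JVM, so a clean result here does not rule out a font problem.

```java
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of Aspose.PDF): reports font families the JVM cannot see. */
final class InstalledFontCheck {

    /** Returns the required family names that are absent from the installed list (case-insensitive). */
    static List<String> missingFamilies(Iterable<String> installed, List<String> required) {
        Set<String> known = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (String family : installed) {
            known.add(family);
        }
        List<String> missing = new ArrayList<>();
        for (String family : required) {
            if (!known.contains(family)) {
                missing.add(family);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Font families visible to this JVM.
        String[] families = GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        // Placeholder list: replace with the fonts your PDF actually uses.
        List<String> required = Arrays.asList("Arial", "Times New Roman");
        System.out.println(families.length + " font families visible to the JVM");
        System.out.println("Missing: " + missingFamilies(Arrays.asList(families), required));
    }
}
```

Running `main` on the machine that performs the conversion shows which of the listed families the JVM does not find.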

asad.ali · February 22, 2022, 6:14pm

@AxiomaBo

It seems strange as we were still unable to reproduce the issue. Can you please try to use the file from a path instead of a byte array or stream? Like this:

Document doc = new Document(dataDir + "152431_SuperPreventivo_20220217160247.pdf");

It is also possible that the document supplied as a stream is not a valid PDF document, in which case the API cannot convert it into a valid Word document.
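
A quick way to test that hypothesis before involving the converter is to look at the raw bytes. The sketch below uses only the JDK and checks two things a PDF file normally has: a `%PDF-` header at the start and a `%%EOF` marker near the end. The class name `PdfBytesCheck` and the 1024-byte tail window are assumptions made for this illustration. Passing the check does not prove the PDF is valid; failing it strongly suggests the byte array is not a complete PDF.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Aspose.PDF): coarse sanity check on PDF bytes. */
final class PdfBytesCheck {

    /** True if the data starts with the "%PDF-" header. */
    static boolean hasPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if "%%EOF" appears within the last 1024 bytes (window size is an assumption). */
    static boolean hasEofMarker(byte[] data) {
        if (data == null) {
            return false;
        }
        int start = Math.max(0, data.length - 1024);
        String tail = new String(data, start, data.length - start, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    /** Both checks combined. */
    static boolean looksLikePdf(byte[] data) {
        return hasPdfHeader(data) && hasEofMarker(data);
    }

    public static void main(String[] args) {
        // Tiny stand-in for real PDF bytes, just to exercise the checks.
        byte[] sample = "%PDF-1.4\n...\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("looks like a PDF: " + looksLikePdf(sample));
    }
}
```

In the code posted earlier, the check could be applied to the `result` array returned by `JasperExportManager.exportReportToPdf` before it is handed to the `Document` constructor.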

AxiomaBo · February 23, 2022, 11:25am

Again nothing changed.
I took the byte[] and wrote it to a PDF file.
Then I used the instruction you suggested to read the document and saved the DOCX file, but the error is still there.
I have uploaded the generated PDF and the converted file.
It seems to be a format problem in the DOCX file, because:
. using LibreOffice I’m able to open it
. using Word 2016 it refuses to open.
Tks
152431_SuperPreventivo_20220223121802.docx (64.5 KB)
Prova.pdf (12.2 KB)
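
Since LibreOffice opens the file and Word does not, it can help to inspect the DOCX package directly. A DOCX file is a ZIP archive of XML parts, so a first pass with JDK classes alone can confirm that the archive is readable, that the usual parts are present, and that every XML part is well-formed. The sketch below is an illustration under stated assumptions: the class name `DocxPackageCheck` is made up, and the expected part names are the conventional ones (`word/document.xml` is the usual location of the main document part, not a guaranteed one). Word validates far more than this, so a clean result narrows the search without clearing the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of Aspose.PDF): coarse structural check of a DOCX package. */
final class DocxPackageCheck {

    // Conventional part names; an assumption, see the note above the code.
    static final List<String> EXPECTED_PARTS =
            Arrays.asList("[Content_Types].xml", "_rels/.rels", "word/document.xml");

    /** Returns human-readable problems; an empty list means none were found by this check. */
    static List<String> findProblems(byte[] docx) throws Exception {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);

        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                seen.add(name);
                if (name.endsWith(".xml") || name.endsWith(".rels")) {
                    byte[] content = readAll(zip);
                    try {
                        // Parsing succeeds only if the part is well-formed XML.
                        factory.newDocumentBuilder().parse(new ByteArrayInputStream(content));
                    } catch (SAXException e) {
                        problems.add("XML part failed to parse: " + name);
                    }
                }
            }
        }
        for (String part : EXPECTED_PARTS) {
            if (!seen.contains(part)) {
                problems.add("missing part: " + part);
            }
        }
        return problems;
    }

    /** Reads the current ZIP entry fully (written out by hand to stay Java 8 compatible). */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxPackageCheck <file.docx>");
            return;
        }
        List<String> problems = findProblems(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(problems.isEmpty()
                ? "no structural problems found"
                : String.join("\n", problems));
    }
}
```

Running it against both the locally generated DOCX and the one attached by support would show whether the two files differ at the package level.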

asad.ali · February 23, 2022, 7:19pm

@AxiomaBo

We have logged an investigation ticket as PDFJAVA-41360 in our issue tracking system along with the details of environment that you have provided. We will further look into its details and try to analyze the reason behind this behavior of the API in your specific environment. We will let you know once the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

AxiomaBo · October 19, 2022, 4:17pm

Any news about this?

asad.ali · October 19, 2022, 8:07pm

@AxiomaBo

We are afraid that the earlier logged ticket has not yet been resolved due to other issues in the queue. We will resolve it on a first-come, first-served basis, and as soon as we make significant progress towards its resolution, we will update you via this forum thread.

We apologize for the inconvenience.