Issues while migrating from 17.2.1 Java PDF to 17.9

Attachments.zip (562.5 KB)
At present we are using the Aspose.PDF for Java 17.2.1 jar, and we are planning to move to the latest version, 17.9. While regression testing with the 17.9 jar we found some differences that cause functional issues in our product. Can you please look into them and provide your inputs?

Attachment Details:

  • Input PDF Document: 2006 OH App Ct Briefs LEXIS 133.pdf
  • Word Document converted from “PDF” using ASPOSE Java 17.2.1 : 2006 OH App Ct Briefs LEXIS 133-CP_PDF_17.2.1.docx
  • Word Document converted from “PDF” using ASPOSE Java 17.9 : 2006 OH App Ct Briefs LEXIS 133-PDF_17.9.0.docx

Issue 1: Word doc converted using latest 17.9 is different from 17.2.1
Please refer to the attached “Issue1_ExampleOutput_WordDocDifference.docx” document, which shows an example difference. Can you please take a look and explain why this difference occurs with the latest 17.9 jar?

Issue 2: Run object differences in the Word documents converted with 17.2.1 vs 17.9. The latest 17.9 splits the text into more runs.
For example, for the following text in the PDF document:
Telephone: (216) 621-1500 Facsimile: (216) 621-1551 E- mail: rdkehoe@kehoelaw.net E- mail: ibkenneyakehoelaw.net

we get different runs: the email address text “rdkehoe@kehoelaw.net” is split into two runs in 17.9.

Sysout of the runs while processing the Word document “2006 OH App Ct Briefs LEXIS 133-CP_PDF_17.2.1.docx”, converted using the 17.2.1 PDF jar:

    run.getNodeType()=Run text=Telephone:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=(216)
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=621-15
    run.getNodeType()=Run text=00 
    run.getNodeType()=Run text=Facsim
    run.getNodeType()=Run text=ile:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=(216)
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=621-155
    run.getNodeType()=Run text=1
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=E- 
    run.getNodeType()=Run text=mail:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=rdkehoe@kehoelaw.net
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=E- 
    run.getNodeType()=Run text=mail:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=ibkenneyakehoelaw.net

Sysout of the runs while processing the Word document “2006 OH App Ct Briefs LEXIS 133-PDF_17.9.0.docx”, converted using the 17.9 PDF jar:

    run.getNodeType()=Run text=Telephone:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=(216)
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=621-15
    run.getNodeType()=Run text=00 
    run.getNodeType()=Run text=Facsim
    run.getNodeType()=Run text=ile:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=(216)
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=621-155
    run.getNodeType()=Run text=1
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=E- 
    run.getNodeType()=Run text=mail:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=rdke
    run.getNodeType()=Run text=hoe@kehoelaw.net
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=E- 
    run.getNodeType()=Run text=mail:
    run.getNodeType()=Run text= 
    run.getNodeType()=Run text=ibkenneyakehoelaw.net

Note: In the Word document converted using the 17.9 PDF jar, the text “rdkehoe@kehoelaw.net” is split into two runs. Can you please explain the reason for this?

Issue 3: Line spacing differences between the Word documents converted using PDF 17.2.1 and 17.9
Please refer to the attached Issue3 document, which shows the exact differences. Please also refer to the following input and converted documents:
Attachment Details:

  • Input PDF Document: CiteSpreadToNextPage.pdf
  • Word Document converted from “PDF” using ASPOSE Java 17.2.1 : CiteSpreadToNextPage-CP_PDF_17.2.1.docx
  • Word Document converted from “PDF” using ASPOSE Java 17.9 : CiteSpreadToNextPage-PDF_17.9.0.docx

The following code snippet is used to convert PDF to Word with both the 17.2.1 and 17.9 Aspose.PDF for Java jars.

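  import com.aspose.pdf.DocSaveOptions;

  // inStream and outStream are the input PDF and output DOCX streams, opened elsewhere
  com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(inStream);
  // Instantiate DocSaveOptions and use the Flow recognition mode
  DocSaveOptions saveOptions = new DocSaveOptions();
  saveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);
  // Maximum distance between text lines
  saveOptions.setMaxDistanceBetweenTextLines(3.5f);

  // Set output file format as DOCX
  saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
  // Do not add a return at the end of each text line
  saveOptions.setAddReturnToLineEnd(false);
  // Save resultant DOCX file and release the document
  pdfDocument.save(outStream, saveOptions);
  pdfDocument.close();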

@Kusumanchi.Rajesh,

We are working on your query and will get back to you soon.

@Kusumanchi.Rajesh,

This is not a bug: the latest version, 17.9, of the Aspose.Pdf for Java API renders the text in the same layout as the input PDF document. It may be a bug in the old version 17.2.1.

A text phrase can be saved across multiple run nodes; the main goal is that the output looks the same as the input PDF document. In this scenario, the view is the same in both Word documents generated by versions 17.2.1 and 17.9. However, that output view does not match the source PDF document. Here is a snapshot: LineBreaksIssue.png (32.5 KB)

This mismatch has been logged under the ticket ID PDFJAVA-37168 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.
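
Since a phrase may be split across several run nodes, code that depends on the text may be more robust if it works on the concatenated run text rather than on individual runs. The following is only a minimal sketch under that assumption: it uses plain Java strings copied from the run dumps above in place of real Run objects, and the class and method names (RunTextCompare, joinRuns, runsCovering) are hypothetical and not part of any Aspose API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunTextCompare {

    // Concatenate the text of consecutive runs into one string.
    static String joinRuns(List<String> runTexts) {
        StringBuilder sb = new StringBuilder();
        for (String text : runTexts) {
            sb.append(text);
        }
        return sb.toString();
    }

    // Return the indices of the non-empty runs that the first occurrence of a
    // phrase spans, or an empty list if the phrase is empty or not found.
    static List<Integer> runsCovering(List<String> runTexts, String phrase) {
        List<Integer> indices = new ArrayList<>();
        int start = joinRuns(runTexts).indexOf(phrase);
        if (start < 0 || phrase.isEmpty()) {
            return indices;
        }
        int end = start + phrase.length(); // exclusive end offset
        int offset = 0;
        for (int i = 0; i < runTexts.size(); i++) {
            int length = runTexts.get(i).length();
            int runEnd = offset + length;
            // A run is covered when its character range overlaps [start, end).
            if (length > 0 && runEnd > start && offset < end) {
                indices.add(i);
            }
            offset = runEnd;
        }
        return indices;
    }

    public static void main(String[] args) {
        // Tail of the run dumps for the 17.2.1 and 17.9 outputs.
        List<String> runs1721 = Arrays.asList("E- ", "mail:", " ", "rdkehoe@kehoelaw.net");
        List<String> runs179 = Arrays.asList("E- ", "mail:", " ", "rdke", "hoe@kehoelaw.net");

        // The joined text is identical even though the run boundaries differ.
        System.out.println(joinRuns(runs1721).equals(joinRuns(runs179))); // true

        // The same phrase maps to one run in 17.2.1 and to two runs in 17.9.
        System.out.println(runsCovering(runs1721, "rdkehoe@kehoelaw.net")); // [3]
        System.out.println(runsCovering(runs179, "rdkehoe@kehoelaw.net"));  // [3, 4]
    }
}
```

With real documents, the list of strings would be filled from the text of each run in a paragraph, in document order.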

When you view the last three text lines in a PDF viewer, the line spacing is equal. In the output Word document generated by version 17.2.1, the line spacing is not equal, whereas it is equal in the output Word document generated by version 17.9. However, the horizontal line is misplaced between the last two text lines. This problem has been logged under the ticket ID PDFJAVA-37169 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.