Carriage return line feed after each line in PDF

Hi Aspose Team!

I’m still seeing this behaviour with the latest version from NuGet as of today (Aspose.Words 19.1.0.0).
So it seems that WORDSNET-4887 has not been resolved.
Am I missing some option or setting?

Thanks

@fankhauser.kufgem.at,

Please ZIP and attach your input Word document and the Aspose.Words generated output PDF file showing the undesired behaviour here for testing. We will then investigate the issue on our end and provide you with more information.

Here is the zip with both files.
LoremIpsum.zip (51.6 KB)

@fankhauser.kufgem.at,

We have converted your ‘LoremIpsum.docx’ document to PDF using Aspose.Words for .NET 19.1 and MS Word 2019 and attached the results here for your reference (see 19.1.pdf (35.7 KB) and msw-2019.pdf (65.4 KB)). In this case, the latest version of Aspose.Words mimics the way Microsoft Word works. Please upgrade to the latest version and see how it goes on your end. Hope this helps.

If the problem still remains, please also provide a comparison screenshot highlighting the problematic area(s) in the above 19.1.pdf document with respect to msw-2019.pdf and attach it here for our reference. Thanks for your cooperation.

Thank you for your response.
I’m already using Aspose.Words version 19.1 and MS Office Word 365 (current version).

I have now figured out that I also get this behaviour with your PDF document in some circumstances:
The embedded PDF preview in my browser (Chrome) is able to copy the text without the CR+LFs.
But if I open the PDF document with Adobe Acrobat DC version 2019.010.20069 (or earlier), the CR+LFs are present!
This behaviour does not occur if I use MS Office Word to create the PDF document. In that case Adobe Acrobat Reader is also able to provide the copied text correctly, without the carriage returns.

Thanks in advance

(I have attached my files created by Word and Aspose)
LoremIpsum.zip (116.7 KB)

@fankhauser.kufgem.at,

You can set the PdfSaveOptions.ExportDocumentStructure flag when saving to PDF. In this case, text copied from PDFs produced by Aspose.Words should not have the line breaks you are having trouble with.

Document doc = new Document("E:\\LoremIpsum (1)\\LoremIpsum.docx");

PdfSaveOptions opts = new PdfSaveOptions();
// Export the document structure so that the PDF carries paragraph information.
opts.ExportDocumentStructure = true;

doc.Save("E:\\LoremIpsum (1)\\19.1-ExportDocumentStructure.pdf", opts);

To verify, please open 19.1-ExportDocumentStructure.pdf (43.0 KB) with Adobe Acrobat, then copy all the text and paste it into Notepad. Hope this helps.

Thank you, that partly solved the structure problem.

If I enable the “ExportDocumentStructure” option and copy part of the text from Acrobat back into an MS Word document, the font of the pasted text changes to a different one partway through. I figured out that this applies only to the part of the text between the last selected “real” carriage return and the carriage return preceding it.
It looks as if it picks up the font that is defined at the very end of the document text, a font that is never actually used for visible characters in this document. This does not happen if ExportDocumentStructure is set to false.
But it also happens with ExportDocumentStructure set to false if you use Ctrl+A and paste the text back into MS Word.

Enabling the structure export also inserts an invisible carriage return at the beginning of the document.
To reproduce, copy the first few characters (or lines) of the PDF in Acrobat and paste them into an MS Word document. There you can see that the text now starts with a carriage return that was never there.

I have attached samples

Thanks

samples.zip (493.3 KB)

@fankhauser.kufgem.at,

Thanks for the additional information. We tested the scenario and managed to reproduce the same problem on our end. We have logged this problem in our issue tracking system with the ID WORDSNET-18105. We will look further into the details of this problem and keep you updated on the status of the fix. We apologize for the inconvenience.

@fankhauser.kufgem.at,

Upon further investigation, we have found that these issues are related to peculiarities of Adobe Acrobat's text extraction. The Aspose.Words output is correct, i.e. there is no explicit line break in the output text and the “Courier New” font is used for the main body text. The same issues can also be observed when extracting text from PDF output generated by MS Word. We do not think we can do anything about it.

Adobe Acrobat also provides a “Copy with Formatting” option. You may use that option to avoid these issues.

@fankhauser.kufgem.at,

Regarding WORDSNET-18105, we have completed the work on your issue and decided to close it as ‘Not a Bug’. Please see my previous post for the analysis details.