Aspose.PDF and Aspose.Words functionality questions

Hey,

We have a translation service (using our in-house translation capabilities) that supports basic office documents using OpenXML. We would like to support more document types, such as PDF, by using Aspose. I ran into a few issues trying to implement it:

1. My first attempt is to use the find() and replace() function in Aspose.PDF. The result is lacking since there is no way to read the properties (font type, font size) of the text. Is there a way to navigate to the DOM itself so that I can manipulate the text? Or is there a way for me to "read" the font type and font size of the text that I am replacing?

2. My second attempt is to use Aspose.PDF to convert the pdf file to docx, run it through our existing translation code, convert the translated docx back to a pdf, and then serve it to the users. The problem is that the docx file produced by Aspose.PDF doesn't seem to be complete: it is missing structure information.

I converted the docx file a second time (pdf -> docx -> docx) using Aspose.Words, and now I can run the file through our translation code without a hitch. But when I save the translated docx file back to pdf, the output is nothing (not even a blank page, just nothing).

I have a feeling that Aspose.PDF actually only works with doc, not docx. So I converted the translated docx to doc using Aspose.Words, and then save it back to pdf using Aspose.PDF. That seems to do the trick. However, the layout of the docx file is being disregarded, and the content will default to the top left hand corner. I am still trying to figure out a way to fix it.

To summarize: I have to convert a pdf to docx (using Aspose.PDF), then to a "complete" docx (using Aspose.Words) in order to do my business logic. And then convert the translated docx to doc (using Aspose.Words), and then back to pdf (using Aspose.PDF).

This just seems like a really round-about way. Is there a better or correct way to convert documents? Or does Aspose.PDF only convert from and to doc files?

I want to make this work, so let me know if you have any suggestions or alternatives. I have attached the input, PDFtoDOCX (using Aspose.PDF), DOCXtoDOCX (using Aspose.Words), and output files for reference (DocumentSamples.zip). I am using Aspose.PDF 8.3.1 and Aspose.Words 13.7.

Thanks.

Hi,


Thanks for your interest in Aspose products. I will answer your questions related to Aspose.Words.

In case Aspose.Words encounters a problem that can be resolved upon loading a document, it recovers that document silently. In your case, during loading your “2_PDFtoDOCX.docx” document in memory by using the latest version of Aspose.Words 13.8.0 and then converting/rendering it to DOCX/PDF formats, I have observed that the latest version of Aspose.Words correctly mimics the behaviour of Microsoft Word 2013. To confirm the correctness of Aspose.Words 13.8.0, I have attached the following three files here for your reference.

  1. out-aspose.words-13.8.0.docx: It was generated out from “2_PDFtoDOCX.docx” using Aspose.Words 13.8.0
  2. out-ms-word-2013.docx: It was generated out from “2_PDFtoDOCX.docx” using Microsoft Word 2013. The layout of document elements is the same as can be seen in Aspose.Words’ generated output.
  3. out-aspose.words-13.8.0.pdf: It was generated out from “out-aspose.words-13.8.0.docx” using Aspose.Words 13.8.0

Put simply, to fix the problems introduced during the bad-DOCX-to-good-DOCX conversion and the subsequent good-DOCX-to-PDF rendering, please upgrade to the latest Aspose.Words version. I hope this helps.

Regarding your Aspose.Pdf related query, my colleagues from Aspose.Pdf component team will answer you shortly.

Best regards,

Thank you for the quick response.


I updated the reference to the latest Aspose.Words dll (13.8.0.0), and the resulting PDF file is still offset to the top. I have attached the following files in case you can take a look:

code_snippet.cs - A snippet of how I generate the files below. I think this should be pretty similar to what you have done.
1_pdftodocx.docx - pdf to docx with Aspose.PDF
2_docxtodocx.docx - docx to docx with Aspose.Words
3_docxtopdf.pdf - docx to pdf with Aspose.PDF

Thanks, and hopefully the Aspose.PDF guys would chime in too.

Hi there,


Thanks for your inquiry. You can find and replace any text in a PDF document using Aspose.Pdf.
The TextFragmentAbsorber object lets you find text matching a particular phrase in a PDF document. After you pass it to the Accept method of the Pages collection, it exposes a collection of TextFragment objects; you can loop through the fragments and read properties such as Text, Position, FontName and FontSize. You can also set or change the value of any of these properties. Please check the following documentation link; hopefully it will serve the purpose.
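
As a rough illustration of the approach described above, the sketch below searches for a phrase, prints each match's font name, font size and position, and then replaces the text. The file names and the search/replacement strings are placeholders, and the property names follow the Aspose.Pdf.Text API as documented for recent Aspose.Pdf for .NET releases, so they may differ in older versions.

```csharp
using System;
using Aspose.Pdf.Text;

class FindAndReplaceSketch
{
    static void Main()
    {
        // Placeholder path and phrase: substitute your own.
        var pdf = new Aspose.Pdf.Document("input.pdf");
        var absorber = new TextFragmentAbsorber("text to find");

        // Run the search over every page of the document.
        pdf.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Read the formatting of the match before touching it.
            Console.WriteLine(
                "\"{0}\" | font={1} size={2} | x={3} y={4}",
                fragment.Text,
                fragment.TextState.Font.FontName,
                fragment.TextState.FontSize,
                fragment.Position.XIndent,
                fragment.Position.YIndent);

            // Replace the matched text (placeholder replacement string).
            fragment.Text = "replacement text";
        }

        pdf.Save("output.pdf");
    }
}
```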


Please feel free to contact us for any further assistance.

Best Regards,

Thank you for the suggestion. I am new to Aspose and completely missed the TextFragment object. I have these follow-up questions:

1. If there are multiple text properties (different font type, size, etc), the TextFragment can only represent the first text property of the object. The other text properties are "lost". Correct?

2. This is regarding performance. Do you guys recommend text modification (such as translating a document) through using search and replace within Aspose.PDF? If I want to go with converting the document from pdf to docx, would there be a big performance hit?

3. This is from my original post. When you convert a pdf to docx with Aspose.PDF, it seems to be missing some of the properties. Is it necessary to do a double conversion to get a "true" docx file?

You can refer to the attached files "2_PDFtoDOCX.docx" and "3_DOCXtoDOCX.docx" (DocumentSamples.zip) from the original post.

Thank you.


Hi Ka Weng,

Thanks for the additional information. While using the latest version of Aspose.Words i.e. 13.8.0, I managed to reproduce this issue on my side. I have logged this issue in our bug tracking system. The ID of this issue is WORDSNET-8883. Your request has also been linked to this issue and you will be notified as soon as it is resolved. Sorry for the inconvenience.

Best regards,
Hi there,

Thanks for your feedback.

kawenglou:

1. If there are multiple text properties (different font type, size, etc), the TextFragment can only represent the first text property of the object. The other text properties are "lost". Correct?

Yes, you are correct: if you search for a phrase, it returns the text properties of the first word of the phrase. You can search for an individual word instead of a phrase.
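
One way to search word by word, sketched here under assumptions, is to give the absorber a regular expression that matches a single word and enable regex matching through TextSearchOptions. The pattern `\w+` and the file name are illustrative choices, not something prescribed in this thread, and regex support may vary between Aspose.Pdf versions.

```csharp
using System;
using Aspose.Pdf.Text;

class WordLevelSearchSketch
{
    static void Main()
    {
        // Placeholder path: substitute your own.
        var pdf = new Aspose.Pdf.Document("input.pdf");

        // Assumption: \w+ is a good enough definition of "one word" here.
        var absorber = new TextFragmentAbsorber(@"\w+");
        // true = interpret the search phrase as a regular expression.
        absorber.TextSearchOptions = new TextSearchOptions(true);

        pdf.Pages.Accept(absorber);

        // Each fragment is now a single word, so its TextState
        // describes that word rather than the start of a longer phrase.
        foreach (TextFragment word in absorber.TextFragments)
        {
            Console.WriteLine(
                "{0}: {1}, {2}",
                word.Text,
                word.TextState.Font.FontName,
                word.TextState.FontSize);
        }
    }
}
```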

kawenglou:

2. This is regarding performance. Do you guys recommend text modification (such as translating a document) through using search and replace within Aspose.PDF? If I want to go with converting the document from pdf to docx, would there be a big performance hit?

Hopefully there wouldn't be any performance issue. However the processing time will depend upon the PDF document contents/size and system resources.

kawenglou:

3. This is from my original post. When you convert a pdf to docx with Aspose.PDF, it seems to be missing some of the properties. Is it necessary to do a double conversion to get a "true" docx file?

I've noticed that the underline and strikethrough lines are displaced. I've logged an investigation ticket for the issue as PDFNEWNET-35781 in our issue tracking system. We will update you as soon as it's resolved.

However, you can use the Textbox value of RecognitionMode to preserve the original look of the PDF, but the editability of the resulting document could be limited.

// ...
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Mode = DocSaveOptions.RecognitionMode.Textbox;
// ...

Best Regards,

Thank you. Hopefully this issue will be resolved soon.
tilal.ahmad:
Hi there,

Thanks for your feedback.

Yes, you are correct: if you search for a phrase, it returns the text properties of the first word of the phrase. You can search for an individual word instead of a phrase.

Understood. One concern I have with performing a word-for-word search and replace is that the last instance will overwrite the properties of all other instances, unless you keep track of the structure yourself.

That leads me to the following question: will Aspose allow users to traverse the file structure of a pdf file? Or is this outside of the project scope?

tilal.ahmad:
Hopefully there wouldn't be any performance issue. However the processing time will depend upon the PDF document contents/size and system resources.

Got it.

tilal.ahmad:
I've noticed that the underline and strikethrough lines are displaced. I've logged an investigation ticket for the issue as PDFNEWNET-35781 in our issue tracking system. We will update you as soon as it's resolved.

I would like to clarify: I am talking about the actual structure of the docx files.

In the attached screenshots, I am comparing the file structure of a PDFtoDOCX file converted by Aspose.PDF and a DOCXtoDOCX file converted by Aspose.Words. The PDFtoDOCX file is missing some properties and settings xml files. I need these properties and settings files in order to generate a docx in OpenXML.
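
A quick way to make this comparison repeatable is to list which package parts a docx contains, since a docx is a ZIP archive. The sketch below uses only the .NET base class library (on .NET Framework it needs references to System.IO.Compression and System.IO.Compression.FileSystem). The list of part names is an assumption about which properties and settings files matter; adjust it to whatever your OpenXML code requires.

```csharp
using System;
using System.IO.Compression;
using System.Linq;

class DocxPartCheck
{
    // Assumed set of parts to look for; not taken from the attached screenshots.
    static readonly string[] ExpectedParts =
    {
        "docProps/app.xml",
        "docProps/core.xml",
        "word/settings.xml"
    };

    static void Main(string[] args)
    {
        // args[0]: path to the .docx file to inspect.
        using (ZipArchive package = ZipFile.OpenRead(args[0]))
        {
            // Collect the full names of all entries in the package.
            var names = package.Entries.Select(e => e.FullName).ToList();

            foreach (string part in ExpectedParts)
            {
                bool present = names.Contains(part, StringComparer.OrdinalIgnoreCase);
                Console.WriteLine("{0}: {1}", part, present ? "present" : "missing");
            }
        }
    }
}
```

Running it against both the PDFtoDOCX and the DOCXtoDOCX files should show which parts only the Aspose.Words output contains.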

Looking at the documentation, the Save() method allows the SaveOptions object and the ContentDisposition enum. But they don't seem to change the structure of the file itself.

Is this the intended behavior? Or am I missing a SaveOptions/SaveFormat somewhere?
awais.hafeez:
Hi,

Thanks for your interest in Aspose products. I will answer your questions related to Aspose.Words.

In case Aspose.Words encounters a problem that can be resolved upon loading a document, it recovers that document silently. In your case, during loading your "2_PDFtoDOCX.docx" document in memory by using the latest version of Aspose.Words 13.8.0 and then converting/rendering it to DOCX/PDF formats, I have observed that the latest version of Aspose.Words correctly mimics the behaviour of Microsoft Word 2013. To confirm the correctness of Aspose.Words 13.8.0, I have attached the following three files here for your reference.

  1. out-aspose.words-13.8.0.docx: It was generated out from "2_PDFtoDOCX.docx" using Aspose.Words 13.8.0
  2. out-ms-word-2013.docx: It was generated out from "2_PDFtoDOCX.docx" using Microsoft Word 2013. The layout of document elements is the same as can be seen in Aspose.Words' generated output.
  3. out-aspose.words-13.8.0.pdf: It was generated out from "out-aspose.words-13.8.0.docx" using Aspose.Words 13.8.0

Put simply, to fix the problems introduced during the bad-DOCX-to-good-DOCX conversion and the subsequent good-DOCX-to-PDF rendering, please upgrade to the latest Aspose.Words version. I hope this helps.

Regarding your Aspose.Pdf related query, my colleagues from Aspose.Pdf component team will answer you shortly.

Best regards,

Sorry about quoting a previous post. But in my experience with both Aspose.Words 13.7.9 and 13.8.0, the resulting PDF is always offset to the top.

So I just want to know how you did this. I would like to replicate it for the time being until the bug is fixed.

Thanks.
Hi there,

Thanks for your feedback.

kawenglou:

Understood. One concern I have with performing a word-for-word search and replace is that the last instance will overwrite the properties of all other instances, unless you keep track of the structure yourself.

That leads me to the following question: will Aspose allow users to traverse the file structure of a pdf file? Or is this outside of the project scope?


I'm afraid traversing the PDF file structure is not supported at the moment. However, I've logged a feature request for it as PDFNEWNET-35795 in our issue tracking system.


kawenglou:

I would like to clarify: I am talking about the actual structure of the docx files.

In the attached screenshots, I am comparing the file structure of a PDFtoDOCX file converted by Aspose.PDF and a DOCXtoDOCX file converted by Aspose.Words. The PDFtoDOCX file is missing some properties and settings xml files. I need these properties and settings files in order to generate a docx in OpenXML.

Looking at the documentation, the Save() method allows the SaveOptions object and the ContentDisposition enum. But they don't seem to change the structure of the file itself.

Is this the intended behavior? Or am I missing a SaveOptions/SaveFormat somewhere?


Thanks for sharing the additional information. I've logged your comparison details as PDFNEWNET-35796 for enhancement of the PDF to DOC/DOCX conversion. Currently there are no other settings/properties to overcome the highlighted structure difference.

Best Regards,

Hi Ka Weng,


Thanks for your inquiry. Considering the documents you attached to this post, please try running the following code to generate a good resultant PDF file.

// Load the bad docx into the Aspose.Words DOM.
Document badDocx = new Document(@"C:\DocumentSamples\2_PDFtoDOCX.docx");

// Re-save with Aspose.Words to generate a good docx.
badDocx.Save(@"C:\DocumentSamples\out-aspose.words-13.8.0.docx");

// Now load the good docx into the Aspose.Words DOM.
Document goodDocx = new Document(@"C:\DocumentSamples\out-aspose.words-13.8.0.docx");

// Re-save with Aspose.Words to generate the resulting pdf.
goodDocx.Save(@"C:\DocumentSamples\out-aspose.words-13.8.0.pdf");


Best regards,
tilal.ahmad:

I'm afraid traversing the PDF file structure is not supported at the moment. However, I've logged a feature request for it as PDFNEWNET-35795 in our issue tracking system.


Thanks.

tilal.ahmad:

Thanks for sharing the additional information. I've logged your comparison details as PDFNEWNET-35796 for enhancement of the PDF to DOC/DOCX conversion. Currently there are no other settings/properties to overcome the highlighted structure difference.


Correct me if I am wrong, but aren't the document settings/properties (docProps, etc) part of the OOXML standard? I am only bringing this up because it seems unnecessary to have to use Aspose.Words to convert a "bad" docx to a "good" docx a second time in order to make the docx file standard compliant.

Anyway, I will wait for the fix. Thank you.
I can confirm that 2_PDFtoDOCX.docx works using both your code (saving via physical files) and my code (saving via stream). However, none of the other docx files work.
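
For readers comparing the two approaches, a stream-based version of the same round trip might look roughly like the sketch below. This is not the code from code_snippet.cs, which is not reproduced in this thread; it is a minimal reconstruction under the assumption that the input is read from a file and the intermediate docx is kept in memory. File names are placeholders.

```csharp
using System.IO;
using Aspose.Words;

class StreamRoundTripSketch
{
    static void Main()
    {
        // Placeholder input path: the docx produced by Aspose.PDF.
        using (FileStream input = File.OpenRead("2_PDFtoDOCX.docx"))
        using (var cleanDocx = new MemoryStream())
        {
            // Load the docx and re-save it so Aspose.Words rewrites the package.
            var badDocx = new Document(input);
            badDocx.Save(cleanDocx, SaveFormat.Docx);

            // Rewind before reading the re-saved docx back in.
            cleanDocx.Position = 0;
            var goodDocx = new Document(cleanDocx);

            // Render the re-saved docx to PDF (placeholder output path).
            using (FileStream output = File.Create("out.pdf"))
            {
                goodDocx.Save(output, SaveFormat.Pdf);
            }
        }
    }
}
```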

The 2_PDFtoDOCX.docx also looks different from all the other converted docx files that I have, so I must have done something different there. So far I have not been able to reproduce this specific version of the docx. The file is produced by Aspose.PDF, so there aren't many SaveOptions available.

Will keep you posted if I find anything. Thanks Awais.

Update:

I am able to reproduce the 2_PDFtoDOCX.docx now. It seems that saving with Aspose.Pdf.DocSaveOptions produces a more compatible docx than simply saving with Aspose.Pdf.SaveFormat.
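
The two save paths being compared can be sketched side by side as follows. The file names are placeholders, and which options were actually set in the original test is not stated here, so the DocSaveOptions variant sets only the output format.

```csharp
using Aspose.Pdf;

class PdfToDocxSketch
{
    static void Main()
    {
        // Placeholder input path.
        var pdf = new Document("input.pdf");

        // Variant 1: save by naming only the target format.
        pdf.Save("by-format.docx", SaveFormat.DocX);

        // Variant 2: save through DocSaveOptions with DOCX selected explicitly.
        var options = new DocSaveOptions();
        options.Format = DocSaveOptions.DocFormat.DocX;
        pdf.Save("by-options.docx", options);
    }
}
```

Unzipping both outputs and comparing their parts is one way to check whether the difference described above shows up for a given input.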

Hi Ka Weng,


Thanks for the additional information. Rest assured, we will keep you informed of any developments and let you know once your issues are resolved.

Best regards,

Thank you for the help, and will definitely keep this bookmarked.

The issues you have found earlier (filed as WORDSNET-8883) have been fixed in this .NET update and this Java update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.

The issues you have found earlier (filed as PDFNEWNET-35781) have been fixed in Aspose.Pdf for .NET 8.9.0.



Hi there,


Thanks for your patience. As stated above, PDFNEWNET-35796 is fixed in the Aspose.Pdf for .NET 8.9.0 release. Please check the following code snippet to convert PDF to DOC/DOCX; the output DOC/DOCX document now contains the missing properties and settings XML files.

// Load the source PDF; inFile holds its path.
Document doc = new Document(inFile);

DocSaveOptions saveOptions = new DocSaveOptions();
// Request DOCX output.
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
// Use the Flow recognition mode.
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;

// Derive the output name from the input name and save.
string outFileName = inFile.Replace(".pdf", "_35796.docx");
doc.Save(outFileName, saveOptions);

Please feel free to contact us for any further assistance.


Best Regards,

Hi there,


Thanks for your patience. Our development team is investigating the issue (PDFNEWNET-35795-To traverse through PDF structure). We need some more details for the requirement. What do you mean by “traversing”?

With reference to your discussion in this thread, you want access to text properties, and that can be done successfully with the TextFragmentAbsorber + TextFragment approach.


Please provide more details; I'm probably missing something, so they will help us analyze and implement your feature request.


Looking forward to your feedback.


Best Regards,