Convert PDF to DOC or DOCX

Hello,

I am trying to convert a PDF file to DOC or DOCX.
I am using Visual Studio 2010, Windows 7, Aspose.Pdf for .NET 8.8.0 and Aspose.Words for .NET 13.12.0.

I used your code example ('Convert PDF file to DOC or DOCX format') with a searchable PDF file.
I always get a Word file that contains only an image of the PDF page.

Is there a way to get real characters instead of an image of the text?

Thank you for your response

Hi Philippe,


Thanks for using our products.

Can you please share the source PDF file so that we can test the scenario at our end? We are sorry for this inconvenience.

Hello,

Here is the pdf file, in attachment.

Best regards.

Hi Philippe,


Thanks for sharing the resource file.

I have tested the scenario and was able to reproduce the same problem. I have logged it as PDFNEWNET-36333 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the fix. Please be patient and allow us a little time. We are sorry for this inconvenience.

Hello,

Do you have any news about this problem?

Thank you for your response.

Hi Philippe,


Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have asked the team to share an ETA for its resolution. As soon as we have a definite update, we will be more than happy to let you know the status of the fix. Please be patient and allow us a little more time.

We are really sorry for this inconvenience.

Hello,

I have had no news.
My work has been stopped since January.

I bought Aspose.Total for .NET in June 2010 and I have renewed it every year.
My subscription is going to expire on May 29, 2014.

What should I do?

Best regards

Hi Philippe,


First of all, please accept our humble apologies for the delay and the inconvenience you have been facing. I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have again asked the development team to share a possible ETA. As soon as we have further updates, we will let you know.

Once again, we are sorry for this delay and inconvenience.

Hi Philippe,


Thanks for your patience.

We have investigated the issue further. As per our observations, the converted output looks like an image because the source PDF file really does contain only an image, with invisible text on top of it.

The document is the output of an OCR tool: the scanned image is placed on the PDF page as it is, and invisible text is added over the image to make the recognized text accessible (searchable and selectable).

The fonts are invisible and have no graphical representation. There is also no font face information.
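As an illustration of the structure described above, here is a minimal Python sketch of a heuristic that flags such a page. It is not part of Aspose.Pdf and says nothing about how the library works internally. It assumes the invisible text is written with PDF text rendering mode 3 (the `3 Tr` operator, "neither fill nor stroke"), which is the usual way OCR tools add a hidden text layer, and that the page content stream has already been decompressed into bytes. The function name `looks_like_ocr_page` and the sample streams are hypothetical.

```python
import re

# "3 Tr" sets text rendering mode 3 (invisible text). The lookbehind rejects
# operands such as "13" or "1.3" that merely end in 3.
INVISIBLE_TEXT_RE = re.compile(rb"(?<![\w.])3\s+Tr\b")

# "/Name Do" paints an XObject. This is usually the scanned image, but it
# could also be a form XObject, so this is only a hint.
PAINT_XOBJECT_RE = re.compile(rb"/[^\s/]+\s+Do\b")


def looks_like_ocr_page(content_stream: bytes) -> bool:
    """Heuristic: True if the page paints an XObject and uses invisible text."""
    has_invisible_text = INVISIBLE_TEXT_RE.search(content_stream) is not None
    paints_xobject = PAINT_XOBJECT_RE.search(content_stream) is not None
    return has_invisible_text and paints_xobject


# Hypothetical, already-decompressed content streams for demonstration.
ocr_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q BT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET"
normal_page = b"BT 0 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET"

if __name__ == "__main__":
    print(looks_like_ocr_page(ocr_page))
    print(looks_like_ocr_page(normal_page))
```

A check of this kind could be used to warn a user that a file is an OCR result before conversion. A plain byte scan can be fooled by string contents or inline image data, so a real implementation would tokenize the content stream properly.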

We can implement an enhancement that converts invisible fonts into visible fonts, so that PDF to DOC conversion can be performed, but there will be the following limitations (inherited from the OCR tool):

  • There will be no font face information - only the CourierNew font will be used
  • There will be no font style information - italic text will look like regular text
  • Text may come out in different sizes even where it appears to be the same size in the image

Please look at the attached 36333_analisys.png image to see these limitations illustrated. The enhancement is possible, but with the limitations stated above, and the current ETA is 9.4.0 (early July release).

Furthermore, if you have any other examples of OCR documents, we highly recommend sharing them, as they will help us implement this feature more appropriately.

Hello,

Thank you for your explanations.

PDF files that are the output of an OCR tool:
The enhancements you describe are essential.
In addition, the end user doesn't necessarily know whether his PDF file is of this type. It would be very useful to get back a status indicating that the file is of this type, so that we can inform the user about the limitations.

Another OCR document example is attached: iPhone_user_guide_extract.pdf

Other PDF files:
There is a problem with bullet lists in the generated Word files: the bullets and their lines of text end up in separate frames.
Example: convert Plaquette_En 1_6.pdf to Plaquette_En 1_6.doc (see attachments).

Please keep me informed about updates.

Best regards

Phil92:

PDF files that are the output of an OCR tool:

The enhancements you describe are essential.
In addition, the end user doesn't necessarily know whether his PDF file is of this type. It would be very useful to get back a status indicating that the file is of this type, so that we can inform the user about the limitations.

Another OCR document example is attached: iPhone_user_guide_extract.pdf

Hi Philippe,

Thanks for sharing the requested sample file. It will definitely help us in investigating and resolving the issue.

Phil92:

Other PDF files:
There is a problem with bullet lists in the generated Word files: the bullets and their lines of text end up in separate frames.
Example: convert Plaquette_En 1_6.pdf to Plaquette_En 1_6.doc (see attachments).

We have logged the bullet list issue as PDFNEWNET-37022 in our issue tracking system for further investigation and resolution. We will keep you updated about both reported issues via this forum thread.

Thanks for your patience and cooperation.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-36333) have been fixed in Aspose.Pdf for .NET 9.4.0.

