I am trying to convert a PDF file to DOC or DOCX. I use VS 2010, Windows 7, Aspose.Pdf for .NET 8.8.0 and Aspose.Words for .NET 13.12.0.
I used your code example ('Convert PDF file to DOC or DOCX format') with a searchable PDF file, but I always get a Word file that contains only an image of the PDF file.
Is there a way to get characters instead of an image of the text?
I have tested the scenario and I am able to reproduce the same problem. I have logged it as PDFNEWNET-36333 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the fix. Please be patient and spare us a little time. We are sorry for this inconvenience.
The development team has been busy resolving other priority issues, and I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have asked the team to share an ETA for its resolution. As soon as we have a definite update, we will be more than happy to let you know the status of the fix. Please be patient and spare us a little more time.
First of all, please accept our humble apologies for the delay and inconvenience you have been facing. I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have again asked the development team to share a possible ETA. As soon as we have further updates, we will let you know.
Once again, we are sorry for this delay and inconvenience.
We have further investigated the issue reported earlier. As per our observations, the source PDF file looks like an image because it really does contain only an image, with invisible text on top of it.
The document is the output of an OCR tool: the image is placed on the PDF page as it is, and invisible text is added over the image to make the recognized text accessible.
The fonts are invisible and have no graphical representation. There is also no font face information.
We can implement an enhancement that converts invisible fonts into visible fonts so that PDF to DOC conversion can be performed, but there will be the following limitations (imposed by the OCR tool):
There will be no font face information: only the CourierNew font will be used
There will be no font style information: italic text will look like regular text
The text will have different sizes even where it looks the same size in the image
Please look at the attached 36333_analisys.png image, which illustrates these limitations. The enhancement is possible, but only with the limitations stated above, and the current ETA is 9.4.0 (early July release).
Furthermore, if you have any other examples of OCR documents, we highly recommend sharing those files, as they will help us implement this feature in a more appropriate manner.
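For readers who want to check whether a page carries this kind of OCR text layer, here is a rough sketch. It is not the Aspose.Pdf API and not how the planned enhancement works. It only relies on the PDF specification's text rendering mode 3 ("neither fill nor stroke"), which OCR tools typically use for the hidden text. The function names `classify_text` and `looks_like_ocr_layer` are made up for this illustration, and the code assumes you already have the decoded (uncompressed) content stream of a page as bytes. Nested parentheses inside strings and inline image data are not handled.

```python
import re

# Operators that paint text in a PDF content stream.
TEXT_SHOWING_OPS = {"Tj", "TJ", "'", '"'}

# Very small tokenizer: strings are matched as single tokens so that
# operator-like text inside them (e.g. "(3 Tr)") is never misread.
_TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string, escapes allowed, no nesting
  | <[0-9A-Fa-f\s]*>          # hex string
  | \[|\]                     # array delimiters
  | /[^\s/\[\]()<>]+          # name such as /F1
  | [^\s/\[\]()<>]+           # number or operator
""", re.X)


def classify_text(content: bytes) -> dict:
    """Count text-showing operators drawn visibly vs. with render mode 3."""
    mode = 0            # default text rendering mode is 0 (fill)
    saved = []          # q/Q save and restore the graphics state, incl. Tr
    previous = None     # last token seen, used as the operand of Tr
    counts = {"visible": 0, "invisible": 0}
    for match in _TOKEN.finditer(content):
        tok = match.group(0).decode("latin-1")
        if tok == "q":
            saved.append(mode)
        elif tok == "Q":
            if saved:
                mode = saved.pop()
        elif tok == "Tr":
            if previous is not None and previous.isdigit():
                mode = int(previous)
        elif tok in TEXT_SHOWING_OPS:
            counts["invisible" if mode == 3 else "visible"] += 1
        previous = tok
    return counts


def looks_like_ocr_layer(content: bytes) -> bool:
    """True when the stream shows text, and all of it is invisible."""
    counts = classify_text(content)
    return counts["invisible"] > 0 and counts["visible"] == 0
```

A page for which `looks_like_ocr_layer` returns True is a candidate for the limitations listed above. A real check would also have to confirm that a full-page image is present and to walk every page and form XObject.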
PDF files that are OCR recognition tool results: The enhancement you describe is essential. In addition, the end user doesn't necessarily know whether his PDF file is of this type. It would be very useful to get back a status indicating that the file is of this type, so that the user can be informed about the limitations.
Another OCR document example is attached: iPhone_user_guide_extract.pdf
Other PDF files: There is a problem with bullet lists in Word files: the lines and the bullets are in separate frames. Example: convert Plaquette_En 1_6.pdf to Plaquette_En 1_6.doc (see attachments).
Hi Philippe,
Thanks for sharing the requested sample file. It will definitely help us in investigating and resolving the issue.
Phil92:
Other PDF files: There is a problem with bullet lists in Word files: the lines and the bullets are in separate frames. Example: convert Plaquette_En 1_6.pdf to Plaquette_En 1_6.doc (see attachments).
We have logged your reported issue as PDFNEWNET-37022 in our issue tracking system for further investigation and resolution. We will keep you updated about your reported issues via this forum thread.