Searchable pdf to doc/docx

Hi,

When converting a searchable PDF to a DOC/DOCX file, the file content is created as an image. Is there a way to get the content as text inside the Word document?

Second question: what is the best way (using Aspose.PDF lib) to know if a PDF document is a searchable PDF?

Best regards,

Christophe
Hi Christophe,

Vorennor:

When converting a searchable PDF to a DOC/DOCX file, the file content is created as an image. Is there a way to get the content as text inside the Word document?

Thanks for your inquiry. We have already noticed the issue and logged a ticket PDFNEWNET-39491 in our issue tracking system for rectification. We have linked your query to the issue id and will update you as soon as it is resolved.

We are sorry for the inconvenience caused.

Best Regards,

Hi Christophe,

Vorennor:

Second question: what is the best way (using Aspose.PDF lib) to know if a PDF document is a searchable PDF?

You can check whether the pages of the PDF document contain text. If my understanding differs from your requirement, please share some more details.

....

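// Extract whatever text the page exposes.
var textabsorber = new Aspose.Pdf.Text.TextAbsorber();
page.Accept(textabsorber);
string content = textabsorber.Text;

// true: the page has no extractable text; false: it has text.
if (content.Trim().Length == 0)
    return true;
return false;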

....
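The per-page check above can be extended to a whole document. What follows is a minimal sketch rather than anything from the Aspose.PDF API: `TextLayerCheck`, `HasText` and `PagesWithoutText` are hypothetical names, and the input is assumed to be the `Text` value a `TextAbsorber` returned for each page, in page order. It only reports whether text can be extracted; it does not predict whether a Word conversion will keep that text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Aspose.PDF.
public static class TextLayerCheck
{
    // True when the extracted page text has at least one non-whitespace character.
    public static bool HasText(string pageText) =>
        !string.IsNullOrWhiteSpace(pageText);

    // Returns the 1-based numbers of the pages whose extracted text is empty.
    // Assumption: pageTexts holds one extracted string per page, in page order.
    public static List<int> PagesWithoutText(IEnumerable<string> pageTexts) =>
        pageTexts
            .Select((text, index) => new { text, page = index + 1 })
            .Where(p => !HasText(p.text))
            .Select(p => p.page)
            .ToList();
}
```

For example, `TextLayerCheck.PagesWithoutText(new[] { "Invoice 42", "  \n", "" })` returns pages 2 and 3. An empty result means every page exposes some text.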


Please feel free to contact us for any further assistance.


Best Regards,

Dear Tilal,


No, that doesn't help at all.
Please open the zipped searchable PDF document, or any other searchable PDF you may have, and load it into Aspose.Pdf. You will be able to get the text using the code you mentioned, but converting to Word (DOC or DOCX) produces a Word document containing only images.

So I still have no idea how to tell whether a PDF is searchable using Aspose.PDF.
My goal is to filter out PDFs that cannot be converted to Word without losing text because of your ticket PDFNEWNET-39491.

Regards,

Hi Christophe,


Thanks for the additional information. I have noticed the issue and logged a ticket, PDFNEWNET-40293, for further investigation and rectification. We will keep you updated on the progress of its resolution.

We are sorry for the inconvenience caused.

Best Regards,
Hi,

Are there any updates about this issue?
I'm evaluating Aspose using the aspose-pdf-11.8.0.jar version and I got the same problem when I performed a conversion from a searchable pdf to a docx file.

Best Regards,
Riccardo
riccardod96ad:
Are there any updates about this issue?
I'm evaluating Aspose using the aspose-pdf-11.8.0.jar version and I got the same problem when I performed a conversion from a searchable pdf to a docx file.
Hi Riccardo,

Thanks for using our APIs.

I am afraid the earlier reported issues are still pending review and are not yet resolved, as the team has been busy fixing other priority issues. However, please note that these problems were reported for Aspose.Pdf for .NET, whereas you are using Aspose.Pdf for Java. Please share the input document that causes the problem so that we can test the scenario in our environment. We are sorry for the inconvenience.

Hi Riccardo,


Thanks for your inquiry. I am afraid the reported issue is still not resolved, as the product team is busy resolving other issues in the queue. However, we have logged a related issue (PDFJAVA-36124) for Aspose.Pdf for Java as well and asked our product team to investigate and share an ETA or update at their earliest convenience. We will notify you as soon as more information is available.

We are sorry for the inconvenience.

Best Regards,

Any update on this issue?

@abhi0476

The logged issue is still pending due to other high-priority issues and API implementations. We will update you in this forum thread as soon as some progress is made towards resolving the issue. Please spare us a little time.

We are sorry for the inconvenience.

Hello All,

Quick check: is there any update on this issue? We are using the API for our requirements, and our PDFs are being converted into DOCX files that contain only images.

@MVK4ATOS

As we understand it, you are converting OCR'd PDFs to DOCX format and the output file contains only images, with no text. Please let us know if we understood correctly. Also, please share your sample PDF with us along with the platform (.NET/Java) on which you are using the API. We will share our feedback with you accordingly.

We are converting an OCR'd PDF to DOCX format, and the text is present in the output file as an image.

I cannot share a sample PDF, but you can use any searchable PDF. The platform is .NET.

@MVK4ATOS

To render text inside the output DOCX, please try the following approach:

// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(dataDir + @"source.pdf");

foreach (Page page in pdfDocument.Pages)
{
    // Collect every text fragment on the page.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    // Make the text visible: switch each fragment to the fill rendering mode.
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }

    // Remove the page images so that only the text is converted.
    page.Resources.Images.Clear();
}

// Save as DOCX in flow mode so the text ends up as editable paragraphs.
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;

pdfDocument.Save(dataDir + @"output.docx", saveOptions);

Please note that the above code snippet was tested with some sample files and produced good results. To reproduce and address the issue you are facing, we need a sample file from you so that we can investigate and resolve it. You can share the file in a private message if you cannot share it publicly.

You can send a private message by clicking the username and pressing the blue Message button.