Vorennor: When converting a searchable PDF to a DOC/DOCX file, the file content is created as an image. Is there a way to get the content as text inside the Word doc?
Vorennor: Second question: what is the best way (using the Aspose.PDF library) to know whether a PDF document is a searchable PDF?
....
var textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
page.Accept(textAbsorber);
string content = textAbsorber.Text;
// True when the page yields no extractable text, i.e. the page is not searchable
return content.Trim().Length == 0;
....
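The per-page check above can be extended to a whole document. The following is only a minimal sketch under stated assumptions: `PdfTextCheck` and `HasTextLayer` are hypothetical names that are not part of Aspose.PDF, and the sketch treats a PDF as searchable as soon as any page returns non-whitespace text, which may not suit documents that mix scanned and text pages.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class PdfTextCheck
{
    // Hypothetical helper (not an Aspose.PDF API): returns true when at least
    // one page exposes extractable text, i.e. the PDF is likely searchable.
    public static bool HasTextLayer(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            foreach (Page page in document.Pages)
            {
                var absorber = new TextAbsorber();
                page.Accept(absorber);
                if (!string.IsNullOrWhiteSpace(absorber.Text))
                {
                    return true; // stop at the first page that yields text
                }
            }
        }
        return false; // no page produced any text
    }
}
```

A call such as `PdfTextCheck.HasTextLayer("source.pdf")` would then return `false` for an image-only scan. Stopping at the first page with text keeps the check cheap on large files.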
Please feel free to contact us for any further assistance.
riccardod96ad: Are there any updates about this issue? I'm evaluating Aspose using the aspose-pdf-11.8.0.jar version and I got the same problem when I performed a conversion from a searchable PDF to a DOCX file.
Hi Riccardo,
Thanks for using our APIs.
I am afraid the earlier reported issues are still pending review and are not yet resolved, as the team has been busy fixing other priority issues. However, please note that these problems were reported for Aspose.Pdf for .NET, whereas you are facing issues with Aspose.Pdf for Java. Please share the input document that causes this problem so that we can test the scenario in our environment. We are sorry for this inconvenience.
Any update on this issue?
The logged issue is still pending because of other high-priority issues and API implementation work. We will update you in this forum thread as soon as some progress is made towards resolving it. Please spare us a little time.
We are sorry for the inconvenience.
Hello All,
Quick check: any update on this issue? We are using the API for our requirements, and our PDFs are being converted into image-only DOCX files.
As we understand it, you are converting OCR'd PDFs to DOCX and the output file contains only images, with no text. Please let us know if we have understood correctly. Also, please share your sample PDF with us, along with the platform (.NET/Java) on which you are using the API. We will then share our feedback.
Yes: we convert OCR'd PDFs to DOCX, and the text is present in the output file as an image.
We cannot share a sample PDF, but you can use any searchable PDF. The platform is .NET.
To render text inside the output DOCX, please try the following approach:
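using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that holds the input and output files
Document pdfDocument = new Document(dataDir + @"source.pdf");
foreach (Page page in pdfDocument.Pages)
{
    // Collect every text fragment on the page
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // OCR text layers are usually invisible; render the text as filled glyphs
        fragment.TextState.RenderingMode = TextRenderingMode.FillText;
    }
    // Remove all images from the page so that only the text is converted
    page.Resources.Images.Clear();
}

// Save as DOCX in Flow mode so the text stays editable
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
saveOptions.RelativeHorizontalProximity = 2.5f;
saveOptions.RecognizeBullets = true;
pdfDocument.Save(dataDir + @"output.docx", saveOptions);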
Please note that the above code snippet was tested with some sample files and produced good results. To reproduce and address the issue you are facing, we need a sample file from you so that we can investigate and resolve it. If you cannot share the file publicly, you can send it in a private message.
You can send a private message by clicking the username and pressing the blue Message button.