Is it possible to extract text with formatting from a PDF document?

We have purchased licenses for Aspose.Word and Aspose.Pdf, and we are in the process of evaluating Aspose.Pdf.Kit.


We used the evaluation version of Aspose.Pdf.Kit to extract text from a PDF and display it in an RTE (Rich Text Editor) and a plain text box. We noticed that the extracted text does not retain its formatting.

Is it possible to retain formatting? We would need the HTML version of the extracted text.

The documentation for Aspose.Pdf.Kit makes no mention of retaining formatting when extracting text.

I also noticed from the documentation for Aspose.Recognition (for .NET) that it converts a PDF to Word, which would then allow us to extract HTML from the Word document using Aspose.Word. However, this depends on a .NET environment, and our production environment is Unix/Java.

Is there any other Aspose product that we can use to get this functionality?

Hello Nimalan,

I am sorry to inform you that extraction of text with formatting information is not supported in Aspose.Pdf.Kit.

Currently, Aspose.Recognition is the only product that can convert an existing PDF into HTML/Word format, but as you mentioned, it is only available for .NET.

We apologize for the inconvenience.

Hi,

The current version of Aspose.Pdf.Kit does not support extracting text with formatting information. However, we plan to develop this feature, and I hope it will be available within about 2-3 months.

I have also logged this requirement as PdfKitJava-6024 in our tracking system so that we can notify you in this thread when it is implemented.

Thanks,

Thanks,
Nimalan

The issues you have found earlier (filed as 6024) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.