Free Support Forum - aspose.com

Text extraction issue when long text is contained in MS Word table cell

I have an application that needs to extract plain text from PDF documents. The PDF documents are created from MS Word documents using a non-Aspose PDF creator, and some of the content of each Word document resides in one or more tables. The Aspose library extracts the plain text of the PDF, including the text contained in the Word tables. The problem occurs when the text in a table cell is long and wraps onto multiple lines inside the cell. In that case the Aspose library extracts the text but writes it out as multiple lines, as if each wrapped line of the cell were a separate CR/LF-terminated line. This is incorrect for my application, because I need the contents of each table cell to be output on a single text line.

Now, Adobe Acrobat Pro, using the SaveAsText function, correctly outputs the table contents, with long cell contents unwrapped and on the same output text line. So I know it is possible to extract text correctly (for my purposes).

Please see the attached PDF example, which illustrates the problem.

Thank you,
Stephen

Hi Stephen,


Thanks for using our APIs.

I have tested the scenario and was able to reproduce the same problem. I have logged it as PDFNEWNET-39445 in our issue tracking system. We will look further into the details of this problem and will keep you updated on the status of the fix. Please be patient and spare us a little time. We are sorry for the inconvenience.
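
Until a fix is available, one possible stopgap is to post-process the extracted text and rejoin the wrapped lines. The following is only a sketch in plain Python, not an Aspose API: `unwrap_lines` is a hypothetical helper, the sample text is made up, and the heuristic assumes that a logical line ends with sentence-final punctuation (`.`, `!`, `?` or `:`), which will not hold for every document.

```python
# Heuristic sketch: rejoin lines that were split by wrapping inside a table cell.
# Assumption (not from the original post): a logical line ends with one of the
# characters in LINE_END; any other line is treated as wrapped and joined to
# the line that follows it.
LINE_END = (".", "!", "?", ":")


def unwrap_lines(text: str) -> str:
    """Return text with wrapped physical lines merged into logical lines."""
    logical = []   # completed logical lines
    pending = []   # fragments of the logical line being assembled
    for raw in text.splitlines():   # splitlines() handles CR/LF and LF
        line = raw.strip()
        if not line:
            # A blank line always closes the current logical line;
            # the blank line itself is dropped from the output.
            if pending:
                logical.append(" ".join(pending))
                pending = []
            continue
        pending.append(line)
        if line.endswith(LINE_END):
            logical.append(" ".join(pending))
            pending = []
    if pending:
        # Flush a trailing fragment that never reached end punctuation.
        logical.append(" ".join(pending))
    return "\n".join(logical)


if __name__ == "__main__":
    # Hypothetical extracted text: one long cell wrapped over three lines,
    # followed by a short cell.
    extracted = (
        "This cell holds a long sentence that\r\n"
        "wraps onto a second line inside the\r\n"
        "table cell.\r\n"
        "Short cell.\r\n"
    )
    print(unwrap_lines(extracted))
```

Because plain text carries no cell boundaries, this can merge lines that should stay separate (for example, headings without end punctuation), so it is best treated as a temporary workaround rather than a replacement for correct extraction.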