Text extraction issue when long text is contained in MS Word table cell

steveros · September 23, 2015, 11:27am

I have an application that needs to extract plain text from PDF documents. The PDF documents are created from MS Word documents using a non-Aspose PDF creator, and some of the content of each Word document resides in one or more tables. The Aspose library extracts the plain text of the PDF, including the text contained in the Word tables. The problem occurs when the text in a table cell is long and wraps onto multiple lines inside the cell. In that case the Aspose library extracts the text but outputs it as multiple text lines, as if each wrapped line of the cell were an individual CR/LF-terminated line. This is incorrect behavior for my application, because I need the contents of each table cell to be output as a single text line.
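
To make the difference concrete, here is a minimal Python sketch using made-up sample text (not taken from the attached PDF). It does not call the Aspose API. `unwrap_cell` is a hypothetical helper that only shows the single-line output expected for a wrapped cell, and it assumes the fragments belonging to one cell are already known, which is exactly the information that is lost in the extracted plain text.

```python
def unwrap_cell(fragments):
    """Join the wrapped fragments of one table cell into a single line."""
    # Split on any whitespace (including CR/LF) and rejoin with single spaces.
    return " ".join(" ".join(fragments).split())


# Hypothetical sample: one long cell whose text wrapped onto three lines.
wrapped = [
    "This cell contains a long",
    "sentence that wraps onto",
    "several lines.\r\n",
]

# What the extraction currently produces: one CR/LF-terminated line per wrap.
current_output = "\r\n".join(fragment.strip() for fragment in wrapped)

# What is needed: the whole cell on a single text line.
desired_output = unwrap_cell(wrapped)

print(repr(current_output))
print(repr(desired_output))
```

This cannot be applied reliably as a post-processing step on the extracted text alone, because plain text does not say which consecutive lines came from the same cell.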

Now, Adobe Acrobat Pro, using the SaveAsText function, correctly outputs the table contents, with long cell contents unwrapped and on the same output text line. So I know it is possible to extract text correctly (for my purposes).

Please see the attached PDF example, which illustrates the problem.

Thank you,
Stephen

codewarior · September 27, 2015, 5:19pm

Hi Stephen,

Thanks for using our APIs.

I have tested the scenario and was able to reproduce the problem. I have logged it as PDFNEWNET-39445 in our issue tracking system so that it can be fixed. We will look into the details further and keep you updated on the status of the fix. Please be patient and allow us a little time. We are sorry for the inconvenience.

softboy · May 1, 2024, 2:06am

Hi,
@codewarior, we also have the same problem. Has PDFNEWNET-39445 been fixed? I’m using version 23.8.

sergei.shibanov · May 2, 2024, 1:56pm

@softboy
Unfortunately, this problem has not been resolved. The task status displayed in this topic will change when the task is resolved.