How to get text content beside the shapes

Hello!

I have a PDF converted to DOCX. After the conversion, the paragraphs end up inside text boxes. Is there a way to collect the text/paragraphs beside each checkbox?

I tried DocSaveOptions.RecognitionMode.EnhancedFlow and DocSaveOptions.RecognitionMode.Flow, but they give unstable results across different Word documents.

In Word, without the conversion from PDF, I can achieve this by getting the height and width of the checkbox shape and then calling shape.ParentParagraph.GetText(). This does not work if the document was converted from PDF to Word.
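For reference, the approach that works on the original (non-converted) document can be sketched roughly as follows. This is only an illustration: the file path, the class name and the `MaxCheckboxSize` threshold used to decide what counts as a checkbox are assumptions, not values from the real template.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Drawing;

class CheckboxTextSketch
{
    // Hypothetical threshold (in points): shapes no larger than this are treated as checkboxes.
    const double MaxCheckboxSize = 20.0;

    static void Main()
    {
        // Hypothetical path to the original template, not a converted file.
        Document doc = new Document(@"C:\Temp\original.docx");

        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            // Skip shapes that are too large to be a checkbox.
            if (shape.Width > MaxCheckboxSize || shape.Height > MaxCheckboxSize)
                continue;

            Paragraph parent = shape.ParentParagraph;
            if (parent == null)
                continue;

            // GetText() includes the trailing paragraph break, so trim it.
            Console.WriteLine(parent.GetText().Trim());
        }
    }
}
```

This relies on each checkbox shape sitting in the same paragraph as its label, which holds for the original document but not for the converted one.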

I appreciate any help. Thank you.

@JSAN28 Looks like you are using Aspose.PDF to convert PDF to DOCX. Please try using Aspose.Words for conversion:

Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\in.pdf");
doc.Save(@"C:\Temp\out.docx");

Also, please attach your input and output documents here for testing. We will check the issue and provide you with more information.

@alexey.noskov I use Aspose.Words.Document for the save operation, on the output produced with DocSaveOptions.

I attached the files for your reference. orginal_template.docx is the template we use, and shape.ParentParagraph.GetText() works on it. options.pdf is the output generated from the original template, and options.docx is the file converted from that PDF.

orginal_template.docx (13.1 KB)

options.pdf (27.0 KB)

options.docx (14.0 KB)

We generate the PDF from the original template with merge fields and upload it to our storage. When needed, we download the document again and convert it back to DOCX so we can get the fields.

@JSAN28 options.docx was not produced by Aspose.Words. The following code:

Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\in.pdf");
doc.Save(@"C:\Temp\out.docx");

produces the following document: out.docx (9.0 KB)

I am afraid it is impossible. First of all, please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. While loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity. Also, because the MS Word and PDF document models differ, not all MS Word document features are preserved after a DOCX->PDF->DOCX roundtrip. Fields will not be preserved after such a roundtrip.

@alexey.noskov I see. But is there a way to get the list of options from out.docx by locating the checkbox shapes?

@JSAN28 In the out.docx produced by Aspose.Words, each paragraph with a “checkbox” is represented as a list item. So you can get the list item paragraphs and read their text.
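A minimal sketch of that idea, assuming out.docx was produced by Aspose.Words from the PDF as in the snippet earlier in the thread. The class name and the output format are illustrative only, and whether the checkbox glyph appears as the list label depends on how the PDF was recognized.

```csharp
using System;
using Aspose.Words;

class ListItemTextSketch
{
    static void Main()
    {
        // Assumption: this file was saved by Aspose.Words after loading the PDF.
        Document doc = new Document(@"C:\Temp\out.docx");

        // Compute list labels so that ListLabel.LabelString is populated.
        doc.UpdateListLabels();

        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            // Only list item paragraphs are expected to carry a "checkbox".
            if (!para.IsListItem)
                continue;

            string label = para.ListLabel.LabelString;

            // GetText() includes the trailing paragraph break, so trim it.
            string text = para.GetText().Trim();

            Console.WriteLine($"[{label}] {text}");
        }
    }
}
```

If no paragraph reports IsListItem as true, the document most likely came from a different converter and has a different structure.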

@alexey.noskov I tried looping through the paragraphs in the document and checking paragraph.IsListItem, but they are not list items. Is there another way, such as getting the position of the shape and then finding the paragraph nearest to that position?

@JSAN28 Have you tested with the DOCX document produced by Aspose.Words from your PDF, or with the DOCX document you attached earlier? As I mentioned, the document you attached was not produced by Aspose.Words. It looks like it was produced by Aspose.PDF.

If you inspect the structure of the options.docx document you attached, it looks like this:


That makes it impossible to associate the text with a shape.

The DOCX output from Aspose.Words looks like this:


Here each paragraph is a list item.

@alexey.noskov Here is the error I get when I try the code you provided, using Aspose.Words.

@JSAN28 Looks like you are using an old version of Aspose.Words. Please try using the latest 23.12 version of Aspose.Words.

@alexey.noskov I updated Aspose.Words to 23.12 and it generates the file, but when I open the file, it reports an error. I used the options.pdf that I uploaded earlier, with the same code you provided.

@JSAN28 It looks like you are trying to open PDF document using MS Word. Also, it looks like the PDF you are trying to open is produced by Aspose.PDF: