I have a PDF that was converted to DOCX. After the conversion, the paragraphs end up inside text boxes. Is there a way to collect the text/paragraphs next to each checkbox?
I tried DocSaveOptions.RecognitionMode.EnhancedFlow and DocSaveOptions.RecognitionMode.Flow, but they give inconsistent results across different Word documents.
In a Word document that has not gone through the PDF conversion, I can achieve this by checking the height and width of the checkbox shape and then calling shape.ParentParagraph.GetText(). This does not work if the document was converted from PDF to Word.
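For reference, the shape-based approach described above might look roughly like the sketch below. It is only an illustration: the file path and the size threshold used to decide that a shape "looks like" a checkbox are assumptions, not values from the actual template.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Drawing;

class CheckboxTextFromShapes
{
    static void Main()
    {
        // Hypothetical path to the original (non-converted) template.
        Document doc = new Document(@"C:\Temp\template.docx");

        // Assumption: checkbox shapes are small and roughly square.
        // The 20 pt limit is illustrative and should be tuned to the real template.
        const double maxCheckboxSize = 20.0;

        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            bool looksLikeCheckbox =
                shape.Width <= maxCheckboxSize &&
                shape.Height <= maxCheckboxSize &&
                Math.Abs(shape.Width - shape.Height) < 1.0;

            // Skip shapes that do not match or have no parent paragraph.
            if (!looksLikeCheckbox || shape.ParentParagraph == null)
                continue;

            // GetText() includes control characters such as the paragraph break,
            // so trim before using the value.
            string text = shape.ParentParagraph.GetText().Trim();
            Console.WriteLine(text);
        }
    }
}
```

This relies on the checkbox shape and its label sharing one paragraph, which is presumably why it stops working once the conversion moves the text into separate text boxes.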
@alexey.noskov I’m using Aspose.Words.Document for the save operation, on the output produced with DocSaveOptions.
I have attached the files for your reference. The original template is the one we use, and shape.ParentParagraph.GetText() works on it. Options.pdf is the PDF generated from the original template, and options.docx is the file converted back from that PDF.
We generate the PDF from the original template with merge fields and upload it to our storage. Later, if needed, we download the document again and convert it back to DOCX so that we can read the fields.
@JSAN28 options.docx was not produced by Aspose.Words. The following code:
Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\in.pdf");
doc.Save(@"C:\Temp\out.docx");
produces the following document: out.docx (9.0 KB)
I am afraid it is impossible. First of all, please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. While loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity. Also, because of the differences between the MS Word and PDF document models, not all MS Word document features are preserved after a DOCX->PDF->DOCX roundtrip. Fields will not be preserved after such a roundtrip.
@JSAN28 In the out.docx produced by Aspose.Words, each paragraph with a “checkbox” is represented as a list item. So you can get the list item paragraphs and read their text.
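A minimal sketch of the list-item approach, assuming the document was produced by Aspose.Words from the PDF and saved as C:\Temp\out.docx as in the code above. Whether the list label actually contains the checkbox glyph depends on how the PDF was recognized, so printing the label here is only a guess at where that symbol ends up.

```csharp
using System;
using Aspose.Words;

class CheckboxTextFromListItems
{
    static void Main()
    {
        // Output of the PDF -> DOCX conversion shown earlier in the thread.
        Document doc = new Document(@"C:\Temp\out.docx");

        // List labels are computed values, so update them before reading.
        doc.UpdateListLabels();

        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            // Only paragraphs that belong to a list are of interest here.
            if (!para.IsListItem)
                continue;

            // The label is the bullet/number text (assumed to hold the "checkbox").
            string label = para.ListLabel.LabelString;

            // Plain text of the paragraph without the trailing paragraph break.
            string text = para.ToString(SaveFormat.Text).Trim();

            Console.WriteLine(label + "\t" + text);
        }
    }
}
```

If no paragraph reports IsListItem as true, the input is most likely not the document produced by Aspose.Words.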
@alexey.noskov I tried checking the document by looping through the paragraphs and testing paragraph.IsListItem, but they are not list items. Is there another way, such as getting the position of the shape and then finding the paragraph nearest to that position?
@JSAN28 Have you tested with the DOCX document produced by Aspose.Words from your PDF, or with the DOCX document you attached earlier? As I mentioned, the document you attached was not produced by Aspose.Words. It looks like it was produced by Aspose.PDF.
If you inspect the structure of the options.docx document you attached, it looks like this:
@alexey.noskov I updated Aspose.Words to 23.12 and it generates the file, but when I open the file, Word reports that it has an error. I used the options.pdf that I uploaded earlier, with the same code you provided.