Identifying Placeholders for Populating Answers in a Questionnaire Word Document (Python)

kpk123 · December 5, 2024, 5:53am

Description:

I am working on a Python project to automatically populate answers into a questionnaire Word document. The answers are stored in a JSON file, where each question is associated with its respective answer.

The process involves:

Extracting the content from the uploaded Word document.
Matching the questions in the document with the JSON data.
Populating the answers into the questionnaire.

Problem:
Some questionnaires include specific placeholders for answers, such as:

Text boxes
Tick boxes (e.g., Yes/No options)
Bullets immediately following a question
Text areas labeled as “Insert text here,” etc.

Is there a way to programmatically identify these placeholders using Aspose.Words? If so, could you provide guidance or a sample approach for detecting these elements and populating answers appropriately?

Thank you for your assistance!

Professionalize.Discourse · December 5, 2024, 5:53am

@kpk123

To programmatically identify placeholders in a Word document for populating answers from a JSON file using Aspose.Words in Python, you can follow these steps:

Load the Word Document: Start by loading the Word document that contains the questionnaire.
Extract Content: Iterate through the document’s elements to find specific placeholders. You can look for text boxes, tick boxes, and specific text patterns (like “Insert text here”).
Match Questions with JSON Data: For each identified placeholder, check if it corresponds to a question in your JSON data.
Populate Answers: Replace the placeholders with the corresponding answers from the JSON file.

Here’s a sample code snippet to illustrate this process:

import aspose.words as aw
import json

# Load the Word document
doc = aw.Document("path/to/your/questionnaire.docx")

# Load JSON data
with open("path/to/your/answers.json") as json_file:
    answers = json.load(json_file)

# Iterate through paragraphs to find placeholders
for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    paragraph = node.as_paragraph()  # cast the generic Node to a Paragraph
    for run in paragraph.runs:
        run = run.as_run()
        # Check for specific placeholders (Word may split the phrase across several runs)
        if "Insert text here" in run.text:
            question_key = "Your question key here"  # Define how to map questions
            if question_key in answers:
                run.text = answers[question_key]  # Populate answer

# Save the modified document
doc.save("path/to/your/modified_questionnaire.docx")

Identifying Specific Placeholders

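Text Boxes: You can access text boxes using doc.get_child_nodes(aw.NodeType.SHAPE, True), keeping the shapes whose shape_type is aw.drawing.ShapeType.TEXT_BOX, and then check their text.
Tick Boxes: If you are using content controls for tick boxes, you can find them using doc.get_child_nodes(aw.NodeType.STRUCTURED_DOCUMENT_TAG, True); content controls are called structured document tags in Aspose.Words.
Bullets: Check for bullet lists by iterating through the paragraphs and testing paragraph.list_format.is_list_item.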
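
The three checks above can be combined into one pass over the document. The following is only a minimal sketch, assuming the aspose-words package is installed. The names scan_placeholders, is_prompt_text and PROMPT_PATTERN are hypothetical and introduced here for illustration, and the prompt phrases are guesses that need adapting to real questionnaires. Legacy form-field checkboxes or tick boxes typed as plain symbols would need separate handling.

```python
import re

# Hypothetical prompt phrases; extend this pattern for your own questionnaires.
PROMPT_PATTERN = re.compile(r"insert\s+(your\s+)?(text|answer)\s+here", re.IGNORECASE)


def is_prompt_text(text: str) -> bool:
    """Return True if the text contains an 'Insert text here' style prompt."""
    return bool(PROMPT_PATTERN.search(text))


def scan_placeholders(path: str) -> dict:
    """Collect candidate answer placeholders from a Word document."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    doc = aw.Document(path)
    found = {"text_boxes": [], "checkboxes": [], "list_items": [], "prompts": []}

    # Text boxes are Shape nodes whose shape type is TEXT_BOX.
    for node in doc.get_child_nodes(aw.NodeType.SHAPE, True):
        shape = node.as_shape()
        if shape.shape_type == aw.drawing.ShapeType.TEXT_BOX:
            found["text_boxes"].append(shape)

    # Content-control tick boxes are structured document tags of type CHECKBOX.
    for node in doc.get_child_nodes(aw.NodeType.STRUCTURED_DOCUMENT_TAG, True):
        sdt = node.as_structured_document_tag()
        if sdt.sdt_type == aw.markup.SdtType.CHECKBOX:
            found["checkboxes"].append(sdt)

    # Bullets and textual prompts are both ordinary paragraphs.
    for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        para = node.as_paragraph()
        if para.list_format.is_list_item:
            found["list_items"].append(para)
        if is_prompt_text(para.get_text()):
            found["prompts"].append(para)

    return found
```

Each list holds live nodes, so a later step can fill them in place, for example by setting sdt.checked = True on a matched tick box.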

Conclusion

This approach allows you to dynamically identify and replace placeholders in a Word document with answers from a JSON file. Make sure to adapt the placeholder identification logic based on the specific structure of your questionnaire.

For more detailed information on manipulating Word documents with Aspose.Words, refer to the official Aspose.Words documentation.


alexey.noskov · December 5, 2024, 6:10am

@kpk123 I think you can consider using Mail Merge or the LINQ Reporting Engine to fill your document with data.
In the first case, the placeholders are standard MS Word MERGEFIELD fields.
In the second case, the placeholders are tags such as <<[val]>> in the template text. By the way, with the LINQ Reporting Engine you can use JsonDataSource.
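
For the Mail Merge option, a minimal sketch could look like the following. It assumes the template already contains MERGEFIELD fields whose names equal the keys of a flat answers dictionary. The helper names to_merge_arguments and fill_with_mail_merge are hypothetical and not part of Aspose.Words.

```python
def to_merge_arguments(answers: dict):
    """Split a flat {field_name: value} dict into parallel name and value lists."""
    names = list(answers.keys())
    values = [str(answers[name]) for name in names]
    return names, values


def fill_with_mail_merge(template_path: str, output_path: str, answers: dict) -> None:
    """Fill MERGEFIELD fields in the template with the matching answers."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    doc = aw.Document(template_path)
    names, values = to_merge_arguments(answers)
    # Each MERGEFIELD named names[i] receives values[i].
    doc.mail_merge.execute(names, values)
    doc.save(output_path)
```

Values are converted to strings here only to keep the sketch simple.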

kpk123 · December 6, 2024, 6:56am

Testword.docx (20.3 KB)

testanswer_json.zip (382 Bytes)

@alexey.noskov Thank you for your response. I appreciate the suggestions regarding Mail Merge and LINQ Reporting Engine. However, I’m not entirely sure how these would address my specific issue.

To clarify further, I have attached a sample Word document and the corresponding JSON file for reference. The question-matching part of my code is already implemented and working as expected.

My challenge lies in identifying the placeholders associated with the questions in the document where the answers need to be populated. These placeholders may take the form of text boxes, tick boxes, or specific areas marked for answers (e.g., “Insert text here”).

Could you kindly review the attached files and provide sample code on how to identify these placeholders programmatically using Aspose?

alexey.noskov · December 6, 2024, 7:23am

@kpk123 Thank you for the additional information. Can your template and JSON be changed? If so, I would suggest using unique identifiers for the answers and using these identifiers in your template. For example, see the modified template and JSON:
data.zip (241 Bytes)
in.docx (16.7 KB)
Here is the code to fill the template with data from the JSON:

import aspose.words as aw

ds = aw.reporting.JsonDataSource("C:\\Temp\\data.json")
doc = aw.Document("C:\\Temp\\in.docx")
engine = aw.reporting.ReportingEngine()
engine.build_report(doc, ds)
doc.save("C:\\Temp\\out.docx")

And here is the output:
out.docx (13.9 KB)

kpk123 · December 6, 2024, 8:51am

@alexey.noskov Unfortunately, manually defining where the answers should be populated in the template is not feasible for our use case. We need this process to be fully automated.

To address this, I’ve been attempting to identify shapes (e.g., text boxes) that immediately follow a paragraph. Below is the code snippet I used:

for paragraph in doc.get_child_nodes(NodeType.PARAGRAPH, True):  
    text = paragraph.get_text().strip()  

    matched = find_best_match(text, answer_json_filtered_by_page)  

    if matched:  
        next_node = paragraph.next_sibling  
        if next_node and next_node.node_type == NodeType.SHAPE:  
            print("Text box found")  

        while next_node and next_node.node_type != NodeType.SHAPE:  
            next_node = next_node.next_sibling  

        if next_node and isinstance(next_node, Shape) and next_node.shape_type == 1:  # ShapeType.TextBox  
            next_node.text_box.text = matched['Question_Answer']

The issue I encountered is that the code does not identify the next_node as a shape, even when a text box is present immediately after the paragraph.
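
The snippet above calls find_best_match, which is not shown in the thread. To exercise the snippet in isolation, a stand-in along the following lines could be used. This is only a sketch based on difflib from the standard library, not the implementation used in the project. The Question_Text key and the 0.8 threshold are assumptions; only Question_Answer appears in the original code.

```python
from difflib import SequenceMatcher


def find_best_match(text, answers, threshold=0.8, question_key="Question_Text"):
    """Return the answer entry whose question is most similar to `text`, or None."""
    # Normalise case and whitespace so tabs and line ends do not affect the score.
    cleaned = " ".join(text.lower().split())
    if not cleaned:
        return None

    best_entry, best_score = None, 0.0
    for entry in answers:
        candidate = " ".join(str(entry.get(question_key, "")).lower().split())
        score = SequenceMatcher(None, cleaned, candidate).ratio()
        if score > best_score:
            best_entry, best_score = entry, score

    # Reject weak matches so headings and empty paragraphs are skipped.
    return best_entry if best_score >= threshold else None
```

Returning None for weak matches keeps the calling loop from treating every paragraph as a question.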

alexey.noskov · December 6, 2024, 9:13am

@kpk123 A Shape cannot be on the same level as a Paragraph; a Shape is always a child of a Paragraph. Please see our documentation to learn more about the Aspose.Words Document Object Model:
https://docs.aspose.com/words/python-net/aspose-words-document-object-model/
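
Because of this, the text box has to be searched for among the children of the question paragraph and of the few paragraphs after it, not among the paragraph's siblings. The following is a rough sketch, assuming the aspose-words package is installed. The helper names following_nodes, find_text_box and set_text_box_text, and the lookahead of two paragraphs, are hypothetical choices made for illustration.

```python
def following_nodes(node, limit):
    """Yield `node` itself and then up to `limit` of its next siblings."""
    count = 0
    while node is not None and count <= limit:
        yield node
        node = node.next_sibling
        count += 1


def find_text_box(question_node, lookahead=2):
    """Return the first text box inside the question paragraph or those after it."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    for node in following_nodes(question_node, lookahead):
        if node.node_type != aw.NodeType.PARAGRAPH:
            continue
        # Shapes are descendants of the paragraph, so search inside it.
        for child in node.as_paragraph().get_child_nodes(aw.NodeType.SHAPE, True):
            shape = child.as_shape()
            if shape.shape_type == aw.drawing.ShapeType.TEXT_BOX:
                return shape
    return None


def set_text_box_text(shape, text):
    """Replace the content of the text box's first paragraph with `text`."""
    import aspose.words as aw

    para = shape.first_paragraph
    if para is None:
        # An empty text box has no paragraph yet, so create one.
        para = aw.Paragraph(shape.document)
        shape.append_child(para)
    para.remove_all_children()
    para.append_child(aw.Run(shape.document, text))
```

The text is written into a Run inside the shape's paragraph because the content of a text box is made of ordinary paragraphs and runs.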

In your document, however, there are shapes only in question 1.5.1.2. In the other cases the answer areas are paragraphs. For example, internally question 1.3.1 looks like this:

<w:p w14:paraId="2B096643" w14:textId="169CB894" w:rsidR="000D2CAD" w:rsidRPr="00F53AF7" w:rsidRDefault="000D2CAD" w:rsidP="000D2CAD">
	<w:pPr>
		<w:pStyle w:val="Inspring"/>
		<w:rPr>
			<w:lang w:val="en-GB"/>
		</w:rPr>
	</w:pPr>
	<w:r w:rsidRPr="00F53AF7">
		<w:rPr>
			<w:lang w:val="en-GB"/>
		</w:rPr>
		<w:t>1.3.1</w:t>
	</w:r>
	<w:r w:rsidRPr="00F53AF7">
		<w:rPr>
			<w:lang w:val="en-GB"/>
		</w:rPr>
		<w:tab/>
	</w:r>
	<w:r>
		<w:rPr>
			<w:lang w:val="en-GB"/>
		</w:rPr>
		<w:t>Question one</w:t>
	</w:r>
	<w:r w:rsidRPr="00F53AF7">
		<w:rPr>
			<w:lang w:val="en-GB"/>
		</w:rPr>
		<w:t>.</w:t>
	</w:r>
</w:p>
<w:p w14:paraId="22A4E37F" w14:textId="77777777" w:rsidR="000D2CAD" w:rsidRPr="00F53AF7" w:rsidRDefault="000D2CAD" w:rsidP="000D2CAD">
	<w:pPr>
		<w:pStyle w:val="Kader"/>
		<w:pBdr>
			<w:top w:val="single" w:sz="4" w:space="1" w:color="BFBFBF" w:themeColor="background1" w:themeShade="BF"/>
			<w:left w:val="single" w:sz="4" w:space="4" w:color="BFBFBF" w:themeColor="background1" w:themeShade="BF"/>
			<w:bottom w:val="single" w:sz="4" w:space="1" w:color="BFBFBF" w:themeColor="background1" w:themeShade="BF"/>
			<w:right w:val="single" w:sz="4" w:space="4" w:color="BFBFBF" w:themeColor="background1" w:themeShade="BF"/>
		</w:pBdr>
		<w:shd w:val="clear" w:color="auto" w:fill="F9F9F9"/>
		<w:rPr>
			<w:lang w:val="en-GB"/>
		</w:rPr>
	</w:pPr>
</w:p>

As you can see, there are two paragraphs: one with the question, and another, empty one with borders and a background.
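
Such an answer paragraph can be recognised by its style and by being empty. The following is a minimal sketch under two assumptions: the style name equals the style ID "Kader" seen in the XML (in Word the two can differ), and other templates use other styles, so ANSWER_STYLE_NAMES has to be adapted. The helper names is_answer_paragraph and fill_answer_after are hypothetical.

```python
# Style taken from the XML above; treat it as an assumption for other templates.
ANSWER_STYLE_NAMES = {"Kader"}


def is_answer_paragraph(para) -> bool:
    """Return True for an empty paragraph formatted with an answer-box style."""
    style_name = para.paragraph_format.style_name
    # get_text() includes the paragraph break, so strip before testing for emptiness.
    return style_name in ANSWER_STYLE_NAMES and not para.get_text().strip()


def fill_answer_after(question_node, answer, lookahead=2):
    """Write `answer` into the first answer paragraph after the question."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    node = question_node.next_sibling
    for _ in range(lookahead):
        if node is None:
            break
        if node.node_type == aw.NodeType.PARAGRAPH:
            para = node.as_paragraph()
            if is_answer_paragraph(para):
                para.append_child(aw.Run(para.document, answer))
                return True
        node = node.next_sibling
    return False
```

If the style name is not reliable in a given template, the paragraph borders or shading could be tested instead.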