Aspose.Words for Python via .NET

Hi Team,
We are using Aspose.Words in our application to extract content from Word documents. While reading a single 60-page document, the whole application hangs, and no exceptions or logs are generated. It’s a DOCX file containing financial reports. The same document works fine when converted to DOC. Since the data is sensitive, sharing the file will not be possible. Any idea how to identify the issue, or thoughts on what the cause could be?

@Sijo.Kolambran can you please describe in more detail what you are trying to do when the app hangs (it would be helpful if you post the code you are using)?
Also, without the original document, it will be difficult to reproduce your problem. Can you replace the sensitive information and attach the document?

This is the code we are using to parse the content of the Word file:

from typing import List

import aspose.words as aw

# TextwordBody and table2df are defined elsewhere in our application (not shown).

def parse_word(file_path):
    # 1) Load the Word document
    doc = aw.Document(file_path)
    parsed_word_result: List[TextwordBody] = []

    # 2) Process the document one page at a time
    for page in range(doc.page_count):
        # 3) Extract the current page as a standalone document
        extracted_page = doc.extract_pages(page, 1)
        tables = []
        header_footers = []
        paragraph_list = []
        comments = []

        # 4) Extract table data
        for table in extracted_page.get_child_nodes(aw.NodeType.TABLE, is_deep=True):
            # Cast the generic node to a Table before converting it
            table = table.as_table()
            df = table2df(table)
            tables.append(df)

        # 5) Extract headers and footers
        for header_footer in extracted_page.get_child_nodes(aw.NodeType.HEADER_FOOTER, is_deep=True):
            header_footer = header_footer.to_string(aw.SaveFormat.TEXT).strip()
            header_footers.append(header_footer)

        # 6) Extract comments
        for comment in extracted_page.get_child_nodes(aw.NodeType.COMMENT, is_deep=True):
            comment = comment.to_string(aw.SaveFormat.TEXT).strip()
            comments.append(comment)

        # 7) Extract the rest of the text from the page
        paragraphs = extracted_page.get_child_nodes(aw.NodeType.PARAGRAPH, is_deep=True)

        for i, para in enumerate(paragraphs):
            # Skip paragraphs that sit inside a table, and paragraphs whose
            # parent is a comment or a header/footer; keep everything else as text.
            if (para.get_ancestor(aw.NodeType.TABLE) is None
                    and para.parent_node.node_type != aw.NodeType.COMMENT
                    and para.parent_node.node_type != aw.NodeType.HEADER_FOOTER):

                body = para.to_string(aw.SaveFormat.TEXT).strip()

                # If the next paragraph belongs to a comment, remove its text from the body
                if i + 1 < paragraphs.count:
                    if paragraphs[i + 1].parent_node.node_type == aw.NodeType.COMMENT:
                        strip_line = paragraphs[i + 1].to_string(aw.SaveFormat.TEXT).strip()
                        body = body.replace(strip_line, ' ')

                paragraph_list.append(body)

        parsed_word_result.append(
            TextwordBody(
                body=' '.join(paragraph_list),
                header_footer=' '.join(header_footers),
                comment=' '.join(comments),
                table=tables
            )
        )

    return parsed_word_result

If we replace the content, the issue is resolved and we can no longer reproduce the error.

@Sijo.Kolambran from what you say, the problem seems to be caused by an error in the XML structure of the docx file, or because it contains a resource that the API cannot resolve or replace. In any case, it is impossible for us to determine the cause of the problem and recommend a solution if we do not have the source file (all files you upload to this forum are only accessible by the Aspose support team).
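
When the file itself cannot be shared, the container can still be sanity-checked locally: a .docx is a ZIP archive of XML parts, so each part can be read and parsed without going through Aspose.Words. The sketch below uses only the Python standard library. The function name `check_docx_parts` is made up for illustration, and a clean result does not rule out a problem, since a part can be well-formed XML and still cause trouble for a particular parser.

```python
import zipfile
import xml.etree.ElementTree as ET


def check_docx_parts(file_path):
    """Return a list of (part_name, problem) tuples; an empty list means no problem was found."""
    try:
        archive = zipfile.ZipFile(file_path)
    except zipfile.BadZipFile as exc:
        return [("<archive>", f"not a valid ZIP container: {exc}")]

    problems = []
    with archive:
        for name in archive.namelist():
            # Reading a member verifies its CRC, so corrupt members surface here.
            try:
                data = archive.read(name)
            except zipfile.BadZipFile as exc:
                problems.append((name, f"cannot be read: {exc}"))
                continue
            # Only XML parts and relationship files are expected to parse as XML.
            if name.endswith((".xml", ".rels")):
                try:
                    ET.fromstring(data)
                except ET.ParseError as exc:
                    problems.append((name, f"XML is not well-formed: {exc}"))
    return problems
```

Calling `check_docx_parts("report.docx")` (a placeholder file name) returns the parts that failed, which can be reported here without attaching the document.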

Has a similar issue with Word files (specifically DOCX) been reported before? If so, can you let us know? We will try to replicate the issue with a mock-up file.

Also, if possible, can we connect over a meeting? We would be able to describe the issue better and show what’s happening. Let me know if that’s possible.

Has a similar issue with Word files (specifically DOCX) been reported before? If so, can you let us know? We will try to replicate the issue with a mock-up file.

I cannot find an open ticket that matches your scenario in Python or .NET.

Sorry, but this is the only channel available to provide support.

To mitigate the problem, I recommend that you upgrade to the latest version of the Aspose API and check and install all the fonts included in the document.
Also, do you know how the document is created?

Is it okay if we just upgrade the Aspose.Words Python package to the latest version? And how do we check and install all the fonts?

Is it okay if we just upgrade the Aspose.Words Python package to the latest version?

Yes, you can check the latest package here.

And how do we check and install all the fonts?

To find out which fonts the document uses, you can follow the following guide (I usually use the first approach). Installing the fonts depends on your operating system, but it is usually as simple as downloading them from Google and installing them locally.
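
For documents that cannot be loaded at all, the declared fonts can also be read straight from the file. This is not the approach from the guide mentioned above; it is a minimal sketch that assumes the font table sits at the conventional location `word/fontTable.xml` inside the .docx archive. The function name `list_declared_fonts` is hypothetical, and the font table may not cover every font referenced in the text.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by the <w:font w:name="..."> entries.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_declared_fonts(file_path):
    """Return the sorted font names declared in word/fontTable.xml, or [] if the part is missing."""
    with zipfile.ZipFile(file_path) as archive:
        if "word/fontTable.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/fontTable.xml"))
    names = {font.get(f"{{{W_NS}}}name") for font in root.iter(f"{{{W_NS}}}font")}
    return sorted(name for name in names if name)
```

The resulting list can then be compared with the fonts installed on the machine that runs the extraction.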

Hi,
We tried upgrading the package, but the issue remains the same. When execution reaches the doc = aw.Document(file_path) statement, it just freezes: the code doesn’t move forward, and there is no timeout error or exception. It stays like this for hours. Is there a way to break out of this statement and raise an exception?
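
One general-purpose way to turn a hang like this into an exception, independent of Aspose.Words, is to run the load in a separate Python process and stop that process after a time limit. The sketch below is only an illustration under stated assumptions: `run_script_with_timeout` and `LOAD_SCRIPT` are made-up names, and the child script assumes aspose-words is installed for the same interpreter. It does not explain why the load hangs; it only keeps one bad file from blocking a whole batch.

```python
import subprocess
import sys

# Script executed in the child process: load the document and print its page count.
# Assumption: aspose-words is importable by the interpreter in sys.executable.
LOAD_SCRIPT = (
    "import sys\n"
    "import aspose.words as aw\n"
    "doc = aw.Document(sys.argv[1])\n"
    "print(doc.page_count)\n"
)


def run_script_with_timeout(script, args, timeout_s):
    """Run a Python script in a child process; raise TimeoutError if it exceeds timeout_s seconds."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"script did not finish within {timeout_s} s") from exc
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr.strip())
    return completed.stdout.strip()
```

For example, `run_script_with_timeout(LOAD_SCRIPT, [file_path], 120)` would either return the page count as text or raise `TimeoutError` after two minutes, so the calling code can log the file name and move on.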

@Sijo.Kolambran the document has a property called warning_callback that can be used to log the errors. But, as I said, that is a property of the Document object, so you need to have the loaded document to be able to access it. This seems to be an edge case in document loading, related specifically to the structure of your file, and without access to that file I cannot provide a solution or raise a new ticket for our Dev team to review.
To try to replicate the document structure, can you provide details about how the document is created (which app is used to create it and the version of that app, whether it is preprocessed by a third-party application first, etc.)?
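
If the origin of the files is not known, the producing application is often recorded inside the file itself. As a rough sketch, assuming the document carries the optional extended-properties part at its conventional location `docProps/app.xml`, the application name and version can be read with the standard library alone. The function name `read_producer_info` is hypothetical, and converters do not always fill in these fields.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OOXML extended-properties part (docProps/app.xml).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def read_producer_info(file_path):
    """Return the Application and AppVersion values from docProps/app.xml (None when absent)."""
    info = {"Application": None, "AppVersion": None}
    with zipfile.ZipFile(file_path) as archive:
        if "docProps/app.xml" not in archive.namelist():
            return info
        root = ET.fromstring(archive.read("docProps/app.xml"))
    for key in info:
        element = root.find(f"{{{EP_NS}}}{key}")
        if element is not None:
            info[key] = element.text
    return info
```

The two values returned are the kind of detail requested above and can be posted without exposing any document content.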

I couldn’t find any docstring for this doc.warning_callback() function. Can I get some information on how to use it and what output it gives? This problem has occurred for a lot of our documents, not just a single case. It would be great if you could help me. I will see if the file can be manipulated and sent to you.

@Sijo.Kolambran you can find the reference document here and an example of use in the Aspose public GitHub repo (line 2806).
If you are facing the same problem loading several documents, please try to attach one of them so that we can inspect it (the documents posted in this forum are only visible to Aspose staff).

Thanks, I will keep you posted after trying this.

@Sijo.Kolambran Unfortunately, in the current version of Aspose.Words for Python there is no way to use callbacks. This feature request is logged as WORDSNET-24685.

FYI @eduardo.canal

Would this not work?

doc = aw.Document(file_path)
warnings = aw.WarningInfoCollection()
doc.warning_callback = warnings
for info in warnings:
    print(info.source)

Hi, I am attaching one of the sample files. Our files have a similar structure; they look like PDFs converted to Word DOCX files. In this sample, not all the pages are loaded. Can you please have a look?
WBC_2022_Sustainability_Supplement_docx.docx (1.7 MB)

@Sijo.Kolambran the problem you have with this document is that it only loads the first 10 pages, right?

Yes. In some cases, nothing is loaded at all for similar documents, and processing gets stuck when we run a bulk set of files.