Missing runs when listing text for document paragraphs

antonio.russo · March 2, 2010, 1:30pm

Hello,

I’m currently faced with a problem.
I open a document and list text content for each paragraph for the main text of a document.
After presenting the text for the paragraph (through the Paragraph.toText() method), I perform the same action but for each individual run contained inside that same paragraph.
My problem is that, for some documents, the text presented for the paragraph differs from the text presented for its complete set of runs: some of the text disappears when listing the runs.
This is a big problem because I’m trying to transform the paragraph text while strictly maintaining its style formatting (for example, if the original contained the word something and the transformation appends the letter s, then the final document should contain somethings).

The code I used for listing the contents of the file is the following:

public static void main(String[] args) throws Exception
{
    // This method loads the Aspose license. It has been tested and validated.
    ProgramConfigurationSet.initAsposeLicense();

    String nextFile = null;
    Scanner s = new Scanner(System.in);
    System.out.print("File to read:");
    nextFile = s.next();
    FileInputStream inputDocumentStream = new FileInputStream(new File(nextFile));
    Document doc = new Document(inputDocumentStream);
    inputDocumentStream.close();

    doc.joinRunsWithSameFormatting();
    for (Section sec: doc.getSections())
    {
        System.out.println("Retrieved a section");
        for (Paragraph p: sec.getBody().getParagraphs())
        {
            System.out.println("Retrieved a paragraph. Text: |" + p.getText() + "| Number of runs in paragraph:" + p.getRuns().getCount());
            for (Run r: p.getRuns())
            {
                System.out.println("Retrieved a run. Text: |" + r.getText() + "|");
            }
        }
    }
}

I’m also attaching a sample file for which I get the described behaviour (sample.doc), as well as a file with the output that the code above produces (sample.log.txt). If you check this last file, you can easily spot the symptoms by comparing the text retrieved for the paragraphs against the text retrieved for their runs.

My initial suspicion was the doc.joinRunsWithSameFormatting(); instruction, so I’ve already tried running this code without it, but got the same result.
I’ve also tried different OS and JVM versions, and the result is always the same.

I’ve searched the forum for any similar threads but came up empty.

Can anyone help me?

Cheers,
António Russo

AndreyN · March 2, 2010, 2:42pm

Hello

Thanks for your request. There are SmartTags inside paragraphs in your document. The method getRuns() returns only the runs that are direct children of the current paragraph, so runs nested inside a SmartTag are not listed. You can open your document using DocumentExplorer (Aspose.Words demo application) to investigate your document’s structure. As a workaround, I think that in your case you can use DocumentVisitor to achieve what you need. Please follow the link to learn more:
https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/
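
To illustrate why the two listings differ, here is a minimal sketch using a toy node tree. ToyNode, its "kind" labels, and the sample text are hypothetical stand-ins, not Aspose.Words classes or data from the attached sample.doc. It only shows that a direct-children listing skips runs nested one level deeper, while a deep walk finds them. In Aspose.Words itself, a deep lookup such as getChildNodes(NodeType.RUN, true) on the paragraph should behave like the second method, but please verify that against your library version.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyNode is a hypothetical stand-in, not an Aspose.Words class.
class DirectVsDeepRuns {

    static class ToyNode {
        final String kind;  // "paragraph", "smartTag" or "run"
        final String text;  // text carried by a run, empty otherwise
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        ToyNode add(ToyNode child) {
            children.add(child);
            return this;
        }
    }

    // Mimics listing only the runs that are direct children of the paragraph.
    static String directRunText(ToyNode paragraph) {
        StringBuilder sb = new StringBuilder();
        for (ToyNode child : paragraph.children) {
            if (child.kind.equals("run")) {
                sb.append(child.text);
            }
        }
        return sb.toString();
    }

    // Collects run text at any depth, so runs inside a smart tag are included.
    static String deepRunText(ToyNode node) {
        if (node.kind.equals("run")) {
            return node.text;
        }
        StringBuilder sb = new StringBuilder();
        for (ToyNode child : node.children) {
            sb.append(deepRunText(child));
        }
        return sb.toString();
    }

    // Paragraph with one run wrapped in a smart tag (invented sample content).
    static ToyNode sampleParagraph() {
        ToyNode smartTag = new ToyNode("smartTag", "").add(new ToyNode("run", "Lisbon"));
        return new ToyNode("paragraph", "")
                .add(new ToyNode("run", "Fly to "))
                .add(smartTag)
                .add(new ToyNode("run", " today"));
    }

    public static void main(String[] args) {
        ToyNode p = sampleParagraph();
        System.out.println("direct: |" + directRunText(p) + "|");
        System.out.println("deep:   |" + deepRunText(p) + "|");
    }
}
```

The direct listing loses the word held by the nested run, which matches the symptom described in the first post.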
Hope this helps you.
Best regards,

antonio.russo · March 3, 2010, 9:12am

Hi,

Thanks for your help. Your pointer to DocumentExplorer helped a lot in understanding the problem.

Unfortunately, the way our software component is built doesn’t allow the use of a DocumentVisitor implementation, as it must support several other file formats that are not in scope for Aspose.Words.

Anyway, after reading a bit more about the Node classes, we ended up setting up a navigation based on the getFirstChild() and getNextSibling() methods.
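
For anyone landing on this thread, the shape of that navigation can be sketched roughly as follows. This is not our production code: ToyNode is a hypothetical stand-in that only borrows the getFirstChild() and getNextSibling() method names, and the sample content is invented. The idea is to walk each node's children through the sibling chain and descend into anything that is not a run.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyNode is a hypothetical stand-in, not an Aspose.Words class.
class SiblingWalkSketch {

    static class ToyNode {
        final String kind;  // "paragraph", "smartTag" or "run"
        final String text;  // text carried by a run, empty otherwise
        private ToyNode firstChild;
        private ToyNode lastChild;
        private ToyNode nextSibling;

        ToyNode(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        // Appends a child at the end of this node's sibling chain.
        ToyNode append(ToyNode child) {
            if (firstChild == null) {
                firstChild = child;
            } else {
                lastChild.nextSibling = child;
            }
            lastChild = child;
            return this;
        }

        ToyNode getFirstChild() { return firstChild; }
        ToyNode getNextSibling() { return nextSibling; }
    }

    // Visits children in document order; descends into non-run containers.
    static void collectRuns(ToyNode node, List<String> out) {
        for (ToyNode child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.kind.equals("run")) {
                out.add(child.text);
            } else {
                collectRuns(child, out);
            }
        }
    }

    // Paragraph whose middle run sits inside a smart tag (invented sample content).
    static ToyNode sampleParagraph() {
        ToyNode smartTag = new ToyNode("smartTag", "").append(new ToyNode("run", "thing"));
        return new ToyNode("paragraph", "")
                .append(new ToyNode("run", "some"))
                .append(smartTag)
                .append(new ToyNode("run", "s"));
    }

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>();
        collectRuns(sampleParagraph(), runs);
        for (String text : runs) {
            System.out.println("Retrieved a run. Text: |" + text + "|");
        }
    }
}
```

Because each run is visited individually and in order, its text can be changed in place without touching the formatting of its neighbours.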

Thanks,
António Russo

AndreyN · March 3, 2010, 1:19pm

Hi

It is great that you have already resolved the problem.
Best regards,