DOCX file not parsed properly

tvkreddy92 · November 3, 2014, 3:47am

Hi Team,

I have been using Aspose.Words to parse a token from a .DOC Word file.

Now I want to parse the same token from a .DOCX file.

I assumed that my code would work for .DOCX files as well (correct me if I'm wrong here).

case1: .DOC file with content "{!name#value}"

case2: .DOCX file with content "{!name#value}"

// where {!name#value} is my token format which means give value of name object
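
For illustration, a token in this format could be matched with a regular expression. This is only a minimal sketch that does not use Aspose.Words: the class name TokenSketch, the method parseFirstToken, and the assumption that neither the name nor the value contains '#', '{' or '}' are hypothetical and not taken from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSketch {
    // Assumed grammar: "{!" + name + "#" + value + "}",
    // where neither part contains '#', '{' or '}'.
    private static final Pattern TOKEN =
            Pattern.compile("\\{!([^#{}]+)#([^#{}]+)\\}");

    /** Returns {name, value} for the first token in text, or null if there is none. */
    public static String[] parseFirstToken(String text) {
        Matcher m = TOKEN.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parseFirstToken("Dear {!name#value},");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

A matcher like this only finds a token when the whole token arrives in one string; a fragment such as "{!" on its own yields no match.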

This is my code which tries to parse the token string and replace with desired value:

----------------------------

// rawData of either .DOC or .DOCX file

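import java.io.ByteArrayInputStream;
import com.aspose.words.CompositeNode;
import com.aspose.words.Document;
import com.aspose.words.Node;
import com.aspose.words.Run;

public static void somefn() throws Exception {
    Document doc = new Document(new ByteArrayInputStream(rawData));
    scanNode(doc);
}

// Recursively visits every node and prints the text of each Run
public static void scanNode(Node node) {
    if (node instanceof Run) {
        Run run = (Run) node;
        String template = run.getText();
        System.out.println(template);
        // do something with token
    }
    // Only composite nodes have children to descend into
    if (!(node instanceof CompositeNode))
        return;
    CompositeNode cnode = (CompositeNode) node;
    Node node2 = cnode.getFirstChild();
    while (node2 != null) {
        scanNode(node2);
        node2 = node2.getNextSibling();
    }
}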

Observations:

1) .DOC file prints

{!name#text}

2) .DOCX file prints

{!

name

#text

}

---------------------------------

The output is different for case 1 and case 2.

I need the .DOCX file to be parsed the same way as the .DOC file.

Am I missing something? Should the .DOCX file be parsed differently?

Thanks in advance,

Tvk Reddy

awais.hafeez · November 5, 2014, 1:42am

Hi Tvk,

Thanks for your inquiry. Could you please attach your input Word documents (.doc and .docx files) and complete source code here for testing? We will investigate the issue on our end and provide you more information.

Best regards,

tvkreddy92 · November 5, 2014, 4:31am

Hi Hafeez,

Please find attached an example Java class along with the DOC and DOCX files.

I'm using Aspose.Words version 14.8.0.

You can run the java class and see that both files are parsed differently.

See the exact output for both files below:

{!name#text}

-----------------------

{!

name

#text

}

Thanks,

Tvk

awais.hafeez · November 6, 2014, 5:22am

Hi Tvk,

Thanks for the additional information. Please make the following change in the main method of your code:

Document doc = getDoc(pathnameDoc);

Document docx = getDoc(pathnameDocx);
scanNode(doc);

System.out.println("-----------------------");

docx.joinRunsWithSameFormatting();

scanNode(docx);

The Document.joinRunsWithSameFormatting method joins runs with the same formatting in all paragraphs of the document.

This is an optimization method. Some documents contain adjacent runs with the same formatting. This usually occurs if a document was intensively edited manually. You can reduce the document size and speed up further processing by joining these runs.

The operation checks every Paragraph node in the document for adjacent Run nodes that have identical properties. It ignores the unique identifiers used to track the editing sessions in which runs were created and modified. The first run in every joining sequence accumulates all the text. The remaining runs are deleted from the document.
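
To make the joining behavior concrete, here is a small stand-alone model of it in plain Java. It is only a sketch of the idea and not the Aspose.Words implementation: the SimpleRun class, its format key, and the joinRuns helper are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinRunsSketch {
    /** Hypothetical stand-in for a run: text plus a key summarizing its formatting. */
    public static final class SimpleRun {
        public String text;
        public final String format;

        public SimpleRun(String text, String format) {
            this.text = text;
            this.format = format;
        }
    }

    /** Merges each sequence of adjacent runs with an equal format key into one run. */
    public static List<SimpleRun> joinRuns(List<SimpleRun> runs) {
        List<SimpleRun> joined = new ArrayList<>();
        for (SimpleRun run : runs) {
            SimpleRun last = joined.isEmpty() ? null : joined.get(joined.size() - 1);
            if (last != null && last.format.equals(run.format)) {
                // Same formatting as the previous run: the earlier run accumulates the text.
                last.text += run.text;
            } else {
                // Different formatting (or first run): start a new sequence.
                joined.add(new SimpleRun(run.text, run.format));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<SimpleRun> runs = Arrays.asList(
                new SimpleRun("{!", "plain"),
                new SimpleRun("name", "plain"),
                new SimpleRun("#text", "plain"),
                new SimpleRun("}", "plain"));
        for (SimpleRun r : joinRuns(runs)) {
            System.out.println(r.text);
        }
    }
}
```

In this model the four fragments from the DOCX output collapse into a single run because they share one format key. If part of a token carried different formatting (for example, a bold name), the fragments would stay separate, and the token would have to be reassembled some other way.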

Best regards,

tvkreddy92 · November 10, 2014, 12:08am

Hi Hafeez,

That works great!! Thanks for your help

Tvk