Last shape object is missing when importing from a Word file

Gptrnt · February 3, 2023, 7:53pm

Hi,

In my project I extract the content between certain characters from an uploaded Word document (input) using Aspose, convert it to HTML, and save it in the database. When a customer clicks the document download (output) option, I fetch a customer template and replace a special word with all the extracted content. The downloaded document should be the same as the uploaded document.

But when I upload the input file input.docx (64.7 KB), it gives the output output.docx (12.7 KB). The last shape object is missing in the output document.

I am attaching my sample document wrdHtmlWithReplacePoc.zip (754.3 KB). My output document should be the same as the input file, so please help me figure out the issue.

Thank you

alexey.noskov · February 4, 2023, 6:19am

@Gptrnt The shape is removed by your code in the TokenService.addAgendaItemContent method:

if(tempDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0) {
    tempDoc.getLastSection().getBody().getLastParagraph().remove();
}

You remove the last paragraph from the document if its text is an empty string, but in your case the last paragraph contains a shape, which your condition does not account for.

Gptrnt · February 4, 2023, 7:18pm

Hi,

Yes, I noticed that part, but I need to remove the last paragraph only when it is empty. How can I check for and remove a last paragraph that contains nothing else, such as a shape, only empty text? I also noticed that when the content is extracted from the uploaded document and converted to HTML, only the shapes show up in the HTML, not the text inside the shapes.

So please help me figure out these two issues.

Thank you

alexey.noskov · February 5, 2023, 6:30am

@Gptrnt

You can use the following condition to check whether the paragraph has no child nodes:

if(!tempDoc.getLastSection().getBody().getLastParagraph().hasChildNodes()) {
    // .................
}

The shape in your document is a SmartArt diagram; it does not actually contain text, only placeholders for the content. When the document is converted to HTML, the SmartArt is rendered as an image, so the placeholders are not displayed. MS Word does the same when it renders SmartArt. For example, here is a PDF document produced by MS Word from your input document: ms.pdf (51.5 KB)

If your goal is to preserve all MS Word document features, HTML is not the best option for an intermediate format. It would be better to use DOCX, or FlatOpc if you need to store the document as a string.

Gptrnt · February 15, 2023, 7:20am

Hi,

Can you suggest where and how in my code I can use the DOCX format to store the data in the database, and how the stored content can then be added to the document?

Thank you

alexey.noskov · February 15, 2023, 11:04am

@Gptrnt In your code you save a snippet of the document as HTML in the getHtmlContentFromBookMark method. You can change this method to save the document as DOCX and store it as a byte array.
Then, in the replaceToken method of the TokenService class, pass the byte array into the addAgendaItemContent method. In addAgendaItemContent, load the document from the byte array and insert it as you already do.

Gptrnt · February 15, 2023, 11:48am

Hi,

I am converting the extracted Word content into a string with the code below:

SaveOptions options = new DocSaveOptions();
options.setSaveFormat(SaveFormat.DOC);

ByteArrayOutputStream docStream = new ByteArrayOutputStream();
dstDocument.save(docStream, options);
String dstStr = docStream.toString();

This converts without any error. But when I take the same string and load it into a new document with the code below, it throws “Unsupported file format: Unknown”:

ByteArrayInputStream bais = new ByteArrayInputStream(dstStr.getBytes());
LoadOptions opts = new LoadOptions();
opts.setLoadFormat(LoadFormat.DOC);
Document tempDoc = new Document(bais, opts);

Also, I actually want to convert the document to a DOCX string instead of DOC, but when I try the save format options.setSaveFormat(SaveFormat.DOCX); it throws the error:
An invalid SaveFormat for this options type was chosen.

Can you please help me with the right code?

Thank you

alexey.noskov · February 15, 2023, 1:15pm

@Gptrnt DOC is a binary format, so you cannot store it as a string; the same applies to DOCX. If you need to store documents as a string, you should use the FlatOpc (MS Word XML) format:

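ByteArrayOutputStream docStream = new ByteArrayOutputStream();
dstDocument.save(docStream, SaveFormat.FLAT_OPC);
// Decode with an explicit charset (java.nio.charset.StandardCharsets) instead of the
// platform default; this assumes the Flat OPC XML is written as UTF-8.
String dstStr = new String(docStream.toByteArray(), StandardCharsets.UTF_8);

Then you can load the document using code like this:

// Encode with the same charset that was used to build the string.
ByteArrayInputStream bais = new ByteArrayInputStream(dstStr.getBytes(StandardCharsets.UTF_8));
Document tempDoc = new Document(bais);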

The “An invalid SaveFormat for this options type was chosen” error is expected, since DocSaveOptions is for the binary DOC and DOT (document template) formats. For the DOCX or FLAT_OPC formats you should use OoxmlSaveOptions.
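
To illustrate why the binary round trip fails, here is a minimal sketch that uses only the Java standard library, not Aspose. The class and method names (BinaryStringRoundTrip, toBase64, fromBase64, survivesPlainStringRoundTrip) are hypothetical and exist only for this illustration, and the sample bytes are the OLE2 file signature that binary DOC files start with. It shows that binary bytes do not survive being turned into a String and back, while XML text does. It also shows Base64 as one possible alternative, assuming you want to keep DOC or DOCX bytes in a text column rather than switching to FlatOpc.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryStringRoundTrip {

    // Encodes arbitrary bytes (for example a saved DOCX) as ASCII-safe text.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Restores the exact original bytes from the Base64 text.
    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Mimics docStream.toString() followed by dstStr.getBytes(), but with an
    // explicit charset so the result does not depend on the platform default.
    static boolean survivesPlainStringRoundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        // Sample binary data: the OLE2 signature found at the start of DOC files.
        byte[] binary = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                         (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

        // Sample text data: an XML fragment, as a text-based format would contain.
        byte[] xml = "<w:t>text</w:t>".getBytes(StandardCharsets.UTF_8);

        // Invalid byte sequences are replaced during decoding, so the bytes change.
        System.out.println("binary via String: " + survivesPlainStringRoundTrip(binary));

        // Valid UTF-8 text decodes and re-encodes to the same bytes.
        System.out.println("XML via String:    " + survivesPlainStringRoundTrip(xml));

        // Base64 keeps every byte intact.
        byte[] restored = fromBase64(toBase64(binary));
        System.out.println("binary via Base64: " + Arrays.equals(binary, restored));
    }
}
```

With the Base64 approach, the restored byte array could be wrapped in a ByteArrayInputStream and passed to the Document constructor, as in the loading code above. The trade-off is that Base64 text is about a third larger than the raw bytes.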