Free Support Forum - aspose.com

[Java] Aspose.Words Document.toString() API does not return the contents of the <title> tag and behaves differently for different XML files

Hi,
I am using aspose-words-18.6-jdk16.jar (Java).

We use this JAR to extract a file's contents as a string with the following API calls:
Document doc = new Document("file_path");
String textContent = doc.toString(SaveFormat.TEXT);

Problem 1:

For the attached file smartdoc.xml, we get only the following text, with no XML tags:

Document Body
testing content of document

The problem here is that the contents of the title tag, such as "amphibians", are missing from the textContent output.

Problem 2:
For the file busdoc.xml, the entire XML content, tags included, is returned in textContent, for example:

<?xml version="1.0" encoding="UTF-8"?><?Xpress productLine="busdoc" ?>

samples.zip (1.3 KB)

@Jaspreet16,

Thanks for your inquiry. Unfortunately, we have not found the attachments in your post. Please attach the following resources here for testing:

  • Your input Word document.
  • Please attach the output Word file that shows the undesired behavior.
  • Please create a standalone console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you have these resources ready, we will start investigating your issue and provide more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

samples_with_text_outputs.zip (2.3 KB)

Kindly see samples_with_text_outputs.zip for the source txt files and the output txt files.
Note that for smartdoc.txt the title content is missing from smartdoc_output.txt,
while for busdoc.txt the whole XML is returned as it is, with the XML tags. So the behavior is inconsistent: in one case the XML is emitted verbatim, and in the other the text content of the TITLE tag is missing.

I am using the following APIs of aspose-words-18.6-jdk16.jar (Java):
Document doc = new Document("smartdoc.txt");
String outputTextContent = doc.toString(SaveFormat.TEXT);

Kindly inspect the value of outputTextContent.

@Jaspreet16,

Thanks for sharing the details. Please note that Aspose.Words mimics the behavior of MS Word.

Aspose.Words imports the Title tag correctly into its DOM. You can get the title of the document using the BuiltInDocumentProperties.Title property. In your case, we suggest using the following code example. Hope this helps.

LoadOptions options = new LoadOptions();
options.setLoadFormat(LoadFormat.HTML);

Document doc = new Document(MyDir + "busdoc.txt", options);
System.out.println(doc.getBuiltInDocumentProperties().getTitle());
doc.save(MyDir + "18.7.txt");

Thanks for the response.

But the above reply implies that there is no way to get the text content of XML tags using the following code.
If a document has several sections and every section has a title tag, we would have to implement recursive code to collect every section's title. Kindly investigate again why the text in other tags such as body is returned, but not the text in the title tag, when using the following code:

Document doc = new Document("DocumentWithSection.txt");
String outputTextContent = doc.toString(SaveFormat.TEXT);

DocumentWithSection (1).zip (641 Bytes)

Kindly also check why the output differs for the two samples in the zip.
For smartdoc.txt, only the text content is returned, but for busdoc.txt the whole XML is returned as it is.
Why is there such a difference?

Thanks,
Jaspreet

samples_with_text_outputs.zip (2.3 KB)

@Jaspreet16,

Thanks for your inquiry. Please note that Aspose.Words supports the load formats listed at the following link.
https://apireference.aspose.com/java/words/com.aspose.words/LoadFormat

Every supported file type has a header that identifies its format and format version. Your input documents are not valid HTML or Open XML documents. If you open them in MS Word, you will get the message "Custom XML elements are not supported by Word". Please check the attached image for details.

ms word.png (29.3 KB)
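
Since the inputs are custom XML rather than a format Aspose.Words loads, one possible workaround is to read the text with the JDK's own XML parser instead. The following is only a minimal sketch and does not use Aspose.Words at all: the class name XmlTextExtractor, its methods, and the sample XML are hypothetical (the real contents of smartdoc.xml are not shown in this thread). It walks every element, so the text of each title tag in every section is collected together with the body text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document; // the JDK DOM Document, not com.aspose.words.Document
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: collects the text of all elements in a custom XML string.
final class XmlTextExtractor {

    /** Returns every non-blank text node in document order, one per line. */
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: inputs carry no DOCTYPE. Rejecting it guards against
            // external-entity tricks in untrusted files.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collectText(dom.getDocumentElement(), out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Depth-first walk: text and CDATA nodes are appended, elements are descended into.
    private static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(text).append('\n');
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like the output quoted in Problem 1.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<smartdoc><title>amphibians</title>"
                + "<body><heading>Document Body</heading>"
                + "<para>testing content of document</para></body></smartdoc>";
        System.out.print(extractText(xml));
    }
}
```

Because the walk is recursive, nested sections need no special handling. To process a file rather than a string, the file's contents can be read into a string first, for example with java.nio.file.Files.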