Extra paragraph is created when single line TXT document is imported into DOM using Java

dmytro.patkovskyi · May 4, 2020, 7:39pm

Repro in Java:

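    import com.aspose.words.Document;
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    // assertEquals comes from the test framework in use (e.g. JUnit)
    String input = "This string has no line breaks.";
    ByteArrayInputStream inputStream = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
    Document document = new Document(inputStream);
    assertEquals(input, document.getText()); // FAIL: actual text == input + "\r\f"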

This does not happen with other formats (e.g., for an .rtf document with the same content, getText returns input + “\f”).
We use a commercial license, version ‘20.1:jdk17’.

tahir.manzoor · May 4, 2020, 9:14pm

@dmytro.patkovskyi

Please use the Node.toString(SaveFormat.TEXT) method to get the text of a node. Hope this helps.

dmytro.patkovskyi · May 5, 2020, 9:00am

This returns input + “\r\n\r\n” which is even worse.

And I don’t see why handling a txt document vs rtf or docx with the same content should be different. In my opinion, getText() should return exactly the same string (just input) in all these cases.

Update:
I’ve spent some time debugging and found out that the structure of the document in my repro is:

Section
- Body
  - Paragraph 0
    - Run: “This string has no line breaks.”
  - Paragraph 1 (no children)

On paragraph 0: getText() returns “This string has no line breaks.\r”, toString(SaveFormat.TEXT) returns “This string has no line breaks.\r\n”. On paragraph 1: getText() returns “\f”, toString(SaveFormat.TEXT) returns “\r\n”.

Update 2: the bug boils down to Aspose adding an empty Paragraph 1 to the document structure of a plaintext document without line breaks. This does not happen to docx/rtf documents without line breaks. It also does not happen to plaintext documents that do contain line breaks: if you change input in my repro to “This string has a line break.\n”, the document structure remains exactly as above, so no further paragraph is added.
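
For readers following along, the strings reported above are consistent with a simple model in which paragraphs are joined by “\r” and the body is terminated by “\f”. The sketch below is only that model in plain Java; GetTextModel and bodyText are hypothetical names, and nothing here calls Aspose.Words or claims to reflect its implementation.

```java
import java.util.List;

// Hypothetical model of the getText() output reported in this thread.
// It does not use Aspose.Words; it only reproduces the observed strings.
final class GetTextModel {
    static final String PARAGRAPH_BREAK = "\r"; // ends every paragraph but the last
    static final String SECTION_BREAK = "\f";   // ends the last paragraph of the body

    // Joins paragraph texts with "\r" and terminates the body with "\f".
    static String bodyText(List<String> paragraphs) {
        return String.join(PARAGRAPH_BREAK, paragraphs) + SECTION_BREAK;
    }
}
```

Under this model, a body holding the single-line text plus one empty paragraph yields input + "\r\f" (the .txt case), while a body holding only the single-line text yields input + "\f" (the .rtf case).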

tahir.manzoor · May 5, 2020, 4:21pm

@dmytro.patkovskyi

You are facing the expected behavior of Aspose.Words. Body is a section-level node and can only be a child of Section. There can only be one Body in a Section. A minimal valid Body needs to contain at least one Paragraph.

When you import text into the Aspose.Words DOM, a paragraph marking the end of the section is created. You can remove the last empty paragraph of the document using the Body.LastParagraph.Remove method.

The document you are testing does not have a separate empty paragraph for the end of the section.

Moreover, please note that the string returned by the Node.GetText method includes all control and special characters, as described in ControlChar. You can use Node.GetText or Node.ToString, depending on your requirements.
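
If only the visible text matters, one possible post-processing step is to strip the trailing break characters from the returned string. The following is a minimal plain-Java sketch; TrailingBreaks and strip are hypothetical names, not part of Aspose.Words, and the set of characters removed ("\r", "\n", "\f") is an assumption based on the strings quoted in this thread.

```java
// Hypothetical helper, not an Aspose.Words API.
final class TrailingBreaks {
    // Removes any run of '\r', '\n' and '\f' characters at the end of the text.
    // Break characters inside the text are left untouched.
    static String strip(String text) {
        int end = text.length();
        while (end > 0) {
            char c = text.charAt(end - 1);
            if (c != '\r' && c != '\n' && c != '\f') {
                break;
            }
            end--;
        }
        return text.substring(0, end);
    }
}
```

Note that this also discards line breaks that were genuinely present at the end of the source file, so it does not give an exact round trip of the input.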

dmytro.patkovskyi · May 5, 2020, 5:45pm

Simply removing the last paragraph is not enough to fix this problem, because I also need to check whether the file contained line breaks before doing so (the extra paragraph is not inserted if the original file contained line breaks).

It seems that .txt documents require complex special handling (post-processing) that is not necessary for other document formats (an empty paragraph is not added to a single-line rtf or docx). This is counter-intuitive and makes life harder for library consumers.
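
The conditional post-processing described here can be sketched in plain Java. This is an illustration only: paragraphs are modeled as a list of strings, TxtImportWorkaround and its methods are hypothetical names, and in real code the removal step would correspond to the Body.LastParagraph.Remove call suggested earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workaround; paragraphs are modeled as strings
// rather than Aspose.Words nodes.
final class TxtImportWorkaround {
    // True if the original .txt content contained any line break character.
    static boolean hasLineBreak(String source) {
        return source.indexOf('\n') >= 0 || source.indexOf('\r') >= 0;
    }

    // Drops the trailing empty paragraph only when the source had no line
    // breaks. At least one paragraph is always kept.
    static List<String> dropSpuriousParagraph(String source, List<String> paragraphs) {
        List<String> result = new ArrayList<>(paragraphs);
        int last = result.size() - 1;
        if (!hasLineBreak(source) && last > 0 && result.get(last).isEmpty()) {
            result.remove(last); // int argument, so this removes by index
        }
        return result;
    }
}
```

The need to keep the original source text around just to make this decision is the extra burden being described.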

tahir.manzoor · May 5, 2020, 9:46pm

@dmytro.patkovskyi

We have noticed that an extra paragraph is created when a single-line TXT document is imported into the DOM. However, MS Word does not import it. For the sake of correction, we have logged this problem in our issue tracking system as WORDSNET-20378. You will be notified via this forum thread once this issue is resolved. We apologize for the inconvenience.

The “\f” shows that this paragraph is the last paragraph in the Body. You can check this using the Paragraph.IsEndOfSection property. The Node.GetText method includes all control and special characters, so you can use this method for your requirement.

It would be great if you could share the complete details of your use case along with input documents and expected outputs. We will then answer your query accordingly.

aspose.notifier · June 14, 2020, 9:00am

The issues you have found earlier (filed as WORDSNET-20378) have been fixed in the Aspose.Words for .NET 20.6 and Aspose.Words for Java 20.6 updates.