File Format

Kusumanchi.Rajesh · November 3, 2015, 3:29am

Hi,

When I run the line Document doc = new Document(inStream); with the attached file, an exception is thrown.

Exception in thread "main" com.aspose.words.UnsupportedFileFormatException: Unknown file format.
at com.aspose.words.Document.zzY(Unknown Source)
at com.aspose.words.Document.zzZ(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)
at com.aspose.words.Document.<init>(Unknown Source)

The same exception is thrown when a different file format like .pdf is used.

Is there a way to get different exceptions for the two scenarios? That is, for any file format other than .docx or .doc it should report an unsupported file format, and if the file is corrupted it should report that the file is corrupted.

Regards,

Rajesh

tahir.manzoor · November 3, 2015, 8:46am

Hi Rajesh,

Thanks for your inquiry. Please use the FileFormatUtil.detectFileFormat method to detect the format of a document stored in a disk file or stream. Please see the following documentation link for reference.
https://docs.aspose.com/words/java/detect-file-format-and-check-format-compatibility/

Please use the following code example to achieve your requirements. Hope this helps you.

FileFormatInfo info = FileFormatUtil.detectFileFormat(MyDir + "testcorript.docx");
// Display the document type. 
switch (info.getLoadFormat())
{
    case LoadFormat.DOC: System.out.println("\tMicrosoft Word 97-2003 document.");
        break;
    case LoadFormat.DOT: System.out.println("\tMicrosoft Word 97-2003 template.");
        break;
    case LoadFormat.DOCX: System.out.println("\tOffice Open XML WordprocessingML Macro-Free Document.");
        break;
    case LoadFormat.DOCM: System.out.println("\tOffice Open XML WordprocessingML Macro-Enabled Document.");
        break;
    case LoadFormat.DOTX: System.out.println("\tOffice Open XML WordprocessingML Macro-Free Template.");
        break;
    case LoadFormat.DOTM: System.out.println("\tOffice Open XML WordprocessingML Macro-Enabled Template.");
        break;
    case LoadFormat.FLAT_OPC: System.out.println("\tFlat OPC document.");
        break;
    case LoadFormat.RTF: System.out.println("\tRTF format.");
        break;
    case LoadFormat.WORD_ML: System.out.println("\tMicrosoft Word 2003 WordprocessingML format.");
        break;
    case LoadFormat.HTML: System.out.println("\tHTML format.");
        break;
    case LoadFormat.MHTML: System.out.println("\tMHTML (Web archive) format.");
        break;
    case LoadFormat.ODT: System.out.println("\tOpenDocument Text.");
        break;
    case LoadFormat.OTT: System.out.println("\tOpenDocument Text Template.");
        break;
    case LoadFormat.DOC_PRE_WORD_97: System.out.println("\tMS Word 6 or Word 95 format.");
        break;
    case LoadFormat.UNKNOWN:
    default: System.out.println("\tUnknown format.");
        break;
}
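To tell a damaged Word file apart from a file that was never a Word document, one option is to look at the file's leading bytes when detection reports an unknown format. The sketch below is plain Java and does not use or reproduce Aspose.Words internals; the class and enum names (FormatSniffer, Kind) are hypothetical. It relies only on well-known file signatures: .docx is a ZIP package, .doc is an OLE2 compound file, RTF starts with {\rtf and PDF starts with %PDF-. A ZIP or OLE2 signature does not prove the file is a Word document, so treat the result as a hint only.

```java
final class FormatSniffer {
    enum Kind { OLE2_CONTAINER, ZIP_CONTAINER, RTF, PDF, UNKNOWN }

    // Well-known leading byte sequences ("magic numbers").
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] RTF = {'{', '\\', 'r', 't', 'f'};
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};

    /** Classifies a file by its first bytes (pass at least the first 8). */
    static Kind classify(byte[] head) {
        if (startsWith(head, OLE2)) return Kind.OLE2_CONTAINER; // .doc candidates
        if (startsWith(head, ZIP)) return Kind.ZIP_CONTAINER;   // .docx candidates
        if (startsWith(head, RTF)) return Kind.RTF;
        if (startsWith(head, PDF)) return Kind.PDF;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(classify(pdfHead));
    }
}
```

Under these assumptions, a file that looks like a ZIP or OLE2 container but still fails to load is a reasonable candidate for "corrupted", while a PDF signature points to "unsupported format".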

Kusumanchi.Rajesh · November 4, 2015, 1:50am

Hi Tahir,

This is the new query I have. As explained to you over the chat window, PFA the screenshots in notepad++ and the text file.

The special characters each take up more than one offset. Also, the nonbreaking hyphen is changed to some Unicode character.

Could you please help me normalize these special characters so that each one is treated as a single character instead of a combination of two or more characters?

Regards,

Rajesh

tahir.manzoor · November 5, 2015, 1:04am

Hi Rajesh,

Thanks for your inquiry. We have tested the scenario using the following code example and could not reproduce the issue. We have attached the output text document with this post for your reference.

There is no issue if we copy the text from the output window of Eclipse and paste it into notepad++. Please check the attached image for detail.

Document doc = new Document(MyDir + "TestDocument.docx");
for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    System.out.print(para.toString(SaveFormat.TEXT));
}
doc.save(MyDir + "Out.txt");

Kusumanchi.Rajesh · November 5, 2015, 1:21am

Hi Tahir,

I request you to go through all of the documents attached to the previous mail. Please check the offsets after the special symbols. Also, the nonbreaking hyphen is displayed as some Unicode character, which is not desired.

Regards,

Rajesh

tahir.manzoor · November 5, 2015, 3:51am

Hi Rajesh,

Thanks for your inquiry. We have not found any Word document in this forum thread except the one which you sent via email.

Node.toString returns correct output. The text output looks good in notepad++. Could you please share some more detail about your requirements and why you need the notepad++ offset (Ctrl + G)? We suggest you use String class methods to get the position of a character. Hope this helps you.
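As a rough illustration of the String-based approach, the plain-Java sketch below compares a character's position inside a String with the number of UTF-8 bytes that precede it. It assumes that the extra offsets seen in the editor are byte positions in a UTF-8 encoded file, which this thread does not confirm; the class and method names (OffsetDemo, utf8ByteOffset) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class OffsetDemo {
    /** Number of UTF-8 bytes that precede the char at charIndex. */
    static int utf8ByteOffset(String text, int charIndex) {
        return text.substring(0, charIndex)
                   .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "a\u00A7b"; // 'a', SECTION SIGN, 'b'
        int charIndex = text.indexOf('b');                // position counted in chars
        int byteOffset = utf8ByteOffset(text, charIndex); // position counted in UTF-8 bytes
        // The section sign is one char but two UTF-8 bytes, so the two values differ.
        System.out.println(charIndex + " " + byteOffset);
    }
}
```

If that assumption holds, the text itself needs no normalization: the character is already a single char in the String, and only the byte-based view counts it more than once.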

Kusumanchi.Rajesh · November 24, 2015, 12:47am

Hi Tahir,

In the mentioned thread, the attached .docx file has a special character ‘-’ which is not reflected in the text generated by the Aspose code. In the text it is shown as RS. Could you please help me with this? I would like the text to show that symbol as ‘-’.

Regards,

Rajesh

tahir.manzoor · November 25, 2015, 12:13am

Hi Rajesh,

Thanks for your inquiry. Please note that Aspose.Words mimics the same behavior as MS Word does. If you convert your document to txt file format using MS Word and open it in Notepad++, you will get the same output. See the attached image for detail.
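If the exported text must show an ordinary hyphen, one possible workaround is to post-process the string. The sketch below is plain Java and assumes that the RS shown by the editor is the control character U+001E (record separator), which Word uses for its nonbreaking hyphen; the class and method names (HyphenNormalizer, normalize) are hypothetical.

```java
final class HyphenNormalizer {
    // Assumption: Word's nonbreaking hyphen is exported as U+001E,
    // which text editors typically label "RS".
    static final char WORD_NON_BREAKING_HYPHEN = '\u001E';

    /** Replaces every Word nonbreaking hyphen with an ASCII hyphen-minus. */
    static String normalize(String text) {
        return text.replace(WORD_NON_BREAKING_HYPHEN, '-');
    }

    public static void main(String[] args) {
        String exported = "co" + WORD_NON_BREAKING_HYPHEN + "op";
        System.out.println(normalize(exported));
    }
}
```

The same call could be applied to each paragraph's exported text before it is printed or written to a file.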

Please let us know if you have any more queries.

Kusumanchi.Rajesh · November 26, 2015, 5:06am

Hi Tahir,

PFA.

When I use the following code, I get an extra line in the output.

for (Run run : (Iterable<Run>) para.getRuns()) {
    String runtext = run.toString(SaveFormat.TEXT);
    System.out.println(runtext);
}

HYPERLINK "http://cert2-advance.lexis.com/api/document?collection=cases&id=urn:contentItem:3S4X-4VH0-008H-V1V5-00000-00&context=1000516"

Is there any way to avoid this line and get the text as we get it from para.getText()?

tahir.manzoor · November 27, 2015, 4:37am

Hi Rajesh,

Thanks for your inquiry. Your document contains a Hyperlink field. A field in a Word document is a complex structure consisting of multiple nodes that include field start, field code, field separator, field result and field end. Fields can be nested, contain rich content and span multiple paragraphs or sections in a document. The Field class is a “facade” object that provides properties and methods that allow you to work with a field as a single object.

The Start, Separator and End properties point to the field start, separator and end nodes of the field respectively.

The content between the field start and separator is the field code. The content between the field separator and field end is the field result. The field code typically consists of one or more Run objects that specify instructions. The processing application is expected to execute the field code to calculate the field result.

You are getting the correct output, because your code example reads each Run’s text and that includes the Runs that hold the field code. Please use Paragraph.toString(SaveFormat.TEXT) instead. Hope this helps you.

for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    System.out.println(para.toString(SaveFormat.TEXT));
}
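The field structure described above (start, code, separator, result, end) can be illustrated on a raw text string in which the field boundaries appear as control characters. The sketch below is plain Java and not Aspose.Words code. It assumes the conventional Word values U+0013, U+0014 and U+0015 for field start, separator and end; please confirm them against the ControlChar class in your version. The class and method names (FieldCodeStripper, visibleText) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldCodeStripper {
    // Assumed control characters marking field boundaries in raw text.
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Keeps field results and ordinary text; drops field codes and markers. */
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while still inside that field's code.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                inCode.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                    inCode.push(Boolean.FALSE); // now inside the field result
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) inCode.pop();
            } else if (!inCode.contains(Boolean.TRUE)) {
                out.append(c); // not inside any field code
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See " + FIELD_START + " HYPERLINK \"http://example.com\" "
                + FIELD_SEPARATOR + "the link" + FIELD_END + " now";
        System.out.println(visibleText(raw));
    }
}
```

The stack is there because fields can be nested: text is kept only when no enclosing field is still in its code part.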

Kusumanchi.Rajesh · November 27, 2015, 5:25am

Hi Tahir,

I am not using para.getText() because I want the output in the form of runs. Please suggest a method that gives the output in the form of runs without the extra HYPERLINK line.

Regards,

Rajesh

tahir.manzoor · November 27, 2015, 9:46am

Hi Rajesh,

Thanks for your inquiry. Please use the following code example to achieve your requirements, and get the code of the FieldsHelper class from the Aspose.Words for Java examples repository at GitHub.

We suggest you read the following documentation link for reference.

How to Replace Fields with Static Text

Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "Input.docx");
FieldsHelper.convertFieldsToStaticText(doc, FieldType.FIELD_HYPERLINK);
for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    for (Run run : (Iterable<Run>) para.getRuns()) {
        String runtext = run.toString(SaveFormat.TEXT);
        System.out.print(runtext);
    }
    System.out.println("");
}

Kusumanchi.Rajesh · December 9, 2015, 6:16am

Hi Tahir,

PFA.

I am not able to use the following line:

FieldsHelper.convertFieldsToStaticText(doc, FieldType.FIELD_HYPERLINK);

Regards,

Rajesh

tahir.manzoor · December 10, 2015, 12:38am

Hi Rajesh,

Thanks for your inquiry. The FieldsHelper class is not part of the Aspose.Words API. Please get this class from the following GitHub link and include it in your application.

Aspose.Words for Java examples repository at GitHub.

Please let us know if you have any more queries.

Kusumanchi.Rajesh · December 24, 2015, 3:56am

Hi Tahir

Is it possible to find out whether a run contains hyperlinked text, using the run and its NodeType?

Example:

<a rel="nofollow" href="http://www.google.com">hello</a>

Explanation:

Please find the document attached. “hello” is the hyperlinked word, and it takes 2 runs when we traverse the document using Asposewords.jar in Java.

Regards

Rajesh

tahir.manzoor · December 28, 2015, 3:11am

Hi Rajesh,

Thanks for your inquiry. The hyperlink is represented by the Field class. A field in a Word document is a complex structure consisting of multiple nodes that include field start, field code, field separator, field result and field end. Fields can be nested, contain rich content and span multiple paragraphs or sections in a document. The Field class is a “facade” object that provides properties and methods that allow you to work with a field as a single object.

The Start, Separator and End properties point to the field start, separator and end nodes of the field respectively. The content between the field start and separator is the field code. The content between the field separator and field end is the field result. The field code typically consists of one or more Run objects that specify instructions.
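One way to picture this is to walk a paragraph's children in order and remember whether the current position lies in the result part of a HYPERLINK field. The sketch below is a simplified plain-Java model, not Aspose.Words code. The types Item and Kind and the method hyperlinkRuns are hypothetical stand-ins for the run and field marker nodes described above, and nested fields are deliberately ignored to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HyperlinkRuns {
    enum Kind { FIELD_START, FIELD_SEPARATOR, FIELD_END, RUN }

    /** Hypothetical stand-in for one child node of a paragraph. */
    static final class Item {
        final Kind kind;
        final String text;
        Item(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    /** Returns the text of every run that lies in the result of a HYPERLINK field. */
    static List<String> hyperlinkRuns(List<Item> items) {
        List<String> found = new ArrayList<>();
        StringBuilder code = new StringBuilder();
        boolean inCode = false;
        boolean inHyperlinkResult = false;
        for (Item item : items) {
            switch (item.kind) {
                case FIELD_START:
                    inCode = true;
                    inHyperlinkResult = false;
                    code.setLength(0);
                    break;
                case FIELD_SEPARATOR:
                    inCode = false;
                    // The field code decides what kind of field this is.
                    inHyperlinkResult = code.toString().trim().startsWith("HYPERLINK");
                    break;
                case FIELD_END:
                    inCode = false;
                    inHyperlinkResult = false;
                    break;
                case RUN:
                    if (inCode) code.append(item.text);
                    else if (inHyperlinkResult) found.add(item.text);
                    break;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Item> paragraph = Arrays.asList(
            new Item(Kind.FIELD_START, ""),
            new Item(Kind.RUN, " HYPERLINK \"http://www.google.com\" "),
            new Item(Kind.FIELD_SEPARATOR, ""),
            new Item(Kind.RUN, "hel"),
            new Item(Kind.RUN, "lo"),
            new Item(Kind.FIELD_END, ""));
        System.out.println(hyperlinkRuns(paragraph));
    }
}
```

In this model the run itself carries no hyperlink flag; whether it is hyperlinked follows only from its position between the field separator and the field end.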

Hope this answers your query. Please let us know if you have any more queries.