Various Exceptions

Hi,

My system relies heavily on Aspose.Word to read Word documents. We recently ran a batch test. Out of 1400 resumes in the test, I got >173 docs that caused Aspose to throw an exception. Please help.

As the resumes are private and confidential, I can't just send all of them to you. I have to isolate these 173 docs, modify them, reproduce the error message, and then send you a few samples.

It's kind of weird. I only have OpenOffice on my desktop, and when I recreate an error doc and save it from OpenOffice in Word 2000/XP format, the error is gone. The errors are reproduced only when I edit the docs in Word on my colleague's PC.

Here are all the error messages I got, along with some sample docs. Some of the error messages are simply ridiculous. Most error messages below appear more than once for each

  1. “Cannot find node ‘?.?’ at position -1. Please report this file to word@aspose.com.” 1 doc
    “Cannot find node.doc” attached

  2. “Cannot find paragraph for the specified position. Please report this file to word@aspose.com.” 2 docs
    “Cannot find paragraph.doc” attached

  3. DetectFileFormat returns Aspose.Word.LoadFormat.FormatHtml (71 docs)
    - I reported this error before. Aspose fixed it by returning the doc type and throwing an exception.
    I checked. These are Word documents that users saved using a browser. We mean to use Aspose in place of Word to open Word documents. But if Microsoft Word and OpenOffice can open and read them correctly, why can’t Aspose? Does Aspose plan to support this in the future?

  4. DetectFileFormat returns Aspose.Word.LoadFormat.FormatRtf or FormatUnknown (66 docs)
    - Again, same thing: I can open those docs correctly in Word and OpenOffice, but not in Aspose.
    - I couldn't find anything wrong with the resumes.

  5. “Table seem to be badly formed. Please report this file to word@aspose.com.” 3 docs
    “bad form table 1.doc” attached
    “bad form table 2.doc” attached

  6. “Found a group shape that does not have a group shape container. Please report this file to word@aspose.com.” 1 doc
    “group shape error.doc” attached

  7. “There is an inline picture in the document with a type that is not yet supported.” 3 docs
    I couldn't reproduce the error. There is an image in the resume. However, when I removed the rest of the text and left only the image, the error no longer appeared. I also tried changing only the important info in the resume and resaving it in Word, and the error was not reproduced. I am not sure whether the picture is causing the error.
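
A tally like the one above can be produced with a small harness that runs every file through the reader and groups the failures by exception message. This is only a sketch: `BatchTally` and its `read` callback are hypothetical names, and the callback is assumed to wrap whatever call opens the document (for example `new Aspose.Word.Document(path).Range.Text`).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Word.
public static class BatchTally
{
    // Runs 'read' on every path and returns a map from
    // exception message to the files that triggered it.
    public static Dictionary<string, List<string>> Run(
        IEnumerable<string> paths, Action<string> read)
    {
        var failures = new Dictionary<string, List<string>>();

        foreach (string path in paths)
        {
            try
            {
                read(path);
            }
            catch (Exception ex)
            {
                List<string> files;
                if (!failures.TryGetValue(ex.Message, out files))
                {
                    files = new List<string>();
                    failures[ex.Message] = files;
                }
                files.Add(path);
            }
        }

        return failures;
    }
}
```

Keeping the file names per message makes it easy to pick one or two representative documents for each error instead of isolating all of them by hand.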

Is Aspose planning to support extracting pictures from resumes too? In my case, lots of candidates attach their picture to their resume, and I need to save all of them in the database instead of getting only the text and ignoring the picture. Maybe expose them as binary data through something like Range.Picture(index), similar to Aspose.Word.Document(File.FullName).Range.Text?

Some error messages just don't make sense to an end user. I can't release my product with these errors in it. Even if it is the “fault” of the authors, who shouldn't insert features that Aspose doesn't support, shouldn't Aspose show a more useful message, such as which line, which feature, or which paragraph, so that the user can remove it? I spent a lot of time deleting the important info in the resumes line by line, while making sure the error still occurred, so that I could send the files to you.

Am I a free beta tester for Aspose?

Thanks

Shu Yih Tay

Attachments

another attachment

asfasf

Well, nice. I just found out that I need to post 5 times to attach 5 docs, and I need to change the body to something else for every attachment to keep the admin from blocking me for posting the “same” message multiple times.

You could have just zipped them into one file and attached.

Thanks for the great collection of problem documents. We will try to address these as quickly as possible; expect most of them to be supported in 1-2 weeks.

It’s not our purpose to have customers test our products, but it is a consequence of the fact that, although the binary Word format is documented, some things are still undocumented or unclear. This article still holds true: https://docs.aspose.com/words/net/
You are lucky to have such a variety of documents prepared by different people using different editing approaches in MS Word. Thanks for submitting the issues to us and thanks for understanding.

Hi,

Thank you for your report.

Aspose.Word supports extracting images as well as other document elements. To accomplish this, use the DocumentVisitor class. It is an abstract class, so you need to derive from it and override one of its VisitXXX methods; for image extraction this is VisitInlineShapeStart. Then just process the parameter, which represents the object encountered in the document.

There’s a topic in our Programmer's Guide containing a code sample that demonstrates how to extract various graphical objects, including images, from a document. Please find it here:
https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

Thanks for your prompt reply, and sorry for my impoliteness.
I am downloading the latest version of Aspose.Word and will rerun the test. I will report whether any of the errors are fixed. I should have done this before posting, though I did read the release histories, and I don't think any of the fixes relate to the problems I reported.

Thanks again
Shu Yih

In addition to my previous post: the new object model available in Aspose.Word 3.0 makes the same task even easier. Since all the nodes are exposed, you are no longer forced to derive from DocumentVisitor. Here’s a code sample of the new approach:

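using System;
using System.IO;
using Aspose.Word;

// Verbatim string so the backslash is not treated as an escape sequence.
Document doc = new Document(@"D:\MyDocument.doc");
NodeCollection nodes = doc.GetChildNodes(NodeType.InlineShape, true);

int imageCount = 0;

foreach (Node node in nodes)
{
    // Every node in this collection is an InlineShape, so a direct cast is safe.
    InlineShape shape = (InlineShape)node;

    // Skip inline shapes that carry no image data.
    if (shape.ImageBytes != null)
    {
        string fileName = String.Format(@"D:\Image {0}.{1}", ++imageCount,
            shape.ImageFormat);

        // Dispose the stream so the file is flushed and closed.
        using (FileStream stream = File.Create(fileName))
        {
            stream.Write(shape.ImageBytes, 0, shape.ImageBytes.Length);
        }
    }
}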

Just ran all files again with latest dll 3.0.3.

“Table seem to be badly formed. Please report this file to word@aspose.com.”
This error is fixed. Thanks

The rest of the exceptions are thrown again.

An additional message is appended to some exceptions in the new version:
"For free technical support, please post this error and the file in the Aspose.Word Forums http://www.aspose.com"

Thanks

Shu Yih

For the error “DetectFileFormat returns Aspose.Word.LoadFormat.FormatHtml (71 docs)” above,

I just found out that Aspose is able to process HTML files too, so I tried renaming these files to the .html extension and processing them in Aspose. The result is similar to what I get from reading a Word file.

I would like to make use of this. However, instead of my application detecting the format, saving a copy to the hard disk, renaming the extension, and reading the file again, is it possible to encapsulate this in Aspose? That is, for files in the FormatHtml format, Aspose would internally treat them as HTML, read them, and return the result through Range.Text, transparently to my application.

Thanks

Shu Yih

All of the problem files you attached are now handled successfully by Aspose.Word 3.1.2 (to be released later today). The problems were probably caused by the fact that the files were created in OpenOffice and deviated slightly from the Word binary format, and Aspose.Word was not forgiving enough when reading them.

If you have any other binary files that cannot be read, please attach them here too.

Aspose.Word can detect DOC, RTF and HTML files, but at the moment it can read only DOC and HTML files. I know some applications create an RTF file and save it with a .DOC extension. It sounds like you might have some files like this, which is why you might sometimes get an “RTF not supported” exception while reading a .DOC file.

Sorry, I don’t understand the problem with HTML files you are talking about. The document constructor that takes a file name or a stream will always try to autodetect the format of the file and read it appropriately.
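
To check independently whether a .DOC file is really RTF or HTML under a misleading extension, one option is to look at the first few bytes of the file. The sketch below is a rough heuristic, not how Aspose.Word detects formats: `FormatSniffer` is a hypothetical helper, and it assumes binary Word files start with the OLE2 compound file signature, RTF files start with `{\rtf`, and HTML files start with `<html` or `<!DOCTYPE html`.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of Aspose.Word.
public static class FormatSniffer
{
    // Signature of an OLE2 compound file, the container used by binary .doc files.
    private static readonly byte[] Ole2Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns "doc", "rtf", "html" or "unknown" based on the leading bytes.
    public static string Sniff(byte[] head)
    {
        if (StartsWith(head, Ole2Signature))
            return "doc";

        // Skip a UTF-8 byte order mark if present, then compare as ASCII text.
        int offset = (head.Length >= 3 && head[0] == 0xEF
            && head[1] == 0xBB && head[2] == 0xBF) ? 3 : 0;
        string text = Encoding.ASCII
            .GetString(head, offset, head.Length - offset)
            .TrimStart();

        if (text.StartsWith("{\\rtf", StringComparison.Ordinal))
            return "rtf";
        if (text.StartsWith("<html", StringComparison.OrdinalIgnoreCase)
            || text.StartsWith("<!DOCTYPE html", StringComparison.OrdinalIgnoreCase))
            return "html";

        return "unknown";
    }

    // Reads up to the first 64 bytes of the file and classifies them.
    public static string SniffFile(string path)
    {
        byte[] buffer = new byte[64];
        int read;
        using (FileStream stream = File.OpenRead(path))
        {
            read = stream.Read(buffer, 0, buffer.Length);
        }

        byte[] head = new byte[read];
        Array.Copy(buffer, head, read);
        return Sniff(head);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length)
            return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
                return false;
        }
        return true;
    }
}
```

Calling `FormatSniffer.SniffFile(path)` on each failing resume should show quickly how many of them are RTF or HTML saved with a .DOC extension. HTML files that begin with comments or other markup will come back as "unknown", so treat the result as a hint only.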

Hi,

Thanks for the fix. I have downloaded it and run the test again on all the files.

I am still getting errors for the following:

  1. Cannot find paragraph for the specified position.
  2. There is an inline picture in the document with a type that is not yet supported.
    I am having trouble reproducing the error. Is there any way you can fix it without me sending you the particular Word document? Is there a list of image types that Aspose does not support?

For the HTML format, I can actually just drop the check. Aspose is processing HTML Word documents correctly. Previously, HTML and RTF were throwing exceptions, I reported the error, and Aspose came out with FileFormat. I was checking FileFormat before Range.Text and throwing my own exception. I didn't realize it now works properly for HTML.

Thanks

Shu Yih

Some of your documents are “irregular”. For example, I fixed “cannot find paragraph for the specified position” for one particular situation, but apparently the problem is more widespread. Maybe the docs were created in OpenOffice or in some other application. The point is that I need that particular document if you want a fix.