Extract TOC TOF Fields & other Elements into other Word Documents using C# or Java | Locate all Hyperlinks in DOCX & Extract them as Plain Text

Hello,

We are evaluating your Aspose.Words APIs for a project with a major US manufacturing company. We are looking for the ability to programmatically process Microsoft Word documents with sophisticated content (richly formatted text, images, drawings, charts, equations, tables, embedded objects, etc.) in order to extract fragments from source documents into separate new Word documents. The new documents should preserve the formatting and content of the original.

The document fragments intended for extraction could be identified by various criteria. For example:
• Each section identified by level 1 heading.
• A section of a document identified by a given heading.
• Each section identified by a next level heading under the given heading.
• Table of Contents (TOC)
• Table of Figures (TOF)
• Objects identified by captions
• A table

This is not a complete list, but we would like to know whether your library would enable us to perform such tasks, and whether we could get technical assistance with our POC effort.

@PF1,

Thanks for your interest in Aspose.Words APIs. Yes, you can meet all of these requirements by using Aspose.Words. We suggest you take a look at the documentation of Aspose.Words for .NET. Generally, you can use the code from the following articles to solve some of the above problems:

However, you may provide us simplified document(s) along with screenshots highlighting the target areas that you want to extract and paste into other Word documents. We will then investigate those scenarios on our end and provide you custom code to achieve the desired output.

Please let us know if you need more information; we are always glad to help you.

Thank you for your reply. In addition to the content that we need to extract into separate files, as stated in my original post, we also need to be able to locate all hyperlinks referenced in a document and extract them as plain text.

I am looking for a way to attach a sample file that would illustrate our needs.

@PF1,

Please check the following code; it will help you to process hyperlinks in a Word document:

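using System;
using Aspose.Words;
using Aspose.Words.Fields;

// Load the source document.
Document doc = new Document("E:\\Temp\\in.docx");

// Walk every field in the document and keep only the HYPERLINK fields.
foreach (Field field in doc.Range.Fields)
{
    if (field.Type == FieldType.FieldHyperlink)
    {
        FieldHyperlink hyperlink = (FieldHyperlink)field;

        // Print the visible link text and the target address as plain text.
        // Links to a bookmark inside the document use SubAddress rather than Address.
        Console.WriteLine("{0} | {1}", hyperlink.DisplayResult, hyperlink.Address);
    }
}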
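
For comparison, here is a minimal Java sketch that does not use Aspose.Words at all. It relies only on the fact that a DOCX file is a ZIP package and that external hyperlink targets used by the main document body are recorded as relationships in word/_rels/document.xml.rels. The class and method names (DocxHyperlinkTargets, parseTargets, readTargets) are hypothetical and chosen for illustration; treat this as a rough look at where the addresses are stored, not as a replacement for the field-based code above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; it uses only the Java standard library.
class DocxHyperlinkTargets {

    // Returns the Target attribute of every hyperlink relationship
    // found in the given relationships (.rels) XML stream.
    static List<String> parseTargets(InputStream relsXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(relsXml);
            NodeList rels = dom.getElementsByTagName("Relationship");
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                // Hyperlink relationship types end in "/hyperlink".
                if (rel.getAttribute("Type").endsWith("/hyperlink")) {
                    targets.add(rel.getAttribute("Target"));
                }
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse relationships XML", e);
        }
    }

    // Opens the DOCX as a ZIP archive and reads the relationships
    // that belong to the main document part.
    static List<String> readTargets(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/_rels/document.xml.rels");
            if (entry == null) {
                return new ArrayList<>(); // no relationships part, so no links
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parseTargets(in);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DocxHyperlinkTargets <file.docx>");
            return;
        }
        // One hyperlink address per line, as plain text.
        for (String target : readTargets(args[0])) {
            System.out.println(target);
        }
    }
}
```

This sketch only sees relationship-based links in the main document part. Links in headers, footers, footnotes and similar parts live in their own .rels files, and links to bookmarks inside the document or links written as HYPERLINK field codes are not stored as relationships, so they will not appear in its output.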

In the process of evaluating your libraries, we have encountered the following issues:

  • When extracting a fragment of a document into a separate document, we do not see the original file’s watermark being preserved. Instead, we see a watermark inserted by Aspose indicating that we are using a trial version. Will the original file’s watermarks be preserved once we start using the licensed version of the library and the Aspose watermarks are no longer applied?

  • When extracting a fragment of a document that contains an embedded video into a separate document, the new document appears to embed a still image instead of the video file. When we do the same with another library, the embedded video is extracted as such. Is this a defect that you intend to correct, or could it be that we are doing something wrong?

Thank you for your support.

@PF1,

If you want to test ‘Aspose.Words’ API without the evaluation/trial version limitations, then you can also request a 30-day Temporary License. Please refer to How to get a Temporary License?

If the problems still remain, please ZIP and upload here your simplified input Word document(s) and the output DOCX files generated by Aspose.Words that show the undesired behavior, along with a piece of source code (console application) for testing. We will then investigate the issues on our end and provide you with more information.