Link text not coming correctly from paragraph

anishv3 · August 17, 2018, 11:36am

Hi Team,

I’m trying to split the data from a Word document by paragraph. When I call the GetText() method on a Paragraph object, I get the entire URL plus some odd control characters. For example, for the link text ‘Google’ with the hyperlink ‘https://www.google.com’, GetText() returns the following:
“\u0013 HYPERLINK "https://www.google.com" \u0014Google\u0015\r”
Please guide me on how to get the correct link text from the paragraph.

regards,
Anish

tahir.manzoor · August 17, 2018, 6:56pm

@anishv3,

Thanks for your inquiry. Please use the FieldHyperlink.Result property to get the link text. We also suggest using the Node.ToString(SaveFormat.Text) method to get the text of a paragraph.

The Node.GetText method returns the text of this node and of all its children. The returned string includes all control and special characters as described in ControlChar.
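
For illustration, the control characters in the string quoted in the question follow a fixed pattern: \u0013 opens a field, \u0014 separates the field code from the field result, and \u0015 closes the field. The following is a minimal sketch in plain Java that assumes only this pattern; the class and method names are hypothetical and not part of Aspose.Words. It drops the field codes from such a raw string and keeps the visible text.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Aspose.Words: it works on the raw string only.
final class FieldTextCleaner {
    // Field control characters (U+0013, U+0014, U+0015) seen in the raw text.
    private static final char FIELD_START = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drops field codes; keeps field results and ordinary text.
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while still inside that field's code.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                inCode.push(true);
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                    inCode.push(false); // now inside the field result
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                }
            } else if (!inCode.contains(true)) {
                out.append(c); // keep text that is not part of any field code
            }
        }
        // trim() removes the trailing paragraph mark and surrounding whitespace.
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String raw = "\u0013 HYPERLINK \"https://www.google.com\" \u0014Google\u0015\r";
        System.out.println(visibleText(raw)); // prints "Google"
    }
}
```

The stack is there so that nested fields are handled: text is kept only when no enclosing field is still in its code part. Where the library is available, Node.ToString(SaveFormat.Text) remains the simpler route.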

anishv3 · August 20, 2018, 1:10pm

I’m not able to find the above properties. I’m using the Paragraph object, not Node. Also, how should FieldHyperlink.Result be used? I’m reading the entire paragraph, not just the link text. Could you please provide a simple code sample for this?

I’m trying to create a sample PoC using the code samples given on the Aspose site itself. Please see the code snippet I’m trying with.

using System;
using System.Collections;
using System.IO;

using Aspose.Words;
using Aspose.Words.Tables;
using Aspose.Words.Fields;
using Aspose.Words.Layout;
using Aspose.Words.Drawing;
using System.Drawing;
using System.Drawing.Imaging;
using System.Drawing.Drawing2D;
using System.Diagnostics;
using System.Collections.Generic;

namespace Aspose.Words.Examples.CSharp.Programming_Documents.Working_with_Styles
{
class ExtractContentBasedOnStyles
{
    public static void Run()
    {
        // ExStart:ExtractContentBasedOnStyles
        // The path to the documents directory.
        string dataDir = RunExamples.GetDataDir_WorkingWithStyles();
        string fileName = "A leader in the industry.docx"; // "TestFile.doc";
        // Open the document.
        Document doc = new Document(dataDir + fileName);

        // Define style names as they are specified in the Word document.
        const string paraStyle = "Heading 2";
        //const string runStyle = "Body Text"; //"Intense Emphasis";

        // Collect paragraphs with defined styles. 
        // Show the number of collected paragraphs and display the text of these paragraphs.
        ArrayList paragraphs = ParagraphsByStyleName(doc, paraStyle);
        Console.WriteLine(string.Format("Paragraphs with \"{0}\" styles ({1}):", paraStyle, paragraphs.Count));
        foreach (Paragraph paragraph in paragraphs)
        {
            Console.Write(paragraph.ToString(SaveFormat.Text));
            Console.WriteLine(paragraph.ChildNodes.ToString());
        }
        // Collect runs with defined styles. 
        // Show the number of collected runs and display the text of these runs.
        ////ArrayList runs = RunsByStyleName(doc, runStyle);
        ////Console.WriteLine(string.Format("\nRuns with \"{0}\" styles ({1}):", runStyle, runs.Count));
        ////foreach (Run run in runs)
        ////{
        ////    Console.WriteLine(run.Range.Text);
        ////    Console.WriteLine(run.ParentParagraph.Range.Text);
        ////}
        // ExEnd:ExtractContentBasedOnStyles
        Console.WriteLine("\nExtracted contents based on styles successfully.");
    }
    // ExStart:ParagraphsByStyleName
    public static ArrayList ParagraphsByStyleName(Document doc, string styleName)
    {
        // Create an array to collect paragraphs of the specified style.
        ArrayList paragraphsWithStyle = new ArrayList();
        // Get all paragraphs from the document.
        NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
        // Look through all paragraphs to find those with the specified style.


        List<Paragraph> tempParagraphs = new List<Paragraph>();


        foreach (Paragraph paragraph in paragraphs)
        {
            if (paragraph.GetText() != "\r")
            {
                tempParagraphs.Add(paragraph);

                if (paragraph.ParagraphFormat.Style.Name == styleName)
                    paragraphsWithStyle.Add(paragraph);
            }
        }
        return paragraphsWithStyle;
    }
    // ExEnd:ParagraphsByStyleName
    // ExStart:RunsByStyleName
    public static ArrayList RunsByStyleName(Document doc, string styleName)
    {
        // Create an array to collect runs of the specified style.
        ArrayList runsWithStyle = new ArrayList();
        // Get all runs from the document.
        NodeCollection runs = doc.GetChildNodes(NodeType.Run, true);
        // Look through all runs to find those with the specified style.
        foreach (Run run in runs)
        {
            if (run.Font.Style.Name == styleName)
                runsWithStyle.Add(run);
        }
        return runsWithStyle;
    }
    // ExEnd:RunsByStyleName
}

}

tahir.manzoor · August 20, 2018, 3:30pm

@anishv3

Thanks for your inquiry. Please use the latest version of Aspose.Words for .NET, 18.8. The following code example shows how to get the hyperlink text. You can use either the Field.Result or the FieldHyperlink.Result property, as shown below. Hope this helps.

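// This document contains a hyperlink field.
Document doc = new Document(MyDir + "in.docx");

// Field.Result returns the displayed text of a field.
Field field = doc.Range.Fields[0];
Console.WriteLine(field.Result);

// Cast to FieldHyperlink to also read the link target.
// The cast assumes the first field in the document is a hyperlink.
FieldHyperlink hyperlink = (FieldHyperlink)doc.Range.Fields[0];
Console.WriteLine("Hyperlink text " + hyperlink.Result);
Console.WriteLine("Hyperlink address " + hyperlink.Address);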

anishv3 · August 21, 2018, 2:38pm

Hi,

Thanks for your quick response. I probably phrased the question badly. What I’m actually looking for is a way to pull all the text, including hyperlinks, that falls under a given heading style, say “Header 1”.
My expectation was that a function such as GetText() would return the text with the hyperlinks as they appear in the original document.

I understand that FieldHyperlink pulls the individual links, but my question is whether there is a function that returns the text with the hyperlinks intact.

Thanks,
Anish

tahir.manzoor · August 21, 2018, 3:24pm

@anishv3

Thanks for your inquiry. Please ZIP and attach your input Word document along with the expected output. We will then provide more information about your query, along with a code example.

anishv3 · August 28, 2018, 8:46am

Hi,

I’m attaching both documents for your reference.

MainPage.docx is the input document and SplittedPage.docx is the output. We should be able to extract certain paragraphs based on their heading styles and create a separate document from them. Here, as in the example, we need to create a separate document containing the paragraphs whose heading style is ‘Heading 2’, without affecting their formatting.
Aspose word.zip (23.4 KB)

Regards,
Anish

tahir.manzoor · August 28, 2018, 3:19pm

@anishv3

Thanks for sharing the details. In your case, we suggest the following solution.

1. Iterate over all paragraphs. The CompositeNode.GetChildNodes method returns a live collection of child nodes that match the specified type.
2. Get each paragraph’s style using the ParagraphFormat.StyleName property. If it is “Heading 2”, go to step 3.
3. Iterate over the following paragraphs and import them into another document until you reach a paragraph whose ParagraphFormat.IsHeading property is true.
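
The grouping logic in the steps above can be sketched without the library. This is only an illustration under stated assumptions: Para is a hypothetical stand-in for a paragraph that carries the values ParagraphFormat.StyleName and ParagraphFormat.IsHeading would give, and sectionsUnder is a made-up name. In real code, each collected paragraph would be imported into a destination document rather than added to a list.

```java
import java.util.ArrayList;
import java.util.List;

// Library-independent model of the steps; Para is a hypothetical stand-in
// for an Aspose.Words paragraph, not a real API type.
final class HeadingSections {
    static final class Para {
        final String text;
        final String styleName;
        final boolean isHeading;

        Para(String text, String styleName, boolean isHeading) {
            this.text = text;
            this.styleName = styleName;
            this.isHeading = isHeading;
        }
    }

    // Returns one list per heading with the given style: the heading itself
    // followed by every paragraph up to (not including) the next heading.
    static List<List<Para>> sectionsUnder(List<Para> paragraphs, String styleName) {
        List<List<Para>> sections = new ArrayList<>();
        List<Para> current = null; // non-null while inside a matching section
        for (Para p : paragraphs) {
            if (p.isHeading) {
                current = null; // any heading closes the section in progress
                if (p.styleName.equals(styleName)) {
                    current = new ArrayList<>(); // a matching heading opens a new one
                    sections.add(current);
                }
            }
            if (current != null) {
                current.add(p);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Para> doc = new ArrayList<>();
        doc.add(new Para("Overview", "Heading 1", true));
        doc.add(new Para("Products", "Heading 2", true));
        doc.add(new Para("Visit our site.", "Normal", false));
        doc.add(new Para("Contact", "Heading 1", true));
        for (List<Para> section : sectionsUnder(doc, "Heading 2")) {
            System.out.println(section.get(0).text + ": " + section.size() + " paragraph(s)");
        }
    }
}
```

Note the design choice: this sketch ends a section at any heading, as step 3 describes. If sub-headings should stay inside their “Heading 2” section, the closing condition would need to compare heading levels instead.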

anishv3 · August 30, 2018, 8:30am

Could you share sample code for the above, as requested earlier?

tahir.manzoor · August 30, 2018, 4:32pm

@anishv3

Thanks for your inquiry. Please use the following code example to extract content based on the “Heading 2” style. It relies on the extractContent and generateDocument methods; please read the following article and take their code from there.
Extract Selected Content Between Nodes

// Needed at the top of the enclosing class file:
// import com.aspose.words.*;
// import java.util.ArrayList;

public static void extractHeadingContent() throws Exception {
    Document doc = new Document(MyDir + "input.docx");
    DocumentBuilder builder = new DocumentBuilder(doc);
    int i = 1; // index of the next bookmark to create

    // Mark the start of every "Heading 2" section with an empty, bookmarked paragraph.
    NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
    for (Paragraph para : (Iterable<Paragraph>) nodes) {
        if (para.getParagraphFormat().isHeading() && para.getParagraphFormat().getStyleName().equals("Heading 2")) {
        //if (para.getParagraphFormat().isHeading()) {
            Paragraph paragraph = new Paragraph(doc);
            para.getParentNode().insertBefore(paragraph, para);
            builder.moveTo(paragraph);
            builder.startBookmark("bm_extractcontents" + i);
            builder.endBookmark("bm_extractcontents" + i);
            i++;
        }
    }

    // Add a final bookmark at the end of the document to close the last section.
    builder.moveToDocumentEnd();
    builder.startBookmark("bm_extractcontents" + i);
    builder.endBookmark("bm_extractcontents" + i);

    // Save the content between each pair of consecutive bookmarks as its own document.
    for (int bm = 1; bm < i; bm++) {
        BookmarkStart bookmarkStart = doc.getRange().getBookmarks().get("bm_extractcontents" + bm).getBookmarkStart();
        // The start of the next bookmark marks the end of the current section.
        BookmarkStart bookmarkEnd = doc.getRange().getBookmarks().get("bm_extractcontents" + (bm + 1)).getBookmarkStart();
        // extractContent and generateDocument come from the article mentioned above.
        ArrayList extractedNodes = ExtractContents.extractContent(bookmarkStart, bookmarkEnd, false);
        Document dstDoc = ExtractContents.generateDocument(doc, extractedNodes);
        dstDoc.save(MyDir + bm + "_out.docx");
    }
}