Free Support Forum - aspose.com

HyperLink Field identification

I am sorry to have to ask this because I have this strong feeling that I have missed a basic step and have wasted a lot of time.


I have a simple Word document containing:
www.spectrags.com

I am looping through the nodes and find a Paragraph.
In the paragraph I grab the Run collection (I need to detect style changes from run to run).
I split each run into characters, looking for special control codes.

The thing I am struggling with is that the output of Paragraph.getText() gives me
? HYPERLINK "http://www.spectrags.com" ?www.spectrags.com? (the ? marks indicate field characters)

but the combined run.getText() gives
HYPERLINK "http://www.spectrags.com" www.spectrags.com (no field characters)

I have obviously missed something :(

Can anyone help, please?

Thanks

Jan

Hi Jan,


Thanks for your inquiry. The FieldStart class represents the start of a Word field in a document. A complete field in a Microsoft Word document is a complex structure consisting of a field start character, a field code, a field separator character, a field result and a field end character. Some fields have only a field start, a field code and a field end.

The Paragraph.getText method returns the text of the paragraph, including the end-of-paragraph character. The returned string includes all control and special characters described in ControlChar. Run.getText() returns only the text of that run. All text in the document is stored in runs.
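
To make the field structure concrete, here is a minimal sketch in plain Java (no Aspose.Words calls) that splits a paragraph text string into the field code and the field result. It assumes the field start, separator and end characters are 0x13, 0x14 and 0x15, as reported later in this thread. The class and method names are hypothetical, and nested fields are not handled.

```java
public class FieldTextParser {
    // Control characters that delimit a field inside paragraph text
    // (values taken from the getText() output reported in this thread).
    static final char FIELD_START = (char) 0x13;
    static final char FIELD_SEPARATOR = (char) 0x14;
    static final char FIELD_END = (char) 0x15;

    /** Returns the trimmed code of the first field, or null if no complete field is found. */
    public static String firstFieldCode(String paragraphText) {
        int start = paragraphText.indexOf(FIELD_START);
        if (start < 0) {
            return null;
        }
        int end = paragraphText.indexOf(FIELD_END, start);
        if (end < 0) {
            return null; // field start without a matching end
        }
        int sep = paragraphText.indexOf(FIELD_SEPARATOR, start);
        // Some fields have no separator; the code then runs up to the field end.
        int codeEnd = (sep >= 0 && sep < end) ? sep : end;
        return paragraphText.substring(start + 1, codeEnd).trim();
    }

    /** Returns the result of the first field, or an empty string if it has none. */
    public static String firstFieldResult(String paragraphText) {
        int start = paragraphText.indexOf(FIELD_START);
        if (start < 0) {
            return "";
        }
        int end = paragraphText.indexOf(FIELD_END, start);
        int sep = paragraphText.indexOf(FIELD_SEPARATOR, start);
        if (end < 0 || sep < 0 || sep > end) {
            return "";
        }
        return paragraphText.substring(sep + 1, end);
    }

    public static void main(String[] args) {
        // Hand-built stand-in for what Paragraph.getText() returns for the sample document.
        String text = FIELD_START + " HYPERLINK \"http://www.spectrags.com\" "
                + FIELD_SEPARATOR + "www.spectrags.com" + FIELD_END + "\r";
        System.out.println("code:   " + firstFieldCode(text));
        System.out.println("result: " + firstFieldResult(text));
    }
}
```

The string passed in main is built by hand; in real code it would come from Paragraph.getText().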

In your case, you can identify the HYPERLINK field as shown in the following code snippet.

Document doc = new Document(MyDir + "HyperLink.docx");
NodeCollection fields = doc.getChildNodes(NodeType.FIELD_START, true);
for (FieldStart fStart : (Iterable<FieldStart>) fields)
{
    if (fStart.getFieldType() == FieldType.FIELD_HYPERLINK)
    {
        // getFieldCode returns: HYPERLINK "http://www.spectrags.com/"
        System.out.println(getFieldCode(fStart));

        Paragraph para = fStart.getParentParagraph();

        // getText returns: HYPERLINK "http://www.spectrags.com/" _www.spectrags.com
        System.out.println(para.getText());

        // Paragraph.toString returns: www.spectrags.com
        System.out.println(para.toString(SaveFormat.TEXT));
    }
}

private static String getFieldCode(FieldStart fieldStart) throws Exception
{
    StringBuilder builder = new StringBuilder();
    // Walk forward from the field start until the separator or the end of the field.
    for (Node node = fieldStart;
         node != null && node.getNodeType() != NodeType.FIELD_SEPARATOR &&
         node.getNodeType() != NodeType.FIELD_END;
         node = node.nextPreOrder(node.getDocument()))
    {
        // Use text only of Run nodes to avoid duplication.
        if (node.getNodeType() == NodeType.RUN)
            builder.append(node.getText());
    }
    return builder.toString();
}

Please read the following documentation links for reference, and let us know if you have any more queries.
http://www.aspose.com/docs/display/wordsjava/FieldStart
http://www.aspose.com/docs/display/wordsjava/FieldEnd


Hi,


If I have confused the issue then I apologise, but your examples, and indeed the documentation examples, do not illustrate how I detect fields when decoding a paragraph's runs.
Here is an example function that is passed a paragraph node:

private void processRuns(Paragraph para) {
    RunCollection paraRuns = para.getRuns();
    Iterator<Run> runList = paraRuns.iterator();
    while (runList.hasNext()) {
        Run thisRun = runList.next();
        System.out.println("[" + thisRun.getText() + "]");
    }
}

This will not reveal any field control codes, but System.out.println("[" + para.getText() + "]"); will?

My code needs to process a paragraph's text by its runs, as it has to compare fonts between runs, but I cannot see how the fields are detected.

Cheers
Jan

Hi Jan,


Thanks for your inquiry. Please note that the Paragraph.getText method returns the text of the paragraph, including the end-of-paragraph character. The returned string includes all control and special characters described in ControlChar.

Could you please attach your input Word document here for testing? I will investigate the issue on my side and provide you more information.

Hi


Attached is the Word file, as requested.

Thanks for your continued help

Jan

Hi Jan,


Thanks for sharing the document.
JanUrbanski:
The thing I am struggling with is that the output of Paragraph.getText() gives me
? HYPERLINK "http://www.spectrags.com" ?www.spectrags.com? (the ? marks indicate field characters)

but the combined run.getText() gives
HYPERLINK "http://www.spectrags.com" www.spectrags.com (no field characters)

Please note that the Paragraph.getText method returns the text of the paragraph including the end-of-paragraph character, whereas Run.getText returns plain text only.

The text of all child nodes is concatenated and the end of paragraph character is appended as follows:

  • If the paragraph is the last paragraph of Body, then ControlChar.SectionBreak (\x000c) is appended.
  • If the paragraph is the last paragraph of Cell, then ControlChar.Cell (\x0007) is appended.
  • For all other paragraphs ControlChar.ParagraphBreak (\r) is appended.
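
As an illustration of the three cases above, here is a small plain-Java sketch that strips the trailing end-of-paragraph character from a paragraph text string before further processing. The class and method names are hypothetical. The character values are the ones listed above: 0x0C for a section break, 0x07 for a cell, and \r for a paragraph break.

```java
public class ParagraphTerminator {
    // End-of-paragraph characters as listed in the reply above.
    static final char SECTION_BREAK = (char) 0x0C;
    static final char CELL = (char) 0x07;
    static final char PARAGRAPH_BREAK = '\r';

    /** Removes one trailing end-of-paragraph character, if present. */
    public static String stripTerminator(String paragraphText) {
        if (paragraphText.isEmpty()) {
            return paragraphText;
        }
        char last = paragraphText.charAt(paragraphText.length() - 1);
        if (last == SECTION_BREAK || last == CELL || last == PARAGRAPH_BREAK) {
            return paragraphText.substring(0, paragraphText.length() - 1);
        }
        return paragraphText;
    }

    public static void main(String[] args) {
        // Hand-built stand-ins for Paragraph.getText() results.
        System.out.println("[" + stripTerminator("www.spectrags.com" + PARAGRAPH_BREAK) + "]");
        System.out.println("[" + stripTerminator("last in cell" + CELL) + "]");
        System.out.println("[" + stripTerminator("last in body" + SECTION_BREAK) + "]");
    }
}
```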

I did not find any of the "?" field markers when running the following code snippet. Please see the output in the attached image.



Document doc = new Document(MyDir + "emphBug.docx");
NodeCollection paras = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph para : (Iterable<Paragraph>) paras)
{
    System.out.println("Para : " + para.getText());
    System.out.print("Run : ");
    processRuns(para);
}

public static void processRuns(Paragraph para) {
    RunCollection paraRuns = para.getRuns();
    Iterator<Run> runList = paraRuns.iterator();
    while (runList.hasNext()) {
        Run thisRun = runList.next();
        System.out.print(thisRun.getText());
    }
}

In case you are using an older version of Aspose.Words, I suggest upgrading to the latest version (v13.1.0). If you still face the problem, please share the following information for further investigation.

What environment are you running on?
  • OS (Windows Version or Linux Version)
  • Architecture (32 / 64 bit)
  • JDK version


Hi,
I think we have confused each other here.
Using para.getText() gives:

Para : (0x13) HYPERLINK "http://www.spectrags.com" (0x14)www.spectrags.com(0x15)
Note the (0x13), (0x14) and (0x15) control characters.


Iterating through the runs with getText() gives:
HYPERLINK "http://www.spectrags.com" www.spectrags.com
Note: no (0x13), (0x14) or (0x15).

Is there a way to catch the control codes whilst iterating through the runs with getText()?
Hopefully this clears up the issue.

Thanks

Jan

Windows 64 bit using Eclipse EE and jre7


Hi Jan,


Thanks for sharing the details. No, the text returned by the Run.getText method does not contain the field control characters defined in ControlChar (FIELD_START_CHAR, FIELD_END_CHAR, FIELD_SEPARATOR_CHAR).

The Paragraph.getText method returns the text of the paragraph, including all control and special characters described in ControlChar.

Moreover, a field in a Word document is a complex structure consisting of multiple nodes that include field start, field code, field separator, field result and field end. The Start, Separator and End properties point to the field start, separator and end nodes of the field respectively.
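
Since the field start, separator and end are separate nodes that sit between the runs, one way to keep per-run font information and still see the field boundaries is to walk all inline children of the paragraph instead of only its runs. The sketch below shows that idea in plain Java. The Inline and Kind types are hypothetical stand-ins, not Aspose.Words classes, so that it runs without the library, and the font names are made up for the example. With Aspose.Words the same loop would presumably step through the paragraph's child nodes and compare getNodeType() against NodeType.FIELD_START, NodeType.FIELD_SEPARATOR, NodeType.FIELD_END and NodeType.RUN; that mapping is an assumption and has not been tested against the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldAwareRunWalk {
    /** Hypothetical node kinds, mirroring the node types named in this thread. */
    public enum Kind { FIELD_START, FIELD_SEPARATOR, FIELD_END, RUN }

    /** Hypothetical stand-in for one inline child of a paragraph (not an Aspose class). */
    public static final class Inline {
        final Kind kind;
        final String text;
        final String fontName; // only meaningful for RUN nodes in this sketch

        public Inline(Kind kind, String text, String fontName) {
            this.kind = kind;
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Labels each child as a field marker, field code, field result or ordinary text. */
    public static List<String> describe(List<Inline> children) {
        List<String> out = new ArrayList<>();
        boolean inCode = false;   // between field start and separator
        boolean inResult = false; // between field separator and field end
        for (Inline node : children) {
            switch (node.kind) {
                case FIELD_START:
                    inCode = true;
                    out.add("<field start>");
                    break;
                case FIELD_SEPARATOR:
                    inCode = false;
                    inResult = true;
                    out.add("<field separator>");
                    break;
                case FIELD_END:
                    inCode = false;
                    inResult = false;
                    out.add("<field end>");
                    break;
                default: // RUN
                    String role = inCode ? "code" : (inResult ? "result" : "text");
                    out.add(role + "[" + node.fontName + "]: " + node.text);
                    break;
            }
        }
        return out;
    }

    /** Hand-built children resembling the hyperlink paragraph discussed in this thread. */
    public static List<Inline> sample() {
        return Arrays.asList(
                new Inline(Kind.FIELD_START, "", null),
                new Inline(Kind.RUN, " HYPERLINK \"http://www.spectrags.com\" ", "Calibri"),
                new Inline(Kind.FIELD_SEPARATOR, "", null),
                new Inline(Kind.RUN, "www.spectrags.com", "Times New Roman"),
                new Inline(Kind.FIELD_END, "", null));
    }

    public static void main(String[] args) {
        for (String line : describe(sample())) {
            System.out.println(line);
        }
    }
}
```

Tracking the two flags while walking lets the font comparison skip the runs that hold the field code and compare only the visible result text.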