Extraction of text

Hi,


I have started working with your Diagram jar, and I need to extract all the text from a .vdx or .vsd document.
Can you help me and tell me how I can do that?

I went through your API, but it would be helpful if you could give me an algorithm for extracting all the text…

I have attached a sample document.

Thanks
Best regards

Hi Djordje,


Thank you for contacting support. Please use the sample code below to get the text from a particular page of the Visio diagram. Similarly, we can iterate through each page:

[Java]
import com.aspose.diagram.Diagram;
import com.aspose.diagram.Shape;

// load the Visio diagram
Diagram diagram = new Diagram("C:/temp/SonyUSOnlineCF_Draft_20141218_Spanish.vdx");

StringBuilder outText = new StringBuilder();

// collect the text of every top-level shape on the page named "1_Intro"
for (Shape shape : (Iterable<Shape>) diagram.getPages().getPage("1_Intro").getShapes())
{
    String shapeText = shape.getText().getValue().getText();
    // strip the markup tags embedded in the shape text
    outText.append(shapeText.replaceAll("<.*?>", ""));
}
System.out.println(outText);

We hope this helps.
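To illustrate the remark about iterating through each page, here is a minimal, library-free sketch. It does not call Aspose.Diagram: the map of page names to raw shape texts is a hypothetical stand-in for the pages and shapes of a loaded diagram, and the class name, method name and the page name "2_Details" are made up for illustration. Unlike the sample above, it joins shape texts with newlines so that adjacent labels do not run together.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: page name -> raw text of each shape on that page.
// With Aspose.Diagram these values would come from the loaded Diagram instead.
class AllPagesTextSketch {

    // Returns one entry per page; the shape texts of a page are joined by newlines.
    static Map<String, String> textByPage(Map<String, List<String>> pages) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> page : pages.entrySet()) {
            StringBuilder pageText = new StringBuilder();
            for (String shapeText : page.getValue()) {
                if (shapeText == null || shapeText.isEmpty()) {
                    continue; // shapes without text contribute nothing
                }
                if (pageText.length() > 0) {
                    pageText.append('\n');
                }
                pageText.append(shapeText);
            }
            result.put(page.getKey(), pageText.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pages = new LinkedHashMap<>();
        pages.put("1_Intro", Arrays.asList("Network", "", "Server"));
        pages.put("2_Details", Arrays.asList("Router")); // assumed page name
        System.out.println(textByPage(pages));
    }
}
```

Keeping the text per page, rather than in one long string, makes it easier to see which page a missing piece of text should have come from.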

Hi,


Thanks for the answer. I had understood from your API that I can extract text this way, but things like shape.getShapes() make me think that there can be grouped or nested shapes, so it would be helpful if you could give me an algorithm that also covers, let's say, the “special cases” of text extraction.

Thanks,
Best regards

Hi,


One more thing: when I extract with this algorithm I get output like this: Network. “Network” is text from the diagram and that is fine, but the output also contains a tag that I cannot see in the document. Why is it showing?

Thanks

Hi Djordje,


Thank you for the inquiry. We can recursively get the text of all sub shapes too. Please use the following sample code:

[Java]
import com.aspose.diagram.Diagram;
import com.aspose.diagram.License;
import com.aspose.diagram.Page;
import com.aspose.diagram.Shape;
import com.aspose.diagram.TypeValue;

public class TestCls {
    static String text = "";

    public static void main(String[] args) throws Exception
    {
        License licDiagram = new License();
        licDiagram.setLicense("C:\\Aspose.Total.Java.lic");
        Diagram diagram = new Diagram("C:/temp/SonyUSOnlineCF_Draft_20141218_Spanish.vdx");

        Page page = diagram.getPages().getPage("1_Intro");
        for (Shape shape : (Iterable<Shape>) page.getShapes())
        {
            GetShapeText(shape);
        }
        System.out.println(TestCls.text);
    }

    // append the text of a plain shape, or descend into the sub-shapes of a group
    static void GetShapeText(Shape shape)
    {
        if (shape.getType() != TypeValue.GROUP)
            TestCls.text += shape.getText().getValue().getText().replaceAll("<.*?>", "");
        else
            for (Shape subshape : (Iterable<Shape>) shape.getShapes())
            {
                GetShapeText(subshape);
            }
    }
}
djordje:
One more thing: when I extract with this algorithm I get output like this: Network. “Network” is text from the diagram and that is fine, but the output also contains a tag that I cannot see in the document. Why is it showing?
Please note that the Shape element contains an element called Text, which holds the characters of the text together with special elements (cp, pp, tp, and fld) that mark the end of one run and the beginning of the next. The Char element contains the formatting attributes for the shape’s text, such as font, color, text style, case, position relative to the baseline, and point size. In your case, we can strip these elements with a regular expression, as the replaceAll call in the sample code does.

We hope this helps.
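As a small illustration of the tag stripping described above, the following sketch applies the same non-greedy pattern to a sample string. It uses only the Java standard library; the class name, the method name and the sample markup (including the IX attribute) are assumptions made for illustration, not output captured from the attached diagram.

```java
import java.util.regex.Pattern;

class TagStripSketch {

    // Non-greedy match: a '<', then as few characters as possible, then the next '>'.
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Removes every tag-like run and leaves the plain characters untouched.
    static String stripTags(String rawText) {
        return TAG.matcher(rawText).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample markup; the real text of a shape may look different.
        System.out.println(stripTags("<cp IX='0'/>Network"));
        System.out.println(stripTags("<pp IX='0'/><cp IX='0'/>Router <fld IX='0'>1</fld>"));
    }
}
```

Note that this pattern also removes any literal text that happens to sit between a '<' and a later '>' on the same line, so it is a pragmatic clean-up rather than a real parser.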

Hi,


This was very helpful, thank you very much.

Best regards

Hi,


I have a new question. Because I cannot save a .vsd document, I converted the attached document to .vdx and then extracted the text using the algorithm that you gave me a few posts earlier (which works for the .vdx documents I have tested so far), but for the attached document it doesn’t work. Most of the text isn’t extracted.
Do you know why, and can you help me?

Hi Djordje,

Thank you for contacting support. This happens because most of the group shapes are protected at the ShapeSheet level. Please check this help topic: http://support.microsoft.com/kb/305343?wa=wsignin1.0

However, we have also noticed that the getPage method of the Diagram class throws a null pointer error when a page name is passed as the parameter. It works fine with a page index. We have logged this issue under ticket ID DIAGRAMJAVA-50140 in our issue tracking system. We’ll keep you informed of any available updates. We’re sorry for the inconvenience caused.

Please feel free to reply to us in case of any confusion or questions.

Hi Djordje,


Thank you for being patient. We have good news for you: the issue DIAGRAMJAVA-50140 has now been resolved. If no problem is found in the QA phase, the fix will be included in the next version, Aspose.Diagram for Java 5.1.0. We’ll inform you via this forum thread as soon as the new release is published.

Hi,

I need help with text from stencil shapes. I have attached a document with stencil shapes and I need to extract the text from them. Is that possible, and how can I do it?

Thanks
Best regards

Hi Djordje,


Thank you for the inquiry. Please use the sample code below:

[Java]
import com.aspose.diagram.Diagram;
import com.aspose.diagram.License;
import com.aspose.diagram.Page;
import com.aspose.diagram.Shape;
import com.aspose.diagram.TypeValue;

public class TestCls {
    static String text = "";

    public static void main(String[] args) throws Exception
    {
        License licDiagram = new License();
        licDiagram.setLicense("C:\\Aspose.Total.Java.lic");
        Diagram diagram = new Diagram("C:/temp/dr_qvs_DeploymentSource1.vsd");

        Page page = diagram.getPages().getPage("Single server setup");
        for (Shape shape : (Iterable<Shape>) page.getShapes())
        {
            GetShapeText(shape);
        }
        System.out.println(TestCls.text);
    }

    static void GetShapeText(Shape shape)
    {
        // every shape, including a group itself, may carry its own text
        TestCls.text += shape.getText().getValue().getText().replaceAll("<.*?>", "");

        // for image shapes
        if (shape.getType() == TypeValue.FOREIGN)
            TestCls.text += shape.getName();

        // for group shapes
        if (shape.getType() == TypeValue.GROUP)
            for (Shape subshape : (Iterable<Shape>) shape.getShapes())
            {
                GetShapeText(subshape);
            }
    }
}

Please feel free to reply to us in case of any confusion or questions.
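The traversal order used by the sample above (own text first, then the name of an image shape, then the sub-shapes of a group) can be tried out without the library. The sketch below is only an illustration under stated assumptions: Node is a hypothetical stand-in for a diagram shape, not the Aspose Shape class, and the kind labels, shape names and sample markup are made up. It collects the pieces into a list instead of concatenating onto a static string, which keeps the individual texts separate.

```java
import java.util.ArrayList;
import java.util.List;

class ShapeTextWalkSketch {

    // Hypothetical stand-in for a diagram shape; not the Aspose Shape class.
    static final class Node {
        final String kind;      // "SHAPE", "GROUP" or "FOREIGN" (assumed labels)
        final String name;
        final String rawText;   // text that may still contain markup tags
        final List<Node> children = new ArrayList<>();

        Node(String kind, String name, String rawText) {
            this.kind = kind;
            this.name = name;
            this.rawText = rawText;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first walk: own text, then the name of image shapes, then sub-shapes of groups.
    static void collect(Node node, List<String> out) {
        String text = node.rawText.replaceAll("<.*?>", "");
        if (!text.isEmpty()) {
            out.add(text);
        }
        if (node.kind.equals("FOREIGN")) {
            out.add(node.name);
        }
        if (node.kind.equals("GROUP")) {
            for (Node child : node.children) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node("GROUP", "Rack", "<cp IX='0'/>Rack A")
                .add(new Node("SHAPE", "Box", "<cp IX='0'/>Server"))
                .add(new Node("GROUP", "Inner", "")
                        .add(new Node("FOREIGN", "Logo.png", "")));
        List<String> out = new ArrayList<>();
        collect(root, out);
        System.out.println(String.join("\n", out));
    }
}
```

Because a group is visited for its own text before its children, a label placed on the group itself is not lost, which is the difference from the earlier version that only read text from non-group shapes.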

The issues you have found earlier (filed as DIAGRAMJAVA-50140) have been fixed in this update.



Hi,


A few new bugs have been reported to me concerning the previously mentioned extraction of text from protected shapes.
Can you please find a way for me to extract this text without ungrouping and unlocking the shapes?
Thanks,
Best regards

Hi Djordje,


Thank you for posting the problematic Visio diagram. We’re sorry to tell you that we could not unzip it. It looks like a damaged zip file. Please check it and attach it again in this forum thread. We’ll start investigating as soon as it is available to us.

Hi,

Here it is again.

Thanks

Hi Djordje,


Thank you for contacting support. Did you test against the latest version, Aspose.Diagram for Java 5.1.0? We tested with the sample code that we shared in our earlier post. It works as expected and we can extract all of the shape text. We have attached the output text file for your reference. If you think some text is missing, please share the details. We’ll check and answer you accordingly.