Extract text between paragraphs

Hi team,


Regarding extracting content between paragraphs, I am able to get the total paragraph count.
My requirement is to get the paragraph index/node of all shapes in a docx file.

// Gather the nodes. The GetChild method uses 0-based index
Paragraph startPara = (Paragraph)doc.getFirstSection().getChild(NodeType.PARAGRAPH, 6, true);
Paragraph endPara = (Paragraph)doc.getFirstSection().getChild(NodeType.PARAGRAPH, 10, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endPara, true);

In the above code, I need start and end nodes to extract shapes (images, charts) in docx file.

Regards
Priya Dharshini J P

Hi Priya,


Thanks for your inquiry. Images and charts are imported as Shape nodes in the Aspose.Words DOM. After extracting the content, please use the Document.getChildNodes method to get the Shape nodes from the document.

If you want to get the index of a specific node, please use the NodeCollection.indexOf method.

NodeCollection allShapes = doc.getChildNodes(NodeType.SHAPE, true);
int shapeIndex = allShapes.indexOf(shape);

If you still face problems, please share your input document and your desired output. We will then provide more information about your query, along with code.

Hi Tahir,


Thank you for the quick response. However, what I need is the index of the paragraphs immediately before and after the shape.
I want to extract the shape (image) and the image caption that follows it, using the paragraph index. Is that possible? We have an urgent requirement to extract all shapes together with their figure captions, so I need the paragraph index and not the shape index.

Thank you
Priya

Hi Priya,

Thanks for sharing the details. The figure caption (the text after the image) is in a separate Paragraph node. Please use the following code example to get the desired output. Hope this helps.

Document doc = new Document(MyDir + "Bernardo_et_al_RevisedPaper.docx");
int i = 1;
// Collect every Shape node in the document, including nested ones.
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes)
{
    // Only consider images whose paragraph is directly followed by another paragraph.
    if (shape.hasImage() && shape.getParentParagraph().getNextSibling() != null
            && shape.getParentParagraph().getNextSibling().getNodeType() == NodeType.PARAGRAPH)
    {
        // Treat the following paragraph as the caption if its text starts with "Fig".
        if (shape.getParentParagraph().getNextSibling().toString(SaveFormat.TEXT).startsWith("Fig"))
        {
            System.out.println(shape.getParentParagraph().getNextSibling().toString(SaveFormat.TEXT));
            // extractContent and generateDocument are helper methods defined elsewhere (not shown in this thread).
            ArrayList nodes = extractContent(shape.getParentParagraph(), shape.getParentParagraph().getNextSibling(), true);
            generateDocument(doc, nodes).save(MyDir + "output " + i + ".docx");
            i++;
        }
    }
}
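
The pairing rule used in the code above (an image paragraph immediately followed by a paragraph starting with "Fig") can also be modelled without the library, which makes it easy to see which paragraph indexes it selects. The following is only a minimal sketch: `Para` is a hypothetical stand-in for a paragraph (its text plus a flag saying whether it holds an image) and is not an Aspose.Words class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CaptionPairs {

    /** Hypothetical stand-in for a paragraph; not an Aspose.Words type. */
    static final class Para {
        final String text;
        final boolean hasImage;

        Para(String text, boolean hasImage) {
            this.text = text;
            this.hasImage = hasImage;
        }
    }

    /**
     * Returns the 0-based indexes of paragraphs that hold an image and are
     * immediately followed by a paragraph whose text starts with "Fig".
     */
    static List<Integer> imageParagraphsWithCaption(List<Para> paragraphs) {
        List<Integer> result = new ArrayList<>();
        // Stop one short of the end: the last paragraph has no next sibling.
        for (int i = 0; i + 1 < paragraphs.size(); i++) {
            Para current = paragraphs.get(i);
            Para next = paragraphs.get(i + 1);
            if (current.hasImage && next.text.startsWith("Fig")) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
                new Para("Introduction", false),
                new Para("", true),
                new Para("Figure 1. Overview", false),
                new Para("", true),
                new Para("Body text", false));
        // Only paragraph 1 qualifies: its caption is paragraph 2.
        System.out.println(imageParagraphsWithCaption(doc)); // prints [1]
    }
}
```

In this model the caption index is always the returned index plus one, so a start/end pair for extraction follows directly from each result.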

Thank you so much Tahir and Aspose.

Aspose always surprises with its solutions.
I am very thankful for the quick and timely response.
You have an excellent solution to every problem.
Keep going.

Regards
Priya Dharshini J P

Hi Tahir,


I have a further request: I need to delete the extracted content from the source Word document so that I can carry out further processing.

Can you provide a workaround to remove the content between the start and end nodes, as extracted by the above code, from the source Word document?

Hi Priya,

Thanks for your inquiry. We understand that you want to extract content from the document and also remove it. We suggest that you bookmark the content you want to remove or extract.

The following code example shows how to bookmark the content and remove it. Hope this helps.

Document doc = new Document(MyDir + "Bernardo_et_al_RevisedPaper.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
// First pass: wrap each image and its caption in a bookmark named "Image_<n>".
NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes)
{
    if (shape.hasImage() && shape.getParentParagraph().getNextSibling() != null
            && shape.getParentParagraph().getNextSibling().getNodeType() == NodeType.PARAGRAPH)
    {
        if (shape.getParentParagraph().getNextSibling().toString(SaveFormat.TEXT).startsWith("Fig"))
        {
            Paragraph fig = (Paragraph)shape.getParentParagraph().getNextSibling();
            // The bookmark starts just before the shape and ends at the end of the caption paragraph.
            shape.getParentParagraph().insertBefore(new BookmarkStart(doc, "Image_" + i), shape);
            fig.appendChild(new BookmarkEnd(doc, "Image_" + i));
            i++;
        }
    }
}

// Second pass: clear the content of every "Image_" bookmark.
for (Bookmark bookmark : doc.getRange().getBookmarks())
{
    if (bookmark.getName().startsWith("Image_"))
    {
        bookmark.setText("");
    }
}
doc.save(MyDir + "output.docx");
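
The code above works in two passes: it first marks each image-plus-caption range, then deletes the marked ranges. As a rough, library-free illustration of why marking first is convenient, the sketch below removes marked pairs from a plain list. The strings are hypothetical stand-ins for paragraphs, and none of this is Aspose.Words code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RemoveMarkedRanges {

    /**
     * Returns a copy of the list with each marked pair removed: the entry
     * at a start index (image paragraph) and the one after it (caption).
     * Assumes the pairs do not overlap and every start + 1 is in range.
     */
    static List<String> removePairs(List<String> paragraphs, List<Integer> starts) {
        List<String> result = new ArrayList<>(paragraphs);
        List<Integer> descending = new ArrayList<>(starts);
        // Highest index first, so a removal never shifts a pending index.
        descending.sort(Collections.reverseOrder());
        for (int start : descending) {
            result.subList(start, start + 2).clear();
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "Introduction", "<image>", "Figure 1. Overview",
                "Body text", "<image>", "Figure 2. Results");
        // Pairs start at indexes 1 and 4, as a first pass would have marked them.
        System.out.println(removePairs(doc, Arrays.asList(1, 4)));
        // prints [Introduction, Body text]
    }
}
```

Because all positions are recorded before anything is deleted, the removal step does not have to re-evaluate the matching rule against a document that is changing underneath it.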

Thank you

Regards
Priya Dharshini J P