Hello, how can I achieve this if I need to go through all paragraphs of the entire file?
@supeiwei You should simply do the same in the loop:
Document doc = new Document(@"C:\Temp\in.docx");
int tmpBkIndex = 0;
foreach (Paragraph p in doc.GetChildNodes(NodeType.Paragraph, true))
{
// LayoutCollector and LayoutEnumerator work with nodes only in the main document body.
if (p.GetAncestor(NodeType.HeaderFooter) != null)
continue;
if (p.GetAncestor(NodeType.Shape) != null)
continue;
string tmpBkName = $"_tmp_{tmpBkIndex++}";
// Wrap the paragraph into a temporary bookmark.
p.PrependChild(new BookmarkStart(doc, tmpBkName));
p.AppendChild(new BookmarkEnd(doc, tmpBkName));
}
// Use LayoutCollector and LayoutEnumerator to calculate paragraph bounds.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);
foreach (Bookmark bk in doc.Range.Bookmarks)
{
if (!bk.Name.StartsWith("_tmp"))
continue;
enumerator.Current = collector.GetEntity(bk.BookmarkStart);
while (enumerator.Type != LayoutEntityType.Line)
enumerator.MoveParent();
// Get the rectangle occupied by the first line of the paragraph.
RectangleF rect1 = enumerator.Rectangle;
// Now do the same with the last line.
enumerator.Current = collector.GetEntity(bk.BookmarkEnd);
while (enumerator.Type != LayoutEntityType.Line)
enumerator.MoveParent();
RectangleF rect2 = enumerator.Rectangle;
// The union of the rectangles is the region occupied by the paragraph.
RectangleF result = RectangleF.Union(rect1, rect2);
// For demonstration purposes, draw a rectangle shape around the paragraph.
Shape rectShape = new Shape(doc, ShapeType.Rectangle);
rectShape.StrokeColor = Color.Red;
rectShape.StrokeWeight = 1;
rectShape.Filled = false;
rectShape.Width = result.Width;
rectShape.Height = result.Height;
rectShape.WrapType = WrapType.None;
rectShape.RelativeHorizontalPosition = RelativeHorizontalPosition.Page;
rectShape.RelativeVerticalPosition = RelativeVerticalPosition.Page;
rectShape.Left = result.Left;
rectShape.Top = result.Top;
bk.BookmarkEnd.ParentNode.AppendChild(rectShape);
}
doc.Save(@"C:\Temp\out.docx");
Can you help me implement it in Java? When I converted it to Java, I couldn’t save the file, and it reported the error “Exception in thread "main" java.lang.IllegalStateException: A value with the specified key has already been added.”
@supeiwei Sure, here is the Java version of the code:
Document doc = new Document("C:\\Temp\\in.docx");
int tmpBkIndex = 0;
for (Paragraph p : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
{
// LayoutCollector and LayoutEnumerator work with nodes only in the main document body.
if (p.getAncestor(NodeType.HEADER_FOOTER) != null)
continue;
if (p.getAncestor(NodeType.SHAPE) != null)
continue;
String tmpBkName = "_tmp_" + tmpBkIndex;
tmpBkIndex++;
// Wrap the paragraph into a temporary bookmark.
p.prependChild(new BookmarkStart(doc, tmpBkName));
p.appendChild(new BookmarkEnd(doc, tmpBkName));
}
// Use LayoutCollector and LayoutEnumerator to calculate paragraph bounds.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);
for (Bookmark bk : doc.getRange().getBookmarks())
{
if (!bk.getName().startsWith("_tmp"))
continue;
enumerator.setCurrent(collector.getEntity(bk.getBookmarkStart()));
while (enumerator.getType() != LayoutEntityType.LINE)
enumerator.moveParent();
// Get the rectangle occupied by the first line of the paragraph.
Rectangle2D rect1 = enumerator.getRectangle();
// Now do the same with the last line.
enumerator.setCurrent(collector.getEntity(bk.getBookmarkEnd()));
while (enumerator.getType() != LayoutEntityType.LINE)
enumerator.moveParent();
Rectangle2D rect2 = enumerator.getRectangle();
// The union of the rectangles is the region occupied by the paragraph.
Rectangle2D result = rect1.createUnion(rect2);
// For demonstration purposes, draw a rectangle shape around the paragraph.
Shape rectShape = new Shape(doc, ShapeType.RECTANGLE);
rectShape.setStrokeColor(Color.RED);
rectShape.setStrokeWeight(1);
rectShape.setFilled(false);
rectShape.setWidth(result.getWidth());
rectShape.setHeight(result.getHeight());
rectShape.setWrapType(WrapType.NONE);
rectShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
rectShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);
rectShape.setLeft(result.getX());
rectShape.setTop(result.getY());
bk.getBookmarkEnd().getParentNode().appendChild(rectShape);
}
doc.save("C:\\Temp\\out.docx");
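As a side note on the union step: the two line rectangles are combined with the standard Rectangle2D.createUnion method from the JDK. Below is a minimal sketch with made-up coordinates. The class name UnionSketch, the helper paragraphBounds, and all numbers are assumptions for illustration only, not values from a real document. It shows that the resulting rectangle takes its top from the first line and its bottom from the last line.

```java
import java.awt.geom.Rectangle2D;

class UnionSketch {
    // Hypothetical helper: returns the smallest rectangle containing both line rectangles.
    static Rectangle2D paragraphBounds(Rectangle2D firstLine, Rectangle2D lastLine) {
        return firstLine.createUnion(lastLine);
    }

    public static void main(String[] args) {
        // Assumed sample values in points: x, y, width, height.
        Rectangle2D firstLine = new Rectangle2D.Double(72, 100, 450, 14);
        Rectangle2D lastLine = new Rectangle2D.Double(72, 142, 200, 14);

        Rectangle2D bounds = paragraphBounds(firstLine, lastLine);
        System.out.println("x=" + bounds.getX() + " y=" + bounds.getY()
                + " w=" + bounds.getWidth() + " h=" + bounds.getHeight());
    }
}
```

In the forum code, the two inputs would be the rectangles read from the enumerator at the bookmark start and the bookmark end.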
I’m sorry, the code still reports an error: “Exception in thread "main" java.lang.IllegalStateException: A value with the specified key has already been added.” It occurs in the file-saving code.
@supeiwei Could you please attach your input problematic document here for testing? We will check the issue and provide you more information.
I found that the error is caused by a structural problem in some documents. However, there is another problem: for paragraphs that cross a page boundary, the rectangle coordinates and the visualization are not correct. Can you help me solve it?
When a paragraph spans two pages, I want to get the text and rectangle of the part on the first page and the text and rectangle of the part on the second page, that is, get them separately.
@supeiwei Yes, the code provided above works only for paragraphs that are fully placed on a single page. You can use the LayoutCollector.getNumPagesSpanned method to determine the number of pages spanned by a node.
The following modified code also handles paragraphs that span two pages:
Document doc = new Document("C:\\Temp\\in.docx");
int tmpBkIndex = 0;
for (Paragraph p : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
{
// LayoutCollector and LayoutEnumerator work with nodes only in the main document body.
if (p.getAncestor(NodeType.HEADER_FOOTER) != null)
continue;
if (p.getAncestor(NodeType.SHAPE) != null)
continue;
String tmpBkName = "_tmp_" + tmpBkIndex;
tmpBkIndex++;
// Wrap the paragraph into a temporary bookmark.
p.prependChild(new BookmarkStart(doc, tmpBkName));
p.appendChild(new BookmarkEnd(doc, tmpBkName));
}
// Use LayoutCollector and LayoutEnumerator to calculate paragraph bounds.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);
for (Bookmark bk : doc.getRange().getBookmarks())
{
if (!bk.getName().startsWith("_tmp"))
continue;
Paragraph parentParagraph = (Paragraph)bk.getBookmarkEnd().getParentNode();
// The paragraph is fully placed on a single page.
if (collector.getNumPagesSpanned(parentParagraph) == 0)
{
enumerator.setCurrent(collector.getEntity(bk.getBookmarkStart()));
while (enumerator.getType() != LayoutEntityType.LINE)
enumerator.moveParent();
// Get the rectangle occupied by the first line of the paragraph.
Rectangle2D rect1 = enumerator.getRectangle();
// Now do the same with the last line.
enumerator.setCurrent(collector.getEntity(bk.getBookmarkEnd()));
while (enumerator.getType() != LayoutEntityType.LINE)
enumerator.moveParent();
Rectangle2D rect2 = enumerator.getRectangle();
// The union of the rectangles is the region occupied by the paragraph.
Rectangle2D result = rect1.createUnion(rect2);
System.out.println(result);
}
else
{
System.out.println("Paragraph occupies more than one page.");
// If the paragraph spans more than one page, you can get the line coordinates of the paragraph.
// Process the first part of the paragraph.
enumerator.setCurrent(collector.getEntity(bk.getBookmarkStart()));
while (enumerator.getType() != LayoutEntityType.LINE)
enumerator.moveParent();
Rectangle2D rect1 = enumerator.getRectangle();
while (enumerator.moveNext())
{
if (enumerator.getType() != LayoutEntityType.LINE)
break;
}
Rectangle2D rect2 = enumerator.getRectangle();
// The union of the rectangles is the region occupied by the first part of the paragraph.
Rectangle2D result = rect1.createUnion(rect2);
System.out.println(result);
// Process the second part of the paragraph.
enumerator.setCurrent(collector.getEntity(bk.getBookmarkEnd()));
while (enumerator.getType() != LayoutEntityType.LINE)
enumerator.moveParent();
rect1 = enumerator.getRectangle();
while (enumerator.movePrevious())
{
if (enumerator.getType() != LayoutEntityType.LINE)
break;
}
rect2 = enumerator.getRectangle();
// The union of the rectangles is the region occupied by the second part of the paragraph.
result = rect1.createUnion(rect2);
System.out.println(result);
}
}
doc.save("C:\\Temp\\out.docx");
For more complex cases, when a paragraph spans more than two pages, additional processing is required.
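One possible direction for the multi-page case, offered only as a sketch: collect every line of the paragraph together with the page it is on, then union the line rectangles page by page. The example below uses only the JDK. The LineBox holder, the boundsPerPage helper, and the sample coordinates are hypothetical. In a real program the page index and rectangle of each line would have to come from the layout enumerator, which this sketch does not attempt.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PerPageBounds {
    // Hypothetical holder for one layout line: the page it is on and its rectangle.
    static final class LineBox {
        final int pageIndex;
        final Rectangle2D rect;

        LineBox(int pageIndex, Rectangle2D rect) {
            this.pageIndex = pageIndex;
            this.rect = rect;
        }
    }

    // Unions the rectangles of all lines that share a page index.
    static Map<Integer, Rectangle2D> boundsPerPage(List<LineBox> lines) {
        Map<Integer, Rectangle2D> perPage = new TreeMap<>();
        for (LineBox line : lines) {
            // The first line on a page is stored as is; later lines are unioned into it.
            perPage.merge(line.pageIndex, line.rect, Rectangle2D::createUnion);
        }
        return perPage;
    }

    public static void main(String[] args) {
        // Assumed sample data: a paragraph whose lines fall on pages 1, 2 and 3.
        List<LineBox> lines = new ArrayList<>();
        lines.add(new LineBox(1, new Rectangle2D.Double(72, 700, 450, 14)));
        lines.add(new LineBox(1, new Rectangle2D.Double(72, 714, 450, 14)));
        lines.add(new LineBox(2, new Rectangle2D.Double(72, 72, 450, 14)));
        lines.add(new LineBox(2, new Rectangle2D.Double(72, 86, 300, 14)));
        lines.add(new LineBox(3, new Rectangle2D.Double(72, 72, 100, 14)));

        for (Map.Entry<Integer, Rectangle2D> entry : boundsPerPage(lines).entrySet()) {
            System.out.println("Page " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Because the grouping is keyed by page index, the same helper covers two pages, three pages, or more without special cases.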
Thank you very much, but I also want to divide the text content of the cross-page paragraph into two parts. How can I do that?
@supeiwei In this case you have to work with the Run nodes in the paragraph. For example, the following code inserts a paragraph break at the end of the page to split the paragraph into two parts.
Document doc = new Document("C:\\Temp\\in.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
// Split all Run nodes in the paragraph to make them not more than one word.
Node[] runs = doc.getChildNodes(NodeType.RUN, true).toArray();
for (Node r : runs)
{
Run current = (Run)r;
while (current.getText().indexOf(' ') >= 0)
current = SplitRun(current, current.getText().indexOf(' ') + 1);
}
// Now we can use collector and enumerator to get runs per line in MS Word document.
LayoutCollector collector = new LayoutCollector(doc);
for (Paragraph p : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
{
// LayoutCollector and LayoutEnumerator work with nodes only in the main document body.
if (p.getAncestor(NodeType.HEADER_FOOTER) != null)
continue;
if (p.getAncestor(NodeType.SHAPE) != null)
continue;
if (collector.getNumPagesSpanned(p) > 0)
{
int currentPage = collector.getStartPageIndex(p);
for (Run r : (Iterable<Run>)p.getChildNodes(NodeType.RUN, true))
{
int runPage = collector.getStartPageIndex(r);
if (runPage > currentPage)
{
builder.moveTo(r.getPreviousSibling());
builder.writeln();
currentPage = runPage;
}
}
}
}
doc.save("C:\\Temp\\out.docx");
private static Run SplitRun(Run run, int position)
{
Run afterRun = (Run)run.deepClone(true);
run.getParentNode().insertAfter(afterRun, run);
afterRun.setText(run.getText().substring(position));
run.setText(run.getText().substring(0, position));
return afterRun;
}
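If you only need the text of each part rather than a modified document, the same idea can be sketched on plain strings: split the text into word-sized pieces, then concatenate the pieces page by page. The example below is JDK-only and is an illustration under assumptions. The class SplitByPage, both helpers, and the page numbers are hypothetical; in the real code the page of each piece would be the value returned by collector.getStartPageIndex for the corresponding Run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitByPage {
    // Similar to the SplitRun loop above, but on a plain string:
    // every piece ends right after a space, and an empty trailing piece is skipped.
    static List<String> splitIntoWords(String text) {
        List<String> pieces = new ArrayList<>();
        String current = text;
        int space;
        while ((space = current.indexOf(' ')) >= 0) {
            pieces.add(current.substring(0, space + 1));
            current = current.substring(space + 1);
        }
        if (!current.isEmpty()) {
            pieces.add(current);
        }
        return pieces;
    }

    // Concatenates the pieces that share a page number, keeping the page order of first appearance.
    static Map<Integer, String> textPerPage(List<String> pieces, List<Integer> pageOfPiece) {
        Map<Integer, String> perPage = new LinkedHashMap<>();
        for (int i = 0; i < pieces.size(); i++) {
            perPage.merge(pageOfPiece.get(i), pieces.get(i), String::concat);
        }
        return perPage;
    }

    public static void main(String[] args) {
        List<String> pieces = splitIntoWords("one two three four");
        // Assumed page numbers: the first two words are on page 1, the rest on page 2.
        List<Integer> pages = Arrays.asList(1, 1, 2, 2);
        for (Map.Entry<Integer, String> entry : textPerPage(pieces, pages).entrySet()) {
            System.out.println("Page " + entry.getKey() + ": \"" + entry.getValue() + "\"");
        }
    }
}
```

This leaves the document untouched, which may be preferable when the paragraph breaks inserted by the code above are not wanted in the output file.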
Okay, thank you very much, but I found that when getting the paragraph text, if the text contains a number in link format, the number comes out garbled.
@supeiwei Could you please elaborate on the problem and provide your input, output, and expected output, along with code that will allow us to reproduce the problem?
As in this picture, what I want to get is the number and text content of this paragraph and its rectangle coordinates. But the program output is “the date on which the Available Facility has been terminated in accordance with Clause REF _Ref57226042 \w \h * MERGEFORMAT 2.6(b) (Cancellation of Available Commitment).” As a result, the paragraph number is missing and garbled characters appear within the paragraph text.
@supeiwei The question is already answered here:
https://forum.aspose.com/t/how-to-get-the-number-and-text-content-together-when-parsing-word-paragraph-text/293659/4
Thanks. Another problem is that the coordinates of the paragraph rectangle obtained are inaccurate. Do you know why?
@supeiwei The problem might occur because the fonts used in your document are not available in the environment where you process the document.
As you may know, MS Word documents are flow documents and do not contain any information about document layout. Consumer applications, like MS Word or Open Office, build the document layout on the fly. Aspose.Words uses its own layout engine to build the document layout while rendering the document to fixed-page formats (PDF, XPS, image, etc.). The same layout engine is used for providing document layout information via the LayoutCollector and LayoutEnumerator classes.
To build a proper document layout, the fonts used in the original document are required. If Aspose.Words cannot find the fonts used in the document, the fonts are substituted. This might lead to layout differences (incorrect coordinates returned by LayoutEnumerator), since the substitution fonts might have different font metrics. You can implement IWarningCallback to get a notification when font substitution is performed.
I found that this is not the problem. Rather, when the last line of a paragraph is completely full, the program treats the next line as the end line of the paragraph, so the rectangle of this paragraph becomes too tall, and subsequent paragraphs are affected as well. How can I solve this problem?
@supeiwei Could you please attach the problematic input document here for testing? We will check the issue and provide you more information.
I’m sorry, the document is a company secret and cannot be shared outside.