Thank you very much, but I also want to divide the text content of the cross-page paragraph into two parts. How should I get it?
@supeiwei In this case you have to work with Run
nodes in the paragraph. For example the following code inserts a paragraph break at the end of the page to split paragraph into two parts.
Document doc = new Document("C:\\Temp\\in.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
// Split all Run nodes in the paragraph to make them not more than one word.
Node[] runs = doc.getChildNodes(NodeType.RUN, true).toArray();
for (Node r : runs)
{
Run current = (Run)r;
while (current.getText().indexOf(' ') >= 0)
current = SplitRun(current, current.getText().indexOf(' ') + 1);
}
// Now we can use collector and enumerator to get runs per line in MS Word document.
LayoutCollector collector = new LayoutCollector(doc);
for (Paragraph p : (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true))
{
// LayoutCollector and LayoutEnumerator work with nodes only in the main document body.
if (p.getAncestor(NodeType.HEADER_FOOTER) != null)
continue;
if (p.getAncestor(NodeType.SHAPE) != null)
continue;
if (collector.getNumPagesSpanned(p) > 0)
{
int currentPage = collector.getStartPageIndex(p);
for (Run r : (Iterable<Run>)p.getChildNodes(NodeType.RUN, true))
{
int runPage = collector.getStartPageIndex(r);
if (runPage > currentPage)
{
builder.moveTo(r.getPreviousSibling());
builder.writeln();
currentPage = runPage;
}
}
}
}
doc.save("C:\\Temp\\out.docx");
private static Run SplitRun(Run run, int position)
{
Run afterRun = (Run)run.deepClone(true);
run.getParentNode().insertAfter(afterRun, run);
afterRun.setText(run.getText().substring(position));
run.setText(run.getText().substring(0, position));
return afterRun;
}
Okay, thank you very much, but I found that when getting the paragraph text, if there is a number in the link format in the text, the number will be garbled.
@supeiwei Could you please elaborate the problem and provide your input, output and expected output along with code that will allow us to reproduce the problem?
Like this picture, what I want to get is the number and text content of this paragraph and its rectangular coordinates. But the program output is ‘the date on which the Available Facility has been terminated in accordance with Clause REF _Ref57226042 \w \h * MERGEFORMAT 2.6(b) (Cancellation of Available Commitment).’ In this way, it cannot get it to the paragraph number, and garbled characters appear within the paragraph.
@supeiwei The question is already answered here:
https://forum.aspose.com/t/how-to-get-the-number-and-text-content-together-when-parsing-word-paragraph-text/293659/4
Thanks. Another problem is that the coordinates of the paragraph rectangle obtained are inaccurate. Do you know why?
@supeiwei The problem might occur because fonts used in your document are not available in the environment where you process the document.
As you may know, MS Word documents are flow documents and do not contain any information about document layout. The consumer applications, like MS Word or Open Office builds document layout on the fly. Aspose.Words uses it’s own layout engine to build document layout while rendering the document to fixed page formats (PDF, XPS, Image etc.). The same layout engine is used for providing document layout information via LayoutCollector
and LayoutEnumerator
classes.
To built proper document layout the fonts used in the original document are required. If Aspose.Words cannot find the fonts used in the document the fonts are substituted . This might lead into the layout difference (incorrect coordinates returned by LayoutEnumerator
), since substitution fonts might have different font metrics. You can implement IWarningCallback to get a notification when font substitution is performed.
I found that this is not the problem, but that the last line of some paragraphs is full, but the program will regard the next line of the paragraph as the end line of the paragraph, causing the height of the rectangle of this paragraph to become higher, and subsequent paragraphs will be affected by this. How to solve this problem
@supeiwei Could you please attach the problematic input document here for testing? We will check the issue and provide you more information.
I’m sorry, the document is a company secret and cannot be shared outside.
@supeiwei Unfortunately, without the problematic document it is impossible to analyze the problem. I will move your messages to a separate topic. It is safe to attach documents in the forum, only topic starter and Aspose staff can access the attachments.
I found the reason. It is because of the page margin settings, which may be different from the default page margins of Aspose. Can you tell me how to adjust the page margins to suit my own files?
@supeiwei Aspose.Words uses page margins specified in the source file. It does not uses default margins.
test.docx (14.1 KB)
Like this file, the rectangular frame obtained by the first paragraph does not match the actual one. Its rectangular height is higher than the actual one.
@supeiwei Thank you for additional information. As I can see paragraph rectangles are detected properly:
out.docx (11.8 KB)
No, the paragraph rectangle is not detected correctly. The height of the rectangle of paragraph (a) is obviously too high and does not correctly frame paragraph (a). Can you help me solve this problem?
@supeiwei The frame is correct since it also includes the paragraph’s spacings. If you select the paragraph’s content in MS Word you will see this: