Suppose this is the text I need to extract from a document:
NOW THEREFORE the parties agree as follows:
1. All information disclosed by one party to the other to evaluate the Project and/or Purpose that is designated in writing as “Confidential” at the time of disclosure or if disclosed orally is designated in writing as “Confidential” within fifteen (15) days of disclosure is “Confidential Information.” Confidential Information does not include information which:
(a) was known by the receiving party prior to receipt from the disclosing party;
(b) is generally available in the public domain or thereafter becomes available to the public through no act of the receiving party;
(c) is independently discovered by an employee, agent, or representative of the receiving party who had no knowledge of the Confidential Information disclosed; or
(d) is made available to the receiving party as a matter of lawful right by a third party.
How do I go about this with Aspose.Words for Java? How do I determine the list labels and list text together, so that I can fetch a point like this from a document?
@ianjwilson
To extract an entire point along with its subpoints from a document using Aspose.Words for Java, you can utilize the document’s structure to identify the main list item and its associated subpoints. Here’s a clear and structured approach to achieve this:
Cause:
When working with Word documents, it is often necessary to extract structured information, such as list items and their subpoints, for better data manipulation or analysis.
Solution:
You can achieve this by following these steps:
- Load the Document: Start by loading your Word document using the Document class.
- Iterate Through Paragraphs: Loop through the paragraphs in the document to find the main list item (e.g., numbered list).
- Check for List Items: For each paragraph, check if it is part of a list by examining the paragraph’s list format.
- Extract Text: Once you identify a main list item, extract its text and look for its subpoints (sub-items) by checking the indentation level of subsequent paragraphs.
Here’s a sample code snippet to illustrate this:
import com.aspose.words.*;

public class ExtractListItems {
    public static void main(String[] args) throws Exception {
        // Load the document
        Document doc = new Document("YourDocument.docx");

        // Iterate through all paragraphs
        for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
            // Check if the paragraph is a list item
            if (para.getListFormat().getList() != null) {
                // Print the main list item
                System.out.println("Main Item: " + para.getText().trim());

                // Check for subpoints
                int currentLevel = para.getListFormat().getListLevelNumber();
                for (Node nextNode = para.getNextSibling(); nextNode != null; nextNode = nextNode.getNextSibling()) {
                    if (nextNode.getNodeType() == NodeType.PARAGRAPH) {
                        Paragraph subPara = (Paragraph) nextNode;
                        if (subPara.getListFormat().getList() != null &&
                                subPara.getListFormat().getListLevelNumber() > currentLevel) {
                            // Print subpoints
                            System.out.println(" Subpoint: " + subPara.getText().trim());
                        } else {
                            // Stop if we reach a different level
                            break;
                        }
                    }
                }
            }
        }
    }
}
Next Steps:
- Implement the provided code in your Java environment.
- Replace "YourDocument.docx" with the path to your actual Word document.
- Run the code to extract the main list items and their subpoints.
This approach should effectively help you extract the entire point along with its subpoints. If you have any further questions or need additional assistance, feel free to ask!
This shows all the points as main items:
Main Item: All information disclosed by one party to the other to evaluate the Project and/or Purpose that is designated in writing as “Confidential” at the time of disclosure or if disclosed orally is designated in writing as “Confidential” within fifteen (15) days of disclosure is “Confidential Information.” Confidential Information does not include information which:
Main Item: was known by the receiving party prior to receipt from the disclosing party;
Main Item: is generally available in the public domain or thereafter becomes available to the public through no act of the receiving party;
Main Item: is independently discovered by an employee, agent, or representative of the receiving party who had no knowledge of the Confidential Information disclosed; or
Main Item: is made available to the receiving party as a matter of lawful right by a third party.
Why is that? Is it because of how the document was created? @Professionalize.Discourse
@ianjwilson Could you please attach your input document and expected output? We will check your document and provide you with more information.
input.docx (6.9 KB)
There is no output docx. I just need to extract those points and subpoints from this larger doc, since they belong to the same point in the ordered list. I want to do the same everywhere: look for an ordered list, find the subpoints within it, and extract them as a whole.
The desired output from the doc is the one I shared above:
1. All information disclosed by one party to the other to evaluate the Project and/or Purpose that is designated in writing as “Confidential” at the time of disclosure or if disclosed orally is designated in writing as “Confidential” within fifteen (15) days of disclosure is “Confidential Information.” Confidential Information does not include information which:
(a) was known by the receiving party prior to receipt from the disclosing party;
(b) is generally available in the public domain or thereafter becomes available to the public through no act of the receiving party;
(c) is independently discovered by an employee, agent, or representative of the receiving party who had no knowledge of the Confidential Information disclosed; or
(d) is made available to the receiving party as a matter of lawful right by a third party.
@ianjwilson In your document, the items are not actually list items. They are plain text paragraphs. So you can loop through the paragraphs in your document and check whether their text starts with something that looks like a list label, such as "1." or "(a)".
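The label-prefix check described above can be sketched in plain Java. This is only an illustration under assumptions: the class name LabelTextGrouper and its group method are hypothetical and not part of Aspose.Words, the two patterns cover only labels shaped like "1." and "(a)", and the input strings stand in for paragraph text obtained with something like para.getText().trim() as in the earlier snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
class LabelTextGrouper {
    // Assumed shape of a top-level label: digits followed by a dot, e.g. "1." or "12."
    private static final Pattern MAIN = Pattern.compile("^\\d+\\.\\s");
    // Assumed shape of a sub-label: one lowercase letter in parentheses, e.g. "(a)"
    private static final Pattern SUB = Pattern.compile("^\\([a-z]\\)\\s");

    /** Groups each numbered paragraph with the lettered paragraphs that directly follow it. */
    static List<List<String>> group(List<String> paragraphs) {
        List<List<String>> points = new ArrayList<>();
        List<String> current = null;
        for (String raw : paragraphs) {
            String text = raw.trim();
            if (MAIN.matcher(text).find()) {
                // A new main point starts here
                current = new ArrayList<>();
                current.add(text);
                points.add(current);
            } else if (current != null && SUB.matcher(text).find()) {
                // A subpoint of the main point currently being collected
                current.add(text);
            } else {
                // Any other text ends the current point
                current = null;
            }
        }
        return points;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of(
            "NOW THEREFORE the parties agree as follows:",
            "1. Confidential Information does not include information which:",
            "(a) was known by the receiving party prior to receipt from the disclosing party;",
            "(d) is made available to the receiving party as a matter of lawful right by a third party.");
        for (List<String> point : group(paragraphs)) {
            System.out.println(String.join(System.lineSeparator(), point));
        }
    }
}
```

Documents that use other label styles, such as "(i)" or "A.", would need additional patterns.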
If they were actual list items, how would one go about getting them?
@ianjwilson You can detect list items using the Paragraph.IsListItem property. Then you can use the Paragraph.ListFormat properties to get the list formatting of the item.
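Once the list level and label of each paragraph are known, grouping a point with its subpoints is a single pass over the paragraphs. The sketch below shows that pass in plain Java so it runs without the library. Item and ListItemGrouper are hypothetical stand-ins, not Aspose.Words types. In Aspose.Words for Java the values would, as far as I know, come from para.isListItem(), para.getListFormat().getListLevelNumber() and, after calling doc.updateListLabels(), para.getListLabel().getLabelString(); please verify these against the API reference for your version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper types, not part of Aspose.Words.
class ListItemGrouper {
    /** Minimal stand-in for one paragraph; a negative level means "not a list item". */
    static final class Item {
        final int level;
        final String label;
        final String text;

        Item(int level, String label, String text) {
            this.level = level;
            this.label = label;
            this.text = text;
        }
    }

    /** Joins every level-0 item with the deeper-level items that directly follow it. */
    static List<String> group(List<Item> items) {
        List<String> points = new ArrayList<>();
        StringBuilder current = null;
        for (Item item : items) {
            if (item.level == 0) {
                // A new main point: flush the previous one, if any
                if (current != null) points.add(current.toString());
                current = new StringBuilder(item.label + " " + item.text);
            } else if (item.level > 0 && current != null) {
                // A subpoint belonging to the current main point
                current.append('\n').append(item.label).append(' ').append(item.text);
            } else {
                // A non-list paragraph (or an orphan subpoint) ends the current point
                if (current != null) points.add(current.toString());
                current = null;
            }
        }
        if (current != null) points.add(current.toString());
        return points;
    }
}
```

Unlike the earlier snippet, this treats only level-0 items as main points, so subpoints are not reported a second time as main items.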