We are working on a use case where we extract yellow-highlighted text from a document. However, the extracted text is segmented inaccurately, which introduces line breaks within sentences. For example, a sentence like “The quick brown fox jumped over the fence, and chased the sheep” is incorrectly segmented as:
“The quick brown fox jumped
over the fence, and chased the
sheep”
CODE
import com.aspose.words.Document;
import com.aspose.words.Field;
import com.aspose.words.NodeCollection;
import com.aspose.words.NodeType;
import com.aspose.words.Run;
import java.awt.Color;
import java.io.FileWriter;
import java.io.PrintWriter;

public static String ExtractHighlightedTextFromDocx(String docxFilePath) {
    // Path of the text file that receives the extracted text
    String fullPath = "/tmp/output_aspose.txt";
    try {
        // Load the document
        Document doc = new Document("/tmp/" + docxFilePath);
        // Unlink every field so that only its result text remains.
        // Note: this applies to all fields, not only "ADDIN EN.CITE" ones.
        for (Field field : doc.getRange().getFields()) {
            field.unlink();
        }
        // Get all runs from the document
        NodeCollection<Run> runs = doc.getChildNodes(NodeType.RUN, true);
        // try-with-resources closes the writer even if writing fails
        try (PrintWriter writer = new PrintWriter(new FileWriter(fullPath))) {
            // Iterate through all runs
            for (Run run : runs) {
                // Check if the run is highlighted in yellow
                Color highlight = run.getFont().getHighlightColor();
                if (highlight != null && highlight.equals(Color.YELLOW)) {
                    // Write the highlighted text, one run per line
                    String text = run.getText().trim();
                    writer.write(text + "\n");
                }
            }
        }
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
    return fullPath;
}
Paragraph-level extraction seems to be missing with this approach: highlighted text that belongs to a single paragraph is written out as several separate lines.
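
For illustration, here is a minimal sketch of what we mean by paragraph-level extraction. It does not call Aspose.Words at all. `RunInfo` and `highlightedTextPerParagraph` are hypothetical names, and a document is modeled as a list of paragraphs, each holding its runs. The assumption is that the highlighted sentence is stored as several runs, so writing one line per run breaks it apart, while joining the highlighted runs of each paragraph keeps it on one line. In Aspose.Words, the paragraphs would presumably come from `doc.getChildNodes(NodeType.PARAGRAPH, true)` and each paragraph's runs from `getRuns()`.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightGrouping {

    /** Hypothetical stand-in for a run: its text and whether it is highlighted yellow. */
    public static final class RunInfo {
        final String text;
        final boolean highlighted;

        public RunInfo(String text, boolean highlighted) {
            this.text = text;
            this.highlighted = highlighted;
        }
    }

    /**
     * Joins the highlighted runs of each paragraph into a single string,
     * so a sentence split across several runs stays on one line.
     * Paragraphs without highlighted text produce no output line.
     */
    public static List<String> highlightedTextPerParagraph(List<List<RunInfo>> paragraphs) {
        List<String> lines = new ArrayList<>();
        for (List<RunInfo> paragraph : paragraphs) {
            StringBuilder sb = new StringBuilder();
            for (RunInfo run : paragraph) {
                if (run.highlighted) {
                    // Do not trim individual runs: the spaces between them matter
                    sb.append(run.text);
                }
            }
            // Trim once per paragraph instead of once per run
            String text = sb.toString().trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<List<RunInfo>> doc = List.of(
            List.of(
                new RunInfo("The quick brown fox jumped ", true),
                new RunInfo("over the fence, and chased the ", true),
                new RunInfo("sheep", true)),
            List.of(new RunInfo("This paragraph is not highlighted.", false)));

        // Prints: The quick brown fox jumped over the fence, and chased the sheep
        highlightedTextPerParagraph(doc).forEach(System.out::println);
    }
}
```

The per-paragraph `StringBuilder` is the key difference from our code above, which writes and trims each run separately.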