Challenge in extracting highlighted text: Split line issue

cacglo · May 9, 2024, 5:48am

We are working on a use case where we extract yellow-highlighted text from a document. However, the text is segmented inaccurately, which introduces line breaks within sentences. For example, a sentence like “The quick brown fox jumped over the fence, and chased the sheep” is incorrectly split as:

“The quick brown fox jumped
over the fence, and chased the
sheep”

CODE

import com.aspose.words.Document;
import com.aspose.words.Field;
import com.aspose.words.NodeCollection;
import com.aspose.words.NodeType;
import com.aspose.words.Run;
import java.awt.Color;
import java.io.FileWriter;
import java.io.PrintWriter;

public static String ExtractHighlightedTextFromDocx(String docxFilePath) {
    // Path of the text file that receives the extracted text
    String fullPath = "/tmp/output_aspose.txt";
    try {
        // Load the document
        Document doc = new Document("/tmp/" + docxFilePath);

        // Unlink every field in the document (not only "ADDIN EN.CITE" fields),
        // so that only the field results remain as plain text.
        for (Field field : doc.getRange().getFields()) {
            field.unlink();
        }

        // Get all runs from the document
        NodeCollection<Run> runs = doc.getChildNodes(NodeType.RUN, true);

        // try-with-resources closes the writer even if an exception is thrown
        try (PrintWriter writer = new PrintWriter(new FileWriter(fullPath))) {
            for (Run run : runs) {
                // Check if the run is highlighted in yellow (null-safe comparison)
                Color highlight = run.getFont().getHighlightColor();
                if (Color.YELLOW.equals(highlight)) {
                    // Each highlighted run is written on its own line
                    writer.write(run.getText().trim() + "\n");
                }
            }
        }
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
    return fullPath;
}

Paragraph-level extraction seems to be missing with this approach: each highlighted fragment ends up on its own line instead of the sentence staying together.

alexey.noskov · May 9, 2024, 11:35am

@cacglo Text in a sentence is not necessarily represented as a single Run, even if the whole text has the same formatting. You can try the Document.joinRunsWithSameFormatting method, which joins runs with the same formatting in all paragraphs of the document.

cacglo · May 21, 2024, 3:42am

How can I use this in the code?

alexey.noskov · May 21, 2024, 4:30am

@cacglo Just call this method after loading the document and before any further processing of the document.
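
To make the ordering concrete: in the code from the first post, this would presumably mean adding a line such as doc.joinRunsWithSameFormatting(); directly after the Document is constructed, before the fields are unlinked and the runs are collected. The sketch below is not Aspose code. It is a small library-free model of the same idea, so it can be run without a license or a .docx file. The TextRun class and the joinAdjacentRuns and extractHighlighted helpers are hypothetical stand-ins invented for illustration, and the model compares only a single highlight flag where a real document compares the full run formatting.

```java
import java.util.ArrayList;
import java.util.List;

public class JoinRunsSketch {

    // Hypothetical stand-in for a Word run: a text fragment plus one formatting flag.
    static final class TextRun {
        final String text;
        final boolean highlighted;

        TextRun(String text, boolean highlighted) {
            this.text = text;
            this.highlighted = highlighted;
        }
    }

    // Stand-in for the "join first" step: merges neighbouring runs of one paragraph
    // whose formatting (here: only the highlight flag) is identical.
    static List<TextRun> joinAdjacentRuns(List<TextRun> runs) {
        List<TextRun> joined = new ArrayList<>();
        for (TextRun run : runs) {
            int last = joined.size() - 1;
            if (last >= 0 && joined.get(last).highlighted == run.highlighted) {
                TextRun previous = joined.get(last);
                joined.set(last, new TextRun(previous.text + run.text, run.highlighted));
            } else {
                joined.add(run);
            }
        }
        return joined;
    }

    // Same logic as the extraction loop in the question: one output line per highlighted run.
    static List<String> extractHighlighted(List<TextRun> runs) {
        List<String> lines = new ArrayList<>();
        for (TextRun run : runs) {
            if (run.highlighted) {
                lines.add(run.text.trim());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // One paragraph whose highlighted sentence is stored as three separate runs.
        List<TextRun> paragraph = List.of(
                new TextRun("The quick brown fox jumped ", true),
                new TextRun("over the fence, and chased the ", true),
                new TextRun("sheep", true),
                new TextRun(" and some text without highlighting.", false));

        // Without joining: the sentence is split across several output lines.
        extractHighlighted(paragraph).forEach(System.out::println);
        System.out.println("---");
        // Joining first: the highlighted sentence comes out as a single line.
        extractHighlighted(joinAdjacentRuns(paragraph)).forEach(System.out::println);
    }
}
```

Note that in this model, as one would expect in a real document, runs are merged only when their formatting matches. If part of the highlighted sentence also differs in some other attribute (bold, for example), those runs would likely stay separate, and collecting the highlighted runs paragraph by paragraph may be the more robust option.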