Issue while extracting paras and html from word document

ashu_agrawal_sirionlabs_com · April 29, 2024, 3:28am

Hi Team,

There seems to be an inconsistency between the paragraphs and the HTML extracted from a document.

When I extract paragraphs from the document, I get separate paragraphs, whereas the HTML extracted from the same document combines those paragraphs in some cases. Could you please check?

I am attaching the document and code for reference. For example, the issue occurs for the following text.

In the HTML, it is concatenated into a single paragraph:
By XPO. XPO covenants and agrees with Supplier that during the Term and the Termination Assistance Period XPO shall comply, in all material respects, with all Laws applicable to XPO, and, except as otherwise provided in this Agreement, shall obtain all applicable material permits and licenses required of XPO in connection with its obligations under this Agreement.

Whereas, when reading paragraphs using the Aspose API, ‘By XPO’ is returned as a separate paragraph and is not concatenated with the text that follows.

htmlwordextractionissue.7z (63.7 KB)

Code :

public static void main(String[] args) throws Exception {
        com.aspose.words.License license = new com.aspose.words.License();
        license.setLicense("/home/saurabharora/Downloads/Aspose.Total.Product.Family.lic");
        com.aspose.words.Document document = new com.aspose.words.Document("/home/saurabharora/Downloads/htmlwordextractionissue.docx");

        document.save("/home/saurabharora/Downloads/First Attachment_test.docx");

        for (Paragraph para : (Iterable<Paragraph>) document.getChildNodes(NodeType.PARAGRAPH, true)) {
            if(para.getText().startsWith("By XPO")){
                System.out.println("text found");
            }

            System.out.println(para.getText().trim());
        }

        HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
        opts.setExportPageSetup(true);
        opts.setExportListLabels(ExportListLabels.BY_HTML_TAGS);
        opts.setExportImagesAsBase64(false);
        opts.setExportFontsAsBase64(true);
        opts.setExportTocPageNumbers(true);
        opts.setExportPageMargins(true);
        opts.setExportShapesAsSvg(true);
        opts.setExportHeadersFootersMode(ExportHeadersFootersMode.FIRST_PAGE_HEADER_FOOTER_PER_SECTION);
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        document.save(byteArrayOutputStream, opts);
        String html = byteArrayOutputStream.toString(StandardCharsets.UTF_8);
        System.out.println(html);
}

alexey.noskov · April 29, 2024, 4:11am

@ashu_agrawal_sirionlabs_com This behavior is expected. The "By XPO" text is a separate paragraph in the Document Object Model:

<w:p>
	<w:pPr>
		<w:pStyle w:val="4"/>
		<w:numPr>
			<w:ilvl w:val="1"/>
			<w:numId w:val="1"/>
		</w:numPr>
		<w:tabs>
			<w:tab w:val="left" w:pos="2160"/>
		</w:tabs>
		<w:rPr>
			<w:vanish/>
			<w:color w:val="FF0000"/>
		</w:rPr>
	</w:pPr>
	<w:bookmarkStart w:id="38" w:name="_Toc85356572"/>
	<w:bookmarkStart w:id="39" w:name="_Ref189646383"/>
	<w:bookmarkStart w:id="40" w:name="_Toc1645603374"/>
	<w:r>
		<w:t xml:space="preserve">By </w:t>
	</w:r>
	<w:bookmarkEnd w:id="38"/>
	<w:bookmarkEnd w:id="39"/>
	<w:r>
		<w:t>XPO</w:t>
	</w:r>
	<w:bookmarkEnd w:id="40"/>
</w:p>
<w:p>
	<w:pPr>
		<w:pStyle w:val="3"/>
		<w:tabs>
			<w:tab w:val="left" w:pos="2160"/>
		</w:tabs>
		<w:ind w:firstLine="720"/>
	</w:pPr>
	<w:r>
		<w:rPr>
			<w:b/>
		</w:rPr>
		<w:t>.</w:t>
	</w:r>
	<w:r>
		<w:t xml:space="preserve">  XPO covenants and agrees with Supplier that during the Term and the Termination Assistance Period XPO shall comply, in all material respects, with all Laws applicable to XPO, and, except as otherwise provided in this Agreement, shall obtain all applicable material permits and licenses required of XPO in connection with its obligations under this Agreement.</w:t>
	</w:r>
</w:p>

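However, the paragraph break of the first paragraph is marked with <w:vanish/>, i.e. it is hidden, which is why the two paragraphs appear joined in the HTML output. You can check this with para.getParagraphBreakFont().getHidden().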

ashu_agrawal_sirionlabs_com · April 29, 2024, 4:38am

Thanks for the reply.

Can we set it to not hidden by iterating over all paragraphs first and then performing our processing on the document?

alexey.noskov · April 29, 2024, 6:08am

@ashu_agrawal_sirionlabs_com Sure, you can set this property to false, but the text will still be represented as separate paragraphs. If you need to join the text of such paragraphs, check whether the paragraph break is hidden using the property mentioned above and, if it is, append the text of the next paragraph to the current one.
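
The joining logic described above can be sketched roughly as follows. This is only an illustration, not Aspose code: the `HiddenBreakJoiner` class and the `Para` record are hypothetical names made up for this example, and the sketch assumes you have already collected, for each paragraph, its text (e.g. `para.getText().trim()`) and the value of `para.getParagraphBreakFont().getHidden()`.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenBreakJoiner {

    /**
     * Hypothetical holder for one paragraph: its text and whether its
     * paragraph break is hidden. With Aspose.Words these values would come
     * from para.getText().trim() and para.getParagraphBreakFont().getHidden().
     */
    public record Para(String text, boolean breakHidden) {}

    /**
     * Joins each paragraph whose break is hidden with the paragraph(s) that
     * follow it, so the result matches what is visually one paragraph.
     */
    public static List<String> joinHiddenBreaks(List<Para> paras) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean pending = false; // true while text is waiting for a visible break

        for (Para p : paras) {
            current.append(p.text());
            if (p.breakHidden()) {
                // Hidden break: the next paragraph continues this one.
                pending = true;
            } else {
                // Visible break: the logical paragraph ends here.
                result.add(current.toString());
                current.setLength(0);
                pending = false;
            }
        }
        // Flush trailing text if the last paragraph had a hidden break.
        if (pending) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened sample text modeled on the two paragraphs from the document.
        List<Para> paras = List.of(
                new Para("By XPO", true),
                new Para(". XPO covenants and agrees with Supplier", false),
                new Para("Another paragraph", false));

        for (String joined : joinHiddenBreaks(paras)) {
            System.out.println(joined);
        }
    }
}
```

Keeping the join in a separate step over plain values leaves the document itself unchanged, so the HTML export is unaffected by it.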