Hi Team,
There seems to be an inconsistency between the paragraphs and the HTML extracted from the same document.
When I extract paragraphs from the document, I get separate paragraphs, whereas the HTML exported from the document combines some of those paragraphs (in some cases). Can you please check?
I am attaching the document and code for reference. For example, the issue occurs for this text:
In the HTML, this is concatenated into a single paragraph:
By XPO. XPO covenants and agrees with Supplier that during the Term and the Termination Assistance Period XPO shall comply, in all material respects, with all Laws applicable to XPO, and, except as otherwise provided in this Agreement, shall obtain all applicable material permits and licenses required of XPO in connection with its obligations under this Agreement.
Whereas, when reading paragraphs using the Aspose API, ‘By XPO’ is returned as a separate paragraph and is not concatenated with the text that follows it.
htmlwordextractionissue.7z (63.7 KB)
Code:
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import com.aspose.words.ExportHeadersFootersMode;
import com.aspose.words.ExportListLabels;
import com.aspose.words.HtmlSaveOptions;
import com.aspose.words.NodeType;
import com.aspose.words.Paragraph;
import com.aspose.words.SaveFormat;

// The class name is arbitrary; it only wraps main so the snippet compiles.
public class HtmlWordExtractionIssue {
    public static void main(String[] args) throws Exception {
        com.aspose.words.License license = new com.aspose.words.License();
        license.setLicense("/home/saurabharora/Downloads/Aspose.Total.Product.Family.lic");
        com.aspose.words.Document document = new com.aspose.words.Document("/home/saurabharora/Downloads/htmlwordextractionissue.docx");
        document.save("/home/saurabharora/Downloads/First Attachment_test.docx");

        // Paragraph view: "By XPO" is printed as its own paragraph here.
        for (Paragraph para : (Iterable<Paragraph>) document.getChildNodes(NodeType.PARAGRAPH, true)) {
            if (para.getText().startsWith("By XPO")) {
                System.out.println("text found");
            }
            System.out.println(para.getText().trim());
        }

        // HTML view: the same text is joined with the paragraph that follows it.
        HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
        opts.setExportPageSetup(true);
        opts.setExportListLabels(ExportListLabels.BY_HTML_TAGS);
        opts.setExportImagesAsBase64(false);
        opts.setExportFontsAsBase64(true);
        opts.setExportTocPageNumbers(true);
        opts.setExportPageMargins(true);
        opts.setExportShapesAsSvg(true);
        opts.setExportHeadersFootersMode(ExportHeadersFootersMode.FIRST_PAGE_HEADER_FOOTER_PER_SECTION);
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        document.save(byteArrayOutputStream, opts);
        // toString(Charset) requires Java 10 or later.
        String html = byteArrayOutputStream.toString(StandardCharsets.UTF_8);
        System.out.println(html);
    }
}
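
One rough way to list every affected paragraph, rather than just the ‘By XPO’ case, is to compare the paragraph texts against the <p> elements in the exported HTML. The sketch below uses only the JDK. MergedParagraphFinder, findMergedPairs and normalize are hypothetical names and not part of Aspose.Words, and the regex matching assumes simple, non-nested <p> markup with no entity decoding beyond non-breaking spaces, so treat it as a diagnostic aid only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of Aspose.Words.
final class MergedParagraphFinder {

    // Matches one <p ...>...</p> element; DOTALL lets the body span several lines.
    private static final Pattern P_BLOCK =
            Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Strips inline tags, maps non-breaking spaces to plain spaces, collapses whitespace.
    static String normalize(String s) {
        return s.replaceAll("<[^>]+>", "")
                .replace("&#xa0;", " ")
                .replace("&nbsp;", " ")
                .replace('\u00a0', ' ')
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Returns every index i such that paragraphs i and i+1 both occur,
    // in that order, inside the text of a single <p> element.
    static List<Integer> findMergedPairs(List<String> paragraphs, String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = P_BLOCK.matcher(html);
        while (m.find()) {
            blocks.add(normalize(m.group(1)));
        }

        List<Integer> merged = new ArrayList<>();
        for (int i = 0; i + 1 < paragraphs.size(); i++) {
            String first = normalize(paragraphs.get(i));
            String second = normalize(paragraphs.get(i + 1));
            if (first.isEmpty() || second.isEmpty()) {
                continue; // empty paragraphs would match everywhere
            }
            for (String block : blocks) {
                int at = block.indexOf(first);
                if (at >= 0 && block.indexOf(second, at + first.length()) >= 0) {
                    merged.add(i);
                    break;
                }
            }
        }
        return merged;
    }
}
```

To use it with the code above, collect the para.getText() values into a list inside the loop and pass that list together with the html string to findMergedPairs; each returned index points at a paragraph that the HTML joins with the one after it.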