How to remove hidden TOC bookmarks from the document using Java

rcomniscien · May 26, 2020, 5:02am

Hi,

I need to count words per page. But when I extract text from a DOC file, the output contains automatically generated objects such as:

HYPERLINK \l "_Toc41130032" Content1 PAGEREF _Toc41130032 \h 1
HYPERLINK "http://www.virginia.edu/registrar/forms/coursecataloginstructions.doc" CCI Instructions form
FORMCHECKBOX  School/College  
FORMTEXT      Term/Year

My code is below. For WordsPageSplitter, I referred to this post: https://forum.aspose.com/t/extract-text-for-each-page/204962/2

try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(sFileName, false), StandardCharsets.UTF_8))) {
    WordsPageSplitter splitter = new WordsPageSplitter(doc);
    for (int page = 1; page <= doc.getPageCount(); page++) {
        com.aspose.words.Document pageDoc = splitter.getDocumentOfPage(page);
        // Raw text of this page; the field codes shown above appear in this string.
        String contents = pageDoc.getText();
    }
}

regards,
Rapeepan

tahir.manzoor · May 26, 2020, 12:20pm

@rcomniscien

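Please note that Aspose.Words mimics the behavior of MS Word. Names such as “_Toc41130032” are hidden bookmarks that belong to the table of contents, and the HYPERLINK and PAGEREF entries that refer to them are the field codes of the TOC entries. To keep these objects out of the extracted text, remove the TOC field from the document with the Field.Remove method and then call the Document.UpdateFields method before extracting the text (in the Java API these are Field.remove() and Document.updateFields()).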
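
Removing the TOC field deals with the “_Toc” entries, but the sample output also shows FORMCHECKBOX, FORMTEXT and an external HYPERLINK, which are field codes of other fields. As a complement, here is a minimal sketch of a text-level cleanup. It is not part of the Aspose.Words API: the class name FieldCodeStripper and the method stripFieldCodes are hypothetical, and the sketch assumes the extracted string still carries Word's field control characters (U+0013 field start, U+0014 field separator, U+0015 field end), as the text returned by getText() normally does. It drops everything between a field start and its separator and keeps the field result.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not an Aspose.Words class.
final class FieldCodeStripper {
    // Word's field control characters (assumed to be present in the extracted text).
    private static final char FIELD_START = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    private FieldCodeStripper() { }

    /** Returns the text with field codes removed and field results kept. */
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while inside its code, FALSE once inside its result.
        Deque<Boolean> fields = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields that are still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_START) {
                fields.push(Boolean.TRUE);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost open field from code to result.
                if (!fields.isEmpty() && fields.peek()) {
                    fields.pop();
                    fields.push(Boolean.FALSE);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; it may have had no separator at all.
                if (!fields.isEmpty() && fields.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                // Ordinary text, or result text of fields nested only inside results.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

In the loop from the question it could be applied as `String contents = FieldCodeStripper.stripFieldCodes(pageDoc.getText());`. A stack is used because a TOC entry nests a PAGEREF field inside the result of a HYPERLINK field, so a single in-code flag would not be enough.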