Free Support Forum - aspose.com

How to remove hidden TOC bookmarks from the document using Java

Hi,

I need to count words per page. When I extract text from a DOC file, the output contains automatically generated field codes such as:
HYPERLINK \l “_Toc41130032” Content1 PAGEREF _Toc41130032 \h 1
HYPERLINK “http://www.virginia.edu/registrar/forms/coursecataloginstructions.doc” CCI Instructions form
FORMCHECKBOX School/College
FORMTEXT Term/Year

My code is below. The WordsPageSplitter class is taken from this topic: Extract text for each page

    // try-with-resources closes the writer; exceptions propagate to the caller unchanged
    try (Writer writer = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(sFileName, false), StandardCharsets.UTF_8))) {
        WordsPageSplitter splitter = new WordsPageSplitter(doc);
        for (int page = 1; page <= doc.getPageCount(); page++) {
            // Build a standalone document that holds only this page
            com.aspose.words.Document pageDoc = splitter.getDocumentOfPage(page);
            // getText() returns the page text, including the field codes shown above
            String contents = pageDoc.getText();
        }
    }

regards,
Rapeepan

@rcomniscien

Please note that Aspose.Words mimics the behavior of MS Word. Entries such as “_Toc41130032” are the hidden bookmarks and hyperlinks that belong to the table of contents. To keep them out of the extracted text, remove the TOC field from the document with the Field.remove method and then call the Document.updateFields method.
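
If changing the document is not an option, a different technique from the field removal suggested above is to filter the extracted text itself. The following is only a minimal sketch using the plain Java standard library, not an Aspose.Words API. It assumes that getText() marks each field with the control characters 0x13 (field start), 0x14 (separator between field code and field result) and 0x15 (field end); those values, the class name FieldCodeFilter and both method names are assumptions made for illustration. Field codes are dropped and field results (the visible text) are kept, so the remaining words can be counted.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldCodeFilter {
    // Assumed field markers in the extracted text (not confirmed by the thread).
    static final char FIELD_START = (char) 0x13;
    static final char FIELD_SEPARATOR = (char) 0x14;
    static final char FIELD_END = (char) 0x15;

    /** Drops field codes and field markers; keeps field results and ordinary text. */
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are still inside its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields currently in their code part
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_START) {
                inCode.push(Boolean.TRUE);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // The innermost field switches from code to result.
                if (!inCode.isEmpty() && inCode.peek()) {
                    inCode.pop();
                    inCode.push(Boolean.FALSE);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator ends while still in its code part.
                if (!inCode.isEmpty() && inCode.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Counts whitespace-separated words after field codes have been stripped. */
    static int countWords(String text) {
        String cleaned = stripFieldCodes(text).trim();
        return cleaned.isEmpty() ? 0 : cleaned.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Hand-built sample that imitates the TOC entry and form field from the question.
        String sample = FIELD_START + "HYPERLINK \\l \"_Toc41130032\"" + FIELD_SEPARATOR
                + "Content1\t"
                + FIELD_START + "PAGEREF _Toc41130032 \\h" + FIELD_SEPARATOR + "1"
                + FIELD_END + FIELD_END + "\r"
                + FIELD_START + "FORMTEXT" + FIELD_SEPARATOR + "Term/Year" + FIELD_END;
        System.out.println(countWords(sample)); // prints 3
    }
}
```

In the per-page loop from the question, this would be called as countWords(pageDoc.getText()). Unlike removing the TOC field, this keeps the visible TOC entries and page numbers in the count, so the two approaches can give different totals.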