Convert a big word document into multiple HTML documents

Hello,
We are building a Java application, and I want to convert a big Word document into separate HTML documents based on the page breaks in the Word document.

My idea is:

  1. Convert a big Word document into multiple Word documents based on page breaks, i.e., a Word document with 3 page breaks will be split into 4 separate Word documents.
  2. Convert the separate documents into HTML documents.

Word gurus, feel free to suggest alternative solutions. The input is a big Word document and the output is separate HTML documents.

We are looking into Aspose.Words for this. Any help will be appreciated.
Thanks,

Hi,
Thank you for considering Aspose.Words. As far as I can tell from your description, the main problem is finding the best way to extract document contents for subsequent HTML conversion. Using page breaks is certainly possible, but I don’t think it is the easiest way. I can see at least two alternative solutions:

  1. Use continuous section breaks. This will only require importing several sections to a bunch of small documents. However, it will be hard to determine what sections to extract if your large document already consists of several sections.

  2. Use bookmarks to mark the contents to extract. We have a nice code sample which shows how to perform this type of extraction:

    https://docs.aspose.com/words/net/working-with-bookmarks/

After you have created documents to convert, all you need to do is save them to the HTML format:
https://docs.aspose.com/words/net/convert-a-document-to-html-mhtml-or-epub/
Do you think the alternative approaches suggested above are applicable in your case? Also, feel free to post further questions if you have any.

Thanks for the response!

Yes, you are right about the expectations. We have huge Word docs that we want to convert into HTML documents so that they can be read in a browser. A 100-page Word document will make a huge HTML document, and that won’t be a good user experience. So we want to convert the Word document into multiple HTML pages based upon page breaks. I say page breaks because the clients will insert page breaks in the Word document. Unfortunately this is fixed, because the client will only supply us with already created Word documents. We cannot ask them to insert section breaks or bookmarks.

So now my question is: how hard will it be to detect a page break in a Word document and then split the document there? Does the API contain any methods that may be helpful? Regardless of difficulty, is it even possible?

What if we save the big Word document as an XML file (the MS XML format)? Can we use XSLT to break it up and convert it into HTML? I am not sure whether XSLT can detect a page break either.

Thanks a lot.

Yes, it is possible to break the document into pages if they are separated by page breaks. I have posted the related Aspose.Words source code here:
https://reference.aspose.com/words/net/aspose.words/document/extractpages/
Concerning XSLT: as far as I know, WordML to HTML conversion with XSLT is very complex. The only implementation to date was done by Microsoft itself. It is called Word 2003 XML Viewer. I don’t know if it is feasible to adapt it to your task, though.
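On the narrower question of whether an XML-based approach can see page breaks at all: in Word 2003 XML a manual page break appears as a w:br element with w:type="page", so an XPath expression (and therefore an XSLT template) can match it. The following is only a minimal sketch using the JDK's built-in XPath support. The class and method names are made up for illustration, it only counts the breaks, and it does not attempt the much harder WordML to HTML conversion.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: counts manual page breaks in a Word 2003 XML string.
class WordMlPageBreaks {
    // Namespace used by Word 2003 XML (WordprocessingML 2003).
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    static int countPageBreaks(String wordMl) {
        try {
            // Parse with namespace awareness so the "w:" prefix resolves.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wordMl)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return W_NS.equals(uri) ? "w" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.<String>emptyIterator();
                }
            });

            // Plain line breaks (w:br without w:type="page") are not counted.
            Double count = (Double) xpath.evaluate(
                    "count(//w:br[@w:type='page'])", doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read WordML input", e);
        }
    }
}
```

The same match expression, w:br[@w:type='page'], is what an XSLT stylesheet would have to key on. Page breaks set through paragraph formatting ("page break before") are not covered by this sketch.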
Hope this helps,

Hello,
Thanks a lot for the information and sample code. I was able to create a custom Java class that breaks a document by page breaks. The way MS Word treats page breaks is not very easy to understand. But after some trial and error, and printing the Word document as flat text, I was able to figure out the mystery behind page breaks.

If anyone needs it, I can share the code with you.
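For readers following along, the flat-text view mentioned above can be illustrated with a small sketch. It is not the class described in this post. It assumes that a manual page break shows up in the flat text as a form feed character (U+000C), and it splits a plain string rather than a real document. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits flat text into pages at form feed characters.
class PageBreakSplitter {
    // Assumption: a manual page break appears as a form feed (U+000C) in flat text.
    static final char PAGE_BREAK = '\f';

    static List<String> splitOnPageBreaks(String flatText) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < flatText.length(); i++) {
            if (flatText.charAt(i) == PAGE_BREAK) {
                // Close the current page just before the break character.
                pages.add(flatText.substring(start, i));
                start = i + 1;
            }
        }
        // Whatever follows the last break is the final page (possibly empty).
        pages.add(flatText.substring(start));
        return pages;
    }

    public static void main(String[] args) {
        // Three page breaks produce four pages.
        List<String> pages = splitOnPageBreaks("Intro\fChapter 1\fChapter 2\fAppendix");
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

A real splitter has to work on document nodes instead of strings, and section breaks may use the same character, so this only shows the counting logic: n page breaks give n + 1 pieces.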

I have another question for the Aspose folks. You know the basic idea: split a big Word document into multiple Word documents based on page breaks and then convert them into HTML. The challenge now is to update the TOC links so that they become HTTP links to the individual HTML pages.

The question is: is there any way in the Aspose API to grab a link in a Word document, modify it, and ultimately convert it into an HTTP link?

I searched this forum and saw the following link mentioned in a couple of places, but unfortunately it doesn’t work anymore:
Github

Thanks a lot.

Hi,
The code sample was moved to the following place:
https://reference.aspose.com/words/net/aspose.words.fields/fieldhyperlink/
However, I’m not sure it will help because, as far as I remember, TOC links are not represented by HYPERLINK fields. Could you please attach your document so that we can experiment with it?
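If the TOC links cannot be changed inside the Word document, one possible fallback is to fix them in the generated HTML instead. The sketch below is an assumption-laden illustration, not an Aspose.Words API: it assumes the exported HTML contains in-document links of the form href="#name", and that a map from each anchor name to the HTML page containing it has already been built while splitting. The class name, the anchor names, and the file names are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns in-document links into links to the split pages.
class TocLinkRewriter {
    // Matches href="#anchor" and captures the anchor name.
    private static final Pattern LOCAL_HREF = Pattern.compile("href=\"#([^\"]+)\"");

    static String rewrite(String html, Map<String, String> anchorToFile) {
        Matcher m = LOCAL_HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String anchor = m.group(1);
            String file = anchorToFile.get(anchor);
            // Anchors with no known target page are left unchanged.
            String replacement = (file == null)
                    ? m.group()
                    : "href=\"" + file + "#" + anchor + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, with a map entry from "_Toc1" to "page2.html", a link written as href="#_Toc1" becomes href="page2.html#_Toc1". Whether the exported TOC entries really take this form depends on the document and the export settings, so it should be checked against actual output first.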

Thanks, I am looking into it.

Could you please tell me your email address so that I can send you the document?

I would prefer email because of the size of the document and its sensitive nature.

Thanks

Replied via a private message.

Did you receive my last email?

Yes, thanks. I replied to it.

Hi Molecule,

I am looking for the Java code that breaks the document into multiple pages according to its page breaks.

Could you please forward the code to my email address?

My mail id is ruby123@gmail.com

Thanks in advance.

Hi Ruby,
Thanks for your inquiry. Please check my reply in the following thread. I hope it helps you.
https://forum.aspose.com/t/56856
Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.