Doc2HTML - Specifying number of pages to convert

I am building a preview page for Word document results on our search page. Performance is very important for us since we cannot afford to pre-cache previews. It will have to be done on the fly.

As a quick prototype, I used the evaluation version of Aspose.Words to build a preview handler. With that version the product gave good fidelity and great performance for large documents, but it rendered a red evaluation notice at the beginning and end of the trimmed HTML.

When we used a licensed copy of Aspose.Words, this conversion took significantly longer for larger documents. We are only interested in the first few pages for the preview. I could not find an API to specify the number of pages to be converted to HTML.

Can you expose the evaluation behavior as a feature for licensed users by taking the number of pages (or size in bytes) as input to render trimmed HTML without the evaluation notice?

Thanks,
Nameet



Hello!

Thank you for your interest in Aspose.Words.

I’m sorry, I don’t quite understand your question. In evaluation mode Aspose.Words applies some restrictions, for obvious reasons. But when you apply a license, it processes the entire file. What kind of parameterization do you mean?

Regards,

I want to restrict the number of pages that are converted to HTML even when using the license. I want to be able to specify the number of the pages to be rendered.

The evaluation mode works well for us for large documents since Aspose.Words does a stream read until an arbitrary point in the document and then stops reading further. I am not interested in rendering all ~80 pages of the document in HTML. The intent is to give the user a quick preview of the first few pages.

For obvious reasons, we do not want to use the Evaluation mode beyond the prototype.

The question is, is it possible for you to load and convert only a certain specified number of pages (or size) to HTML from the input document?

Please give me a call between 9 am - 5 pm EST if you need more details.

Aspose.Words doesn’t perform rendering before it outputs the HTML. It works with the logical structure of a document and has no information about how the document would be laid out into pages. You can cut the document at any point you wish before conversion. For that purpose you can count paragraphs, characters, or whatever you like.
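To illustrate the counting idea, here is a minimal sketch in Python. It models a document as a plain list of paragraph strings and does not call the Aspose.Words API; the function name `truncate_paragraphs` and both limits are hypothetical. With a real document the same loop would run over the paragraph nodes of the loaded document, and everything after the cut point would be removed before saving to HTML.

```python
from typing import Iterable, List, Optional


def truncate_paragraphs(
    paragraphs: Iterable[str],
    max_paragraphs: Optional[int] = None,
    max_chars: Optional[int] = None,
) -> List[str]:
    """Keep leading paragraphs until either limit is reached.

    Hypothetical helper: paragraphs are plain strings here, standing in
    for whatever paragraph objects the document library exposes.
    """
    kept: List[str] = []
    total_chars = 0
    for text in paragraphs:
        # Stop once the paragraph budget is used up.
        if max_paragraphs is not None and len(kept) >= max_paragraphs:
            break
        # Stop before exceeding the character budget, but always keep
        # the first paragraph so a character-limited preview is never empty.
        if max_chars is not None and kept and total_chars + len(text) > max_chars:
            break
        kept.append(text)
        total_chars += len(text)
    return kept


if __name__ == "__main__":
    doc = [f"Paragraph {i}" for i in range(1, 201)]
    preview = truncate_paragraphs(doc, max_paragraphs=50)
    print(len(preview), preview[-1])
```

Note that this only shortens the conversion step. It assumes the whole document has already been loaded, so it does not by itself reduce load time or memory use.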

Regards,

I found another thread based on your tip that might help me with this task:

https://forum.aspose.com/t/108270

Thanks for your help!

Hello!

I have looked at the thread you referenced. They split documents at “Heading 1” paragraphs, so each top-level heading and its associated content forms a separate document. Your criteria could be different; they might depend on the document structure or on other specifics. If you show us a typical sample document, we can help you find a suitable approach.

Regards,

I need a trial license to test performance of this API for very large documents. Can you please create a temporary evaluation license for me to use?

I want to ensure that using new Document(Stream) and then applying an algorithm similar to the one I linked to in my previous post doesn’t cause the entire document to be loaded into memory, and that I can stop after, say, the first 50 paragraphs.

Thanks

Hi

Thanks for your request. Please see the following link to learn how to get a temporary license.
https://purchase.aspose.com/temporary-license

Best regards.

Hello!

I’d like to make a clarification. Any Aspose.Words API call that loads a document causes the whole document to be loaded into memory. You cannot avoid this. Of course, we have thought about the problem of very large documents, but given the file formats we support, loading documents in smaller parts could be difficult. The best workaround is to avoid very large documents altogether, since even MS Word performs poorly on them and may hang.

You wrote something about ~80 pages. That’s okay, not so huge. We don’t recommend working with documents larger than 10 megabytes. You can try 20 but this will depend on your computer’s hardware configuration and on other simultaneously executed tasks.

Thank you for understanding.

Regards,

Like I mentioned before, we are trying to build a preview functionality for our Search results. We are indexing TBs of documents to make them available for searching and there is no way to control the size of documents that come up (rather there is no choice, everything needs to be searchable).

Many of these documents end up being well over 20 MB, and to save the user’s time (downloading the document to the local machine and then opening it in Word is too slow) we want to make a quick HTML preview available that can be shown within a couple of seconds of clicking the result.

Your evaluation version turned out to be a perfect solution since it cuts the stream at source and loads only a small part of the document.

Can you think of a way to expose this as a feature for the full version?

Hi again!

I see what you are developing and where the problem arises. Thank you.

Most of the supported formats are designed in such a way that we must read the entire document into memory. We could try implementing something like Document.LoadPreview, or control the loading process in some other way. But this needs some investigation. Probably not all formats would gain a significant benefit in performance and memory consumption.

If you are evaluating, you can measure how Aspose.Words with no evaluation restrictions (under a temporary license) performs on documents much larger than 20 megabytes. Looking at some test results, we would be able to work out suggestions together. Please share your findings with us.

Regards,

Here are the test results from the prototype we did last month -

Sample document characteristics:

File size: 20 MB

No. of Pages: 176

Images: Almost every page contains a large image within a table layout.

Doc2HTML conversion time (includes load time + convert time; we did not record a breakdown):

With evaluation restrictions: consistently less than 2 seconds.

With temporary license (no restriction): consistently over 12 seconds.

Let me know if this helps and what your suggested next steps are.

We currently do not have a temporary license to work with.

Hello!

Thank you for your measurements. I see the difference is at least a factor of six. You can try loading the document as usual and cutting it before you save it as HTML. Please try this and we’ll see whether the performance is acceptable.
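Since the earlier figures combined load and convert time, it may help to time each phase separately when trying this. Below is a minimal sketch in Python; `time_phases` and the three callables passed to it are hypothetical stand-ins, not Aspose.Words calls. In a real test `load` would open the document, `cut` would drop everything after the preview portion, and `save` would write the HTML.

```python
import time
from typing import Any, Callable, Dict


def time_phases(
    load: Callable[[], Any],
    cut: Callable[[Any], Any],
    save: Callable[[Any], Any],
) -> Dict[str, float]:
    """Run load -> cut -> save once and return the seconds spent in each phase."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    doc = load()
    timings["load"] = time.perf_counter() - start

    start = time.perf_counter()
    trimmed = cut(doc)
    timings["cut"] = time.perf_counter() - start

    start = time.perf_counter()
    save(trimmed)
    timings["save"] = time.perf_counter() - start

    timings["total"] = timings["load"] + timings["cut"] + timings["save"]
    return timings


if __name__ == "__main__":
    # Stand-in phases operating on a list of strings (assumption: a real
    # test would substitute the actual document load, trim and HTML save).
    result = time_phases(
        load=lambda: [f"Paragraph {i}" for i in range(100_000)],
        cut=lambda doc: doc[:50],
        save=lambda doc: "\n".join(f"<p>{p}</p>" for p in doc),
    )
    for phase, seconds in result.items():
        print(f"{phase}: {seconds:.4f} s")
```

If most of the time turns out to be in the load phase, cutting before save will not help much; if it is in the save phase, the cut should recover most of the difference.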

To obtain a new temporary license, you can ask the Purchase team. That’s usually not a problem.

Regards,