Get the Base64 content of each paragraph from a Word document

Karthik_Test_account · August 11, 2023, 1:21pm

Hi Team,

In our project, we perform a snapshot operation on a Word document: we load the entire Word document into an Aspose Document, split it by paragraph, save each paragraph's content to a stream, and convert that stream to a Base64 string. This Base64 string is then converted to an image on the server side (Angular/HTML). The problem is that the resulting image contains a lot of empty space. Please see below for details.

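Code to convert the document to a Base64 string:

using System;
using System.IO;
using Aspose.Words;

// Placeholder paths; substitute the real source and temp document locations.
string sourceDocPath = "<Source Doc folder>";
string tempDocPath = "<Temp Doc folder>";

var tempSourceDoc = new Document(sourceDocPath);
var tempTargetDoc = new Document(tempDocPath);

DocumentBuilder tempSourceDocBuilder = new DocumentBuilder(tempSourceDoc);
DocumentBuilder tempTargetDocBuilder = new DocumentBuilder(tempTargetDoc);

tempSourceDocBuilder.Writeln("Source Para 1");
tempTargetDocBuilder.Writeln("target Para 1");

tempSourceDoc.UpdateFields();
tempTargetDoc.UpdateFields();

// Record the differences between the two documents as revisions in the source document.
tempSourceDoc.Compare(tempTargetDoc, "automated", DateTime.Now);

// Saving as JPEG renders a whole page, including any blank area below the text.
using var imageStream = new MemoryStream();
tempSourceDoc.Save(imageStream, SaveFormat.Jpeg);

// The stream holds JPEG bytes (not a PDF), so name the buffer accordingly.
byte[] imageBytes = imageStream.ToArray();
var imageContent = Convert.ToBase64String(imageBytes);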

The resulting image looks like this:
SampleImage.GetJpegPageRange.jpeg (73.1 KB)

In the above image you can see that an entire page is allocated for a single paragraph. I am not sure if this is feasible, but is it possible to capture a snapshot (Base64 string) of only the paragraph content, without the empty space? If yes, please guide me with the code.

Thanks,
Karthikeyan

Konstantin.Kornilov · August 11, 2023, 7:37pm

@Karthik_Test_account This is expected behavior. The Document.Save method renders the entire page to the image. As an easy workaround, you could reduce the page size to the desired values via the DocumentBuilder.PageSetup property. As a more sophisticated workaround, you could determine the paragraph's bounding box with the LayoutEnumerator class and then crop the page image to that box.
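The second workaround could look roughly like the sketch below. It is an illustration rather than an official sample, and it rests on several assumptions: the paragraph is in the main body and fits on one page in one column, System.Drawing (System.Drawing.Common on .NET Core) is available for cropping, and the Aspose.Words version is recent enough to have ImageSaveOptions.PageSet. The class name ParagraphSnapshot, the method ToBase64, and the temporary bookmark are hypothetical helpers introduced here. The temporary bookmark is used because LayoutCollector.GetEntity on a Paragraph returns the paragraph break span (that is, the last line), so a BookmarkStart at the beginning is a way to reach the first line.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using Aspose.Words;
using Aspose.Words.Layout;
using Aspose.Words.Saving;

// Hypothetical helper, not part of Aspose.Words.
static class ParagraphSnapshot
{
    // Returns a Base64-encoded JPEG of the area occupied by one paragraph.
    // Assumes the paragraph is in the main body, on one page, in one column.
    public static string ToBase64(Document doc, Paragraph para, float dpi = 96f)
    {
        // Temporary bookmark marking the start of the paragraph.
        string name = "_snap_" + Guid.NewGuid().ToString("N");
        var start = new BookmarkStart(doc, name);
        var end = new BookmarkEnd(doc, name);
        para.PrependChild(start);
        para.AppendChild(end);

        // Create the collector first, then (re)build the layout so it records the mapping.
        var collector = new LayoutCollector(doc);
        doc.UpdatePageLayout();
        var enumerator = new LayoutEnumerator(doc);

        // First line: parent line of the bookmark start.
        RectangleF first = LineBox(enumerator, collector.GetEntity(start));
        int pageIndex = enumerator.PageIndex; // 1-based
        // Last line: parent line of the paragraph break span.
        RectangleF last = LineBox(enumerator, collector.GetEntity(para));
        RectangleF box = RectangleF.Union(first, last); // in points

        start.Remove();
        end.Remove();

        // Render only the page that contains the paragraph.
        var options = new ImageSaveOptions(SaveFormat.Png)
        {
            PageSet = new PageSet(pageIndex - 1), // zero-based
            Resolution = dpi
        };
        using var pageStream = new MemoryStream();
        doc.Save(pageStream, options);
        pageStream.Position = 0;

        using var page = new Bitmap(pageStream);
        float scale = dpi / 72f; // points to pixels
        Rectangle crop = Rectangle.Ceiling(new RectangleF(
            box.X * scale, box.Y * scale, box.Width * scale, box.Height * scale));
        crop.Intersect(new Rectangle(0, 0, page.Width, page.Height));

        using Bitmap cropped = page.Clone(crop, page.PixelFormat);
        using var result = new MemoryStream();
        cropped.Save(result, ImageFormat.Jpeg);
        return Convert.ToBase64String(result.ToArray());
    }

    // Positions the enumerator on the entity and climbs to its containing line.
    private static RectangleF LineBox(LayoutEnumerator enumerator, object entity)
    {
        enumerator.Current = entity;
        while (enumerator.Type != LayoutEntityType.Line && enumerator.MoveParent()) { }
        return enumerator.Rectangle;
    }
}
```

With the code from the question, a call might be `ParagraphSnapshot.ToBase64(tempSourceDoc, tempSourceDoc.FirstSection.Body.FirstParagraph)`. The crop is as wide as the longest of the two measured lines, so a little padding may be worth adding if the middle lines of a long paragraph are wider.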

Karthik_Test_account · August 14, 2023, 12:52pm

Thank you @Konstantin.Kornilov.

Is there a way to read the content of Primary Header alone and load it into a Document and take a snapshot (base 64 string) of that?

Also, is there a way to get the entire content, word count, or content length of a document from Aspose.Words?

Konstantin.Kornilov · August 14, 2023, 10:04pm

@Karthik_Test_account Unfortunately, it is not possible to render the header alone to an image. But you could render the page containing the header to an image, get the header's bounding box with the layout enumerator, and crop the image accordingly.
You can get the document's textual content with the Document.GetText() method. You could also use Document.BuiltInDocumentProperties to get the number of characters, words, lines, etc. Note that you may need to call the Document.UpdateWordCount() method first if you have updated the document content.
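For the text and statistics part, a minimal sketch might look like the following. The file name input.docx is a placeholder, and reading the primary header's text through HeadersFooters is an addition for illustration that was not mentioned in the reply above.

```csharp
using System;
using Aspose.Words;

// Placeholder path; substitute a real document.
var doc = new Document("input.docx");

// Refresh the statistics in case the content changed after loading.
doc.UpdateWordCount();

// Full text of the document, including control characters such as paragraph breaks.
string text = doc.GetText();
Console.WriteLine($"Text length: {text.Length}");

Console.WriteLine($"Words: {doc.BuiltInDocumentProperties.Words}");
Console.WriteLine($"Characters: {doc.BuiltInDocumentProperties.Characters}");

// Text of the primary header of the first section; the indexer returns null when absent.
HeaderFooter header = doc.FirstSection.HeadersFooters[HeaderFooterType.HeaderPrimary];
if (header != null)
    Console.WriteLine($"Primary header text: {header.GetText().Trim()}");
```

GetText() returns the raw text with control characters included, so text.Length will generally differ from the Characters property.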