IndexOutOfRange Exception thrown from replacing text on a .docx converted to .pdf file

zywave · August 10, 2017, 1:22pm

Hello, I’m running into an issue replacing text in a PDF that was generated from an original .docx file, using Aspose.Words 17.8 and Aspose.PDF 17.8 for .NET. I’ve narrowed the issue down to a very small console application that reproduces it, which I will attach. Essentially, I am reading in a very simple, unremarkable .docx file created in Word 2016 that contains only ReplaceMe1 through ReplaceMe4, each on a new line.
Here is the summary of my application:

I am reading that .docx document into an Aspose.Words.Document
Immediately calling .Save with SaveFormat.Pdf into a stream
Using that Stream to create an Aspose.Pdf.Document
Looping through a Dictionary of strings replacing the key with the value
Depending on some conditions, attempting to replace “ReplaceMe3” will cause an IndexOutOfRange Exception to be thrown: See below
image.png (16.9 KB)
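
The steps above can be sketched roughly as follows. This is a minimal sketch, not the attached console application: the file names and replacement values are hypothetical, and the use of PdfContentEditor from Aspose.Pdf.Facades for the ReplaceText call is an assumption, since the attachment's code is not shown in the post.

```csharp
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main()
    {
        // Hypothetical replacement values; the post only says that "ReplaceMe1"
        // must map to a character such as 'j', 'c', or 'q' to trigger the error.
        var replacements = new Dictionary<string, string>
        {
            { "ReplaceMe1", "j" },
            { "ReplaceMe2", "value2" },
            { "ReplaceMe3", "value3" },
            { "ReplaceMe4", "value4" }
        };

        // "input.docx" and "output.pdf" are placeholder file names.
        using (var fileStream = File.OpenRead("input.docx"))
        using (var pdfStream = new MemoryStream())
        {
            // Read the .docx and immediately save it as PDF into a stream.
            var wordDocument = new Aspose.Words.Document(fileStream);
            wordDocument.Save(pdfStream, Aspose.Words.SaveFormat.Pdf);
            pdfStream.Position = 0;

            // Create the Aspose.Pdf document from that stream.
            var pdfDocument = new Aspose.Pdf.Document(pdfStream);

            // Assumption: ReplaceText is called through PdfContentEditor.
            var editor = new Aspose.Pdf.Facades.PdfContentEditor();
            editor.BindPdf(pdfDocument);

            // Replace each key with its value.
            foreach (var pair in replacements)
            {
                editor.ReplaceText(pair.Key, pair.Value);
            }

            editor.Save("output.pdf");
        }
    }
}
```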

Here is where it gets a little weird, this exception only gets thrown in certain conditions.

“ReplaceMe1” and “ReplaceMe2” must both be replaced with values
The error will only occur when “ReplaceMe1” is being replaced with certain characters, such as a ‘j’, ‘c’, or ‘q’.

If you only try to replace the first two keys under the conditions above, the file gets generated, but the PDF looks like this: Broken PDF Image.png (21.1 KB)

If I save the .docx as a PDF inside Microsoft Word, I do not run into this issue and everything merges successfully. So it seems that something in the PDF produced by Aspose.Words causes Aspose.PDF’s ReplaceText to interpret the content differently than it does for a PDF saved from Microsoft Word.

Here is a zip of all my test files and the console application: Aspose Test.zip (46.3 KB)

Thank you for your help in looking into this.

tilal.ahmad · August 10, 2017, 7:06pm

@zywave,

Thanks for your query. Please amend your code as follows; it should resolve the issue.

....
var wordDocument = new Document(fileStream);
PdfSaveOptions options = new PdfSaveOptions();
options.SaveFormat = SaveFormat.Pdf;
// Export the document's logical structure during the PDF conversion.
options.ExportDocumentStructure = true;
wordDocument.Save(pdfConverstionStream, options);
.....

zywave · August 10, 2017, 7:50pm

Hi @tilal.ahmad,

That seemed to do the trick, thank you for your help. I see in the documentation (PdfSaveOptions.ExportDocumentStructure | Aspose.Words for .NET) that enabling this property can significantly increase memory consumption. Any comment on how large that memory increase is?

tilal.ahmad · August 11, 2017, 6:13am

@zywave

Thanks for your feedback. It is good to know that the suggestion fixed your issue. Please note that the ExportDocumentStructure property, as its name suggests, exports the logical structure of all the document elements, which is required for other PDF standards, e.g. Tagged PDF. So it takes more memory during processing and produces a larger output file as well.

However, memory consumption depends on the input file size, and you can optimize the output file size with the following options.

options.OptimizeOutput = true;
options.TextCompression = PdfTextCompression.Flate;