PDF text replacement and HTML conversion taking a long time

srikanth03565 · April 1, 2016, 4:17am

Hi,

Aspose PDF to HTML conversion and text replacement are taking too much time: more than 1 minute in many cases.

Please see attached code and file.

public Document loadDocument(String documentPath, String name) {
    Document docObj = new Document(documentPath);
    DocumentInfo docInfo = docObj.getInfo();
    docInfo.setAuthor(clientName);
    docInfo.setCreationDate(new java.util.Date());
    // docInfo.addItem("Producer", name);
    // docInfo.addItem("Creator", name);
    docInfo.setKeywords("");
    docInfo.setModDate(new java.util.Date());
    docInfo.setSubject(name);
    docInfo.setTitle(name);
    return docObj;
}

public void doReplacement() {
    String fileName = "5.pdf";
    Document documentObj = loadDocument(fileName, "Srikanth");
    ArrayList<String> wordList = new ArrayList<String>();
    wordList.add("India");
    wordList.add("other");
    wordList.add("name");
    wordList.add("years");
    for (String word : wordList) {
        this.replaceDocumentByRegex(documentObj, word, "****");
    }
    this.replaceDocumentByRegex(documentObj, "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}", "****");
    this.replaceDocumentByRegex(documentObj, "^\\+(?:[0-9]\\ ?){6,14}[0-9]$", "****");
    convertToHtml(documentObj, "output.html");
}
public void replaceDocumentByRegex(Document documentObj, String Regex, String Replacement) {
    // Regex is the pattern to search for, e.g. one matching 1999-2000
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(Regex);
    // Set text search option to specify regular expression usage
    TextSearchOptions textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
    // Accept the absorber for all pages of the document
    documentObj.getPages().accept(textFragmentAbsorber);
    // Get the extracted text fragments into collection
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
    // Loop through the fragments
    for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
        // Update text and other properties
        textFragment.setText(Replacement);
        // textFragment.getTextState().setFont(com.aspose.pdf.FontRepository.findFont("Verdana"));
        // textFragment.getTextState().setFontSize(22);
        // textFragment.getTextState().setForegroundColor(com.aspose.pdf.Color.getBlue());
        // textFragment.getTextState().setBackgroundColor(com.aspose.pdf.Color.getGray());
    }
}

public void convertToHtml(Document documentObj, String outHtmlFile) {
    HtmlSaveOptions newOptions = new HtmlSaveOptions();
    // Enable option to embed all resources inside the HTML
    newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
    // This is just optimization for IE and can be omitted
    newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
    documentObj.save(outHtmlFile, newOptions);
}

//call above methods like
doReplacement();

I am attaching the PDF file. Also, we are unable to change PDF properties such as Producer and Creator.

Also, my email replacement is not working. I have used the same regex with Aspose.Words and it works as expected. Can you please suggest why my email regex is not working here?

srikanth03565 · April 1, 2016, 9:03am

Also, my email regex replacement is not working on PDF files.

It is not replacing anything; the same regex works as expected with Aspose.Words.

tilal.ahmad · April 4, 2016, 2:00am

Hi Srikanth,

Thanks for your inquiry. I have tested your sample on Windows 7 64-bit with 8 GB RAM using Aspose.Pdf for Java 11.3.0 and was unable to reproduce the performance issue; it takes almost 17 seconds. Please download and try the latest version of Aspose.Pdf for Java; hopefully it will resolve the issue.

Please note that, as per the Aspose.Pdf design, users cannot set the Producer and Creator properties.

Best Regards,

tilal.ahmad · April 4, 2016, 2:02am

Hi Srikanth,

srikanth03565:

Also, my email regex replacement is not working on PDF files.
It is not replacing anything; the same regex works as expected with Aspose.Words.

Thanks for your inquiry. Please use the following regex; it will help you find email IDs.

replaceDocumentByRegex(documentObj,"[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}", "****");
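
The difference between the two patterns can be checked outside Aspose.Pdf with plain java.util.regex. The sketch below is only an illustration: the sample address and the class and method names are made up, and it assumes Aspose's regex search treats character classes the same way the JDK does. The original pattern lists only upper-case letters, so it cannot match a lower-case address; the suggested one accepts both cases.

```java
import java.util.regex.Pattern;

class EmailRegexCheck {
    // Pattern from the original post: upper-case letters only.
    static final String ORIGINAL = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}";
    // Pattern suggested in the reply: both cases, top-level domain up to 6 letters.
    static final String SUGGESTED = "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}";

    /** Returns true if the regex matches somewhere inside the text. */
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String sample = "Contact: john.doe@example.com"; // made-up address
        System.out.println("original:  " + found(ORIGINAL, sample));
        System.out.println("suggested: " + found(SUGGESTED, sample));
    }
}
```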

Best Regards,

srikanth03565 · April 5, 2016, 12:43am

Hi,

Thanks for the quick reply.

I am already using the latest version of Aspose.Pdf. Moreover, 17 seconds is still a long time.

If I convert 100 PDF files one by one, it will take about 1700 seconds.

Can you suggest any way to preload the library, fonts, etc.?

One way to reduce this is to use a multi-threaded approach.

My second issue is that memory leaks occur after running 500 requests. I have allocated 512 MB of memory, and after 500 requests I get an out-of-memory error.

tilal.ahmad · April 5, 2016, 11:22pm

Hi Srikanth

srikanth03565:

I am already using the latest version of Aspose.Pdf. Moreover, 17 seconds is still a long time.

If I convert 100 PDF files one by one, it will take about 1700 seconds.

Can you suggest any way to preload the library, fonts, etc.?

One way to reduce this is to use a multi-threaded approach.

Please note that Aspose.Pdf processes files in memory, so performance depends on system resources and on the size and contents of the input file. Hence the processing time for 100 different PDF files will vary with each file's size and contents and with the available system resources.

Furthermore, I am afraid there is no option to preload the library, fonts, etc. You may try a multi-threaded approach for this purpose. Please note, however, that Aspose.Pdf is thread safe only as long as each thread works on a different document. If you manipulate a single document from different threads, the results will be unstable.
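
As a rough sketch of the one-document-per-thread rule, the helper below runs one job per input path on a fixed-size thread pool. The names ParallelPdfJobs and runAll are hypothetical, and the job passed in main is a stand-in: in real use it would load the PDF, run the replacements, save the HTML and close the document, all inside the same call so that no Document object is shared between threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelPdfJobs {
    /** Runs job once per path on a pool of the given size; results keep input order. */
    static <R> List<R> runAll(List<String> paths, Function<String, R> job, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<R> task = () -> job.apply(path); // one document per task
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get()); // waits for the task to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in job that only derives an output name from the input name.
        Function<String, String> job = path -> path.replace(".pdf", ".html");
        System.out.println(runAll(Arrays.asList("1.pdf", "2.pdf", "3.pdf"), job, 2));
    }
}
```

Since each document is processed in memory, the pool size also bounds how many documents are held at once, so it is worth keeping small on a 512 MB heap.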

srikanth03565:

My second issue is that memory leaks occur after running 500 requests. I have allocated 512 MB of memory, and after 500 requests I get an out-of-memory error.

Regarding the memory leak issue, you can use the MemoryCleaner object to clean up memory. After completing operations with an Aspose.Pdf object, close it with its close() or dispose() method and finally call the com.aspose.pdf.MemoryCleaner.clear() method. It clears Aspose.Pdf-specific instances and will hopefully enable more effective memory usage.

Please note that it is recommended to call this method only when there is a shortage of available memory. Here is sample code to check the memory status.

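Runtime rt = Runtime.getRuntime();
long max = rt.maxMemory() / 1048576;     // maximum heap the JVM may use, in MB
long total = rt.totalMemory() / 1048576; // heap currently reserved by the JVM, in MB
long free = rt.freeMemory() / 1048576;   // unused part of the reserved heap, in MB
long used = total - free;                // heap currently in use, in MB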
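
One possible way to act on these figures is to run the cleanup only when heap usage crosses a threshold. The sketch below is an assumption-laden illustration: MemoryGuard, usedFraction, cleanIfLow and the 0.8 threshold are all hypothetical, and the cleanup is passed in as a Runnable so the example runs without Aspose.Pdf on the classpath.

```java
class MemoryGuard {
    /** Fraction of the maximum heap currently in use. */
    static double usedFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    /** Runs cleanup only if heap usage is at or above threshold; returns whether it ran. */
    static boolean cleanIfLow(double threshold, Runnable cleanup) {
        if (usedFraction() >= threshold) {
            cleanup.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // With Aspose.Pdf available, the hook would call
        // com.aspose.pdf.MemoryCleaner.clear(), the method named in the reply above.
        boolean ran = cleanIfLow(0.8, () -> System.out.println("cleanup hook called"));
        System.out.println("used fraction = " + usedFraction() + ", cleaned = " + ran);
    }
}
```

Calling cleanIfLow after each document has been closed keeps the cleanup off the normal path and limits it to the low-memory case the reply describes.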

Please feel free to contact us for any further assistance.

Best Regards,