Aspose.PDF for Java takes a long time for 4k replacements

muthukrishnanm · August 9, 2017, 5:04am

We are using Aspose.PDF for Java for text replacements in PDF files. For a modest 4,600 replacements in a single PDF, the library takes nearly 5 minutes.

I am attaching the code sample and the font here.
aspose-test.zip (59.8 KB)
PDF is available at https://app.box.com/s/atktip2li1kxfwlleku5kbukdoy89uk3
OS: Ubuntu 12.04
Java : Java 1.8

Text replacement should not take this long. Please let me know what is wrong here.

Thanks
Muthu

asad.ali · August 9, 2017, 12:06pm

@muthukrishnanm

Thanks for contacting support.

I have tested the scenario in an environment (i.e. Ubuntu 15.04, JDK 1.8, Eclipse Oxygen Release (4.7.0), RAM 4GB, Java Heap Space 1024M) with Aspose.Pdf for Java 17.7 and observed that the code consumed a lot of memory, which resulted in a java.lang.OutOfMemoryError.

I have logged an investigation ticket as PDFJAVA-36986 in our issue tracking system with all relevant details of the issue. We will further investigate the reasons behind this and keep you updated with the status. Please be patient and spare us a little time.

Since I was unable to execute the code without the error, I could not observe the delay in execution. Would you please share the amount of memory installed in your system and the Java heap size as well, so that we can test the scenario again in our environment and address it accordingly?

We are sorry for the inconvenience.

muthukrishnanm · August 9, 2017, 3:15pm

Thank you @asad.ali for looking into this.

This is my Xmx parameter: -Xmx4508m

The size of the system main memory is 6GB.
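
For anyone who needs to report the same details, here is a minimal sketch that prints the heap limits the JVM is actually running with. It uses only the standard library; the class name HeapReport and its helper methods are made up for illustration and are not part of the attached sample or of Aspose.PDF.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper (not from the attached sample): reports JVM heap settings.
class HeapReport {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap the JVM will attempt to use; this is what -Xmx controls. */
    static long maxHeapMiB() {
        return toMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Max heap (MiB):          " + maxHeapMiB());
        System.out.println("Committed heap (MiB):    " + toMiB(rt.totalMemory()));
        System.out.println("Free in committed (MiB): " + toMiB(rt.freeMemory()));
        // Lists flags such as -Xmx4508m as they were passed to this JVM.
        System.out.println("JVM arguments: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Running it with the same flags as the application (for example `java -Xmx4508m HeapReport` after compiling) should show whether the -Xmx value is really being applied.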

asad.ali · August 9, 2017, 9:00pm

@muthukrishnanm

Thanks for sharing the requested information.

We will test the scenario using the specified parameters and let you know our findings. Please be patient and spare us a little time.

muthukrishnanm · August 10, 2017, 4:06am

Thank you @asad.ali

But this is very important for us, and we would like to get it fixed as soon as possible.

asad.ali · August 10, 2017, 12:13pm

@muthukrishnanm

Thanks for contacting support.

I have tested the scenario with the specified Xmx parameter in the Linux (Ubuntu) environment, but I still faced an OOM (OutOfMemory) error. Since the error prevented me from observing the delay in execution, I tried a Windows environment (i.e. Windows 10 EN, Eclipse Neon, 8GB RAM, JRE 1.8), where I was able to observe that the execution took longer than expected.

Therefore, I have logged a performance issue as PDFJAVA-36987 in our issue tracking system with my findings for the scenario.

Now that both issues have been logged in our system, the relevant team will investigate them as per their schedule. As soon as we have definite updates regarding their resolution, we will let you know. Please be patient and spare us a little time.

We are sorry for the inconvenience.

muthukrishnanm · March 1, 2018, 9:20am

Hi @asad.ali

Any updates on this issue? It's been in the same state for a very long time.

Thank you
Muthu

asad.ali · March 1, 2018, 6:57pm

@muthukrishnanm

Thanks for your inquiry.

Our product team has investigated the earlier logged issue PDFJAVA-36986. As per their findings, the API was taking a long time for text replacement because of the huge number of connections between PDF objects that are created when the requested text is found across the 216 pages.

If you replace text page by page and save the intermediate result to a temporary file, the execution time can be reduced. Please change the methods substituteTokens and getTextFragmentCollection as follows:

// Imports needed by the two methods below.
import com.aspose.pdf.Document;
import com.aspose.pdf.Page;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextSearchOptions;
import java.io.InputStream;
import java.io.OutputStream;

// FONT, dataDir and version are fields defined elsewhere in the original sample.

private static TextFragmentCollection getTextFragmentCollection(Page page) {
    // Search this page only for 64-character lowercase or 40-character uppercase hex tokens.
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("[0-9a-f]{64}+|[0-9A-F]{40}+");
    // true = treat the search phrase as a regular expression
    TextSearchOptions textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
    page.accept(textFragmentAbsorber);
    return textFragmentAbsorber.getTextFragments();
}

private static void substituteTokens(String clearText, InputStream inputStream, OutputStream outputStream) {
    Document pdfDocument = new Document(inputStream);
    long totalStart = System.currentTimeMillis();

    // Pages are numbered from 1.
    for (int i = 1; i <= pdfDocument.getPages().size(); i++) {
        Page p = pdfDocument.getPages().get_Item(i);
        TextFragmentCollection textFragmentCollection = getTextFragmentCollection(p);
        for (TextFragment textFragment : textFragmentCollection) {
            // System.out.println(p.getNumber());
            // System.out.println(textFragment.getText());
            textFragment.getTextState().setFont(FONT);
            textFragment.setText(clearText);
        }
        if (i % 100 == 0) { // save a temporary result on every 100th page
            // Saving and reopening by path avoids leaving file streams unclosed.
            String tempPath = dataDir + "outputReplaced_temp" + version + ".pdf";
            pdfDocument.save(tempPath);
            pdfDocument = new Document(tempPath);
            System.out.println("The document is partially processed");
        }
    }
    long totalEnd = System.currentTimeMillis();
    System.out.println("Total time taken for all replacements: " + (totalEnd - totalStart) + " milliseconds");
    pdfDocument.save(outputStream);
}

Please note that the conversion can take 20 minutes with the code snippet you shared with us, whereas it takes about 8 minutes with the suggested snippet. Please try the suggested snippet with the latest version of the API (i.e. Aspose.PDF for Java 18.2), and if you face any issue, please feel free to let us know.
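
To make the save schedule of that loop concrete, the following is a small standalone sketch (standard library only, no Aspose calls) that lists the pages at which the `i % 100 == 0` check would trigger a temporary save. The class and method names are hypothetical and only illustrate the arithmetic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the checkpoint schedule; not part of the Aspose API.
class CheckpointPlan {

    /** Returns the 1-based page numbers after which a temporary save would occur. */
    static List<Integer> checkpointPages(int pageCount, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<Integer> pages = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            if (i % interval == 0) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // 216 pages with a save on every 100th page
        System.out.println(checkpointPages(216, 100)); // prints [100, 200]
    }
}
```

For the 216-page document this gives two intermediate saves, and pages 201 to 216 are written only by the final save. A smaller interval would mean more save and reload cycles; whether that helps or hurts overall time is something to measure rather than assume.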

asad.ali · August 19, 2018, 8:04pm

@muthukrishnanm

Thanks for your patience.

Regarding the ticket logged above: the conversion does not hang with the parameter -Xmx4508m, but it can hang if less than 1 GB of memory is available, depending on the Java garbage collector configuration. To avoid this issue, we advise replacing text page by page and temporarily saving the document. The following code runs about twice as fast:

// Uses Document, Page, TextFragment, TextFragmentAbsorber, TextFragmentCollection and
// TextSearchOptions from com.aspose.pdf, plus java.io.InputStream and java.io.OutputStream.
// FONT and dataDir are fields defined elsewhere in the original sample.

private static TextFragmentCollection getTextFragmentCollection(Page page) {
    // Search this page only for 64-character lowercase or 40-character uppercase hex tokens.
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("[0-9a-f]{64}+|[0-9A-F]{40}+");
    // true = treat the search phrase as a regular expression
    TextSearchOptions textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
    page.accept(textFragmentAbsorber);
    return textFragmentAbsorber.getTextFragments();
}

private static void substituteTokens(String clearText, InputStream inputStream, OutputStream outputStream) {
    Document pdfDocument = new Document(inputStream);
    long totalStart = System.currentTimeMillis();

    // Pages are numbered from 1.
    for (int i = 1; i <= pdfDocument.getPages().size(); i++) {
        Page p = pdfDocument.getPages().get_Item(i);
        TextFragmentCollection textFragmentCollection = getTextFragmentCollection(p);
        for (TextFragment textFragment : textFragmentCollection) {
            // System.out.println(p.getNumber());
            // System.out.println(textFragment.getText());
            textFragment.getTextState().setFont(FONT);
            textFragment.setText(clearText);
        }
        if (i % 100 == 0) { // save a temporary result on every 100th page
            // Saving and reopening by path avoids leaving file streams unclosed.
            String tempPath = dataDir + "outputReplaced_temp.pdf";
            pdfDocument.save(tempPath);
            pdfDocument = new Document(tempPath);
            System.out.println("The document is partially processed (100 pages)");
        }
    }
    long totalEnd = System.currentTimeMillis();
    System.out.println("Total time taken for all replacements: " + (totalEnd - totalStart) + " milliseconds");
    pdfDocument.save(outputStream);
}

Please use the above code snippet with Aspose.PDF for Java 18.7, and in case you face any issue, please let us know.
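
As a side note on the search pattern used in getTextFragmentCollection, here is a minimal sketch that checks what the expression accepts when evaluated with the standard java.util.regex engine. This assumes the absorber interprets the pattern the same way, which has not been verified here; the class TokenPattern and its helpers are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical check of the token pattern using java.util.regex only.
class TokenPattern {

    // Same expression as in the snippet: 64 lowercase hex chars or 40 uppercase hex chars.
    static final Pattern TOKEN = Pattern.compile("[0-9a-f]{64}+|[0-9A-F]{40}+");

    /** True if the whole string is a token according to the pattern. */
    static boolean isToken(String s) {
        return TOKEN.matcher(s).matches();
    }

    /** Builds a string of n copies of c (String.repeat is not available on Java 8). */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(isToken(repeat('a', 64))); // 64 lowercase hex characters
        System.out.println(isToken(repeat('A', 40))); // 40 uppercase hex characters
        System.out.println(isToken(repeat('a', 40))); // lowercase at length 40
    }
}
```

Under that assumption, only the two exact shapes are accepted as whole tokens: a 40-character lowercase string, for instance, is rejected because lowercase letters are allowed only in the 64-character alternative.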