Split a large file

timg · December 11, 2006, 10:03am

I need to split a large PDF file into individual pages, then extract the text from each page to parse information from it. (Each page contains information for one employee.)

I am using PdfFileEditor.SplitToPages(), and on a 5-page test document it takes about 30 seconds to split the pages. The document I need to split is about 7000 pages. Can I assume that the time to split the pages will increase linearly with the size of the document? That is, if 5 pages take 30 seconds, will 7000 pages take about 42000 seconds (roughly 11.7 hours)? If so, is there a quicker way to split the document into separate pages?

Also, once the pages are split, I am parsing through the text to find an employee number (in the form 999-9999). Is using PdfExtractor.ExtractText() and GetText() along with a RegEx search the best/quickest way to do this?

Thanks for your help.

GeorgieYuan · December 11, 2006, 5:54pm

Hi timg,

PdfExtractor.ExtractText() and GetText() have no built-in RegEx search. You can only search the text that GetText() outputs yourself.
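As a rough illustration of searching the extracted text yourself, here is a minimal sketch in Python (not .NET) of a regular expression for the 999-9999 form. The names `EMPLOYEE_NUMBER` and `find_employee_number` are made up for this example, and the rule that the number must not sit inside a longer run of digits is an assumption about the data, not something the library provides.

```python
import re

# Three digits, a hyphen, four digits. The lookarounds reject matches that
# are embedded in a longer digit run (an assumption about the page text).
EMPLOYEE_NUMBER = re.compile(r"(?<!\d)\d{3}-\d{4}(?!\d)")


def find_employee_number(page_text):
    """Return the first employee number found in page_text, or None."""
    match = EMPLOYEE_NUMBER.search(page_text)
    return match.group(0) if match else None
```

Note that anything else on the page written in the same shape (a 7-digit phone number, for instance) would also match, so it may be worth anchoring the search to a nearby label if the pages have one.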

Also, could you please attach your PDF file here or send it to georgie via private message?

We can then test it and see whether there are ways to improve the splitting performance.

timg · December 18, 2006, 4:16pm

Unfortunately, the document contains confidential information, so I cannot send it to you. However, I did some testing of my own and found that it was MUCH faster for me to use the PdfFileEditor.Extract() function to break the document into chunks of 100 pages, and then split those chunks into individual pages using PdfFileEditor.SplitToPages().

I first did it using PdfFileEditor.SplitToPages() without breaking the document into chunks. I let this run for over half an hour and it never finished. I tried it again, this time processing 100 pages at a time with PdfFileEditor.Extract(), then splitting those 100 pages up with SplitToPages(). With this method, the whole 1000-page document was processed in under 6 minutes, much faster than using SplitToPages() by itself.

I also noticed that it was faster to use PdfFileEditor.Extract() to extract 100 pages ten times (for a total of 1000 pages) than it was to use the same function to extract all 1000 pages at once.
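In case it is useful, the chunking logic looks roughly like the sketch below. It is written in Python rather than .NET, and the library calls are deliberately left out: `extract_chunk` and `split_chunk` are placeholder callbacks standing in for Extract() and SplitToPages(), and all names here are invented for the example. Only the page-range arithmetic is shown.

```python
def chunk_ranges(total_pages, chunk_size=100):
    """Yield inclusive, 1-based (start, end) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size - 1, total_pages)


def split_in_chunks(total_pages, extract_chunk, split_chunk, chunk_size=100):
    """Extract the document chunk by chunk, then split each chunk into pages.

    extract_chunk(start, end) and split_chunk(chunk) are placeholders for
    whatever the PDF library provides; they are not real library calls.
    """
    pages = []
    for start, end in chunk_ranges(total_pages, chunk_size):
        chunk = extract_chunk(start, end)
        pages.extend(split_chunk(chunk))
    return pages
```

For a 1000-page document with the default chunk size, `chunk_ranges` produces ten ranges, from (1, 100) through (901, 1000). The chunk size of 100 is just what I happened to test with; other sizes may work better for other documents.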

Hopefully this will help someone.