How to split a PDF into multiple PDFs at a particular keyword

rjc0916 · April 22, 2015, 8:42am

Hello Friends,

We are using Aspose products in our CRM application, and Aspose.Pdf for one particular requirement. We now need to read a whole PDF and split it into separate PDFs. For example, the PDF contains keywords that start with # (e.g. #Test Word). I need to find every word that starts with the # tag and split the content that follows each one into its own PDF. So if there are 5 occurrences of the # tag, we should get 5 different PDFs, each containing the data that follows its # keyword.

Can anyone help me with this? Any immediate help with detailed step-by-step code would be highly appreciated.

Thanks & Regards

Ashish Rajguru

rjc0916 · April 23, 2015, 12:59am

Hello Friends,

I have not heard anything back from you guys. Is it not possible to split a PDF at a particular keyword with Aspose.Pdf? I have gone through some examples of reading a PDF using the Document and TextFragmentAbsorber objects, but I do not fully understand how to use these objects, and they do not seem to fit my case. I would appreciate it if someone could help me out with this.

Thanks & Regards

Ashish Rajguru

codewarior · April 23, 2015, 10:03am

Hi Ashish,

Thanks for your interest in our APIs, and sorry for the delayed response.

Aspose.Pdf for .NET can search for particular TextFragments inside a PDF file, and you can also retrieve each fragment's formatting information as well as the number of the page on which it resides. It can also split a PDF into individual single-page documents. So for your requirement, you can search for the TextFragments/Segments inside the PDF file, get their page numbers, and then split or extract those specific pages into separate PDF files. For further details, please visit

rjc0916 · April 30, 2015, 1:41pm

Hello Nayyer,

I have gone through all the above articles as a reference for my functionality, but I am still stuck on the main logic. Following those articles, I can find a particular keyword in the PDF and the page on which it occurs, but when splitting I get only the page that contains the keyword. Let me explain my requirement. Suppose the keyword “Test” occurs on the 1st page and next occurs on the 5th page, with no occurrence on pages 2, 3 and 4. Then I need pages 1, 2, 3 and 4 to go into one PDF. In short, I need to keep adding pages to the current output PDF until the next occurrence of the keyword is found, and then start a new PDF. There must be some extra logic I need to write in the following part, but I cannot work out how to achieve it.

//open document
Document pdfDocument = new Document("F:/Delimiter.pdf");

//TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Sample");
//TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Loan", new TextSearchOptions(true));

//accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

//loop through the fragments and save the pages as a PDF file
Document newDocument = new Document();
foreach (TextFragment textFragment in textFragmentCollection)
{
    //In this loop I need to write some logic, which I cannot work out

    //Get the page on which the keyword was found
    Page pdfPage = pdfDocument.Pages[textFragment.Page.Number];
    newDocument.Pages.Add(pdfPage);
    newDocument.Save("output.pdf");
}

Any help would be highly appreciated.

Thanks & Regards

Ashish Rajguru

codewarior · May 3, 2015, 2:19pm

Hi Ashish,

Thanks for sharing the details.

As per my understanding, I would suggest that you search for instances of the keyword “Test”, get the page numbers on which it occurs, and then extract all the pages between consecutive occurrences. For more information, please visit Extract Array of PDF Pages Using File Paths (Facades)
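
The suggestion above could be sketched roughly as follows. This is only an illustration under some assumptions, not an official sample: the class name KeywordSplitter, the helper BuildRanges, and the output naming scheme are hypothetical, and it copies pages through the Document API rather than through the Facades approach mentioned in the linked article. Each keyword page starts a new output file that runs up to the page before the next keyword page; any pages before the first occurrence are skipped.

```csharp
using System.Collections.Generic;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class KeywordSplitter
{
    // Hypothetical helper: turns the page numbers on which the keyword occurs
    // into inclusive (First, Last) page ranges. Each range starts at a keyword
    // page and ends just before the next keyword page (or at the last page).
    public static List<(int First, int Last)> BuildRanges(IEnumerable<int> keywordPages, int pageCount)
    {
        // Several hits on one page must start only one range, so de-duplicate.
        List<int> starts = keywordPages.Distinct().OrderBy(p => p).ToList();
        var ranges = new List<(int First, int Last)>();
        for (int i = 0; i < starts.Count; i++)
        {
            int last = (i + 1 < starts.Count) ? starts[i + 1] - 1 : pageCount;
            ranges.Add((starts[i], last));
        }
        return ranges;
    }

    // Searches every page for the keyword and writes one PDF per range.
    // The "<prefix>_<n>.pdf" naming is an assumption made for this sketch.
    public static void Split(string inputPath, string keyword, string outputPrefix)
    {
        using (var source = new Document(inputPath))
        {
            var absorber = new TextFragmentAbsorber(keyword);
            source.Pages.Accept(absorber); // search all pages, not only page 1

            var keywordPages = new List<int>();
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                keywordPages.Add(fragment.Page.Number);
            }

            int part = 1;
            foreach (var (first, last) in BuildRanges(keywordPages, source.Pages.Count))
            {
                using (var output = new Document())
                {
                    // Page numbers in Aspose.Pdf are 1-based.
                    for (int n = first; n <= last; n++)
                    {
                        output.Pages.Add(source.Pages[n]);
                    }
                    output.Save(outputPrefix + "_" + part + ".pdf");
                }
                part++;
            }
        }
    }
}
```

With the keyword on pages 1 and 5 of an 8-page file, BuildRanges yields the ranges 1 to 4 and 5 to 8, so a call such as KeywordSplitter.Split("F:/Delimiter.pdf", "Test", "F:/part") would write two files. Keeping the range calculation separate from the Aspose calls makes it possible to check the grouping logic without a PDF at hand.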