Extract and merge pages

Hello,

I have one master PDF document and I want to extract some of its pages based on whether a given text (pattern) appears on the page. I then want to merge the resulting pages into one PDF file.

The master document can be quite large, and there can be hundreds of search patterns, so I’m wondering what the best approach is in terms of performance.

Examples:

If the master document contains 100 pages and the pattern “c1@gmail.com” is found in 5 pages I want to extract them and merge into another pdf.

If the pattern “c2@gmail.com” is found in 15 pages I want to extract them and merge into another pdf.

Regards

@profiler

To extract specific pages from a PDF document based on the presence of certain text patterns and then merge those pages into a new PDF file, you can follow these steps using Aspose.PDF for Java:

  1. Load the Master PDF Document: Start by loading your master PDF document.

  2. Search for Patterns: Iterate through each page of the document and search for the specified text patterns. You can use the TextAbsorber class to extract text from each page.

  3. Extract Pages: If a page contains the desired text pattern, add that page to a new PDF document.

  4. Merge PDF Documents: Once you have extracted all relevant pages, you can merge them into a single PDF file.

Here is a sample code snippet demonstrating this process:

import com.aspose.pdf.*;

import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractAndMergePDF {
    public static void main(String[] args) {
        // Load the master PDF document
        Document masterDoc = new Document("path/to/master.pdf");

        // Define the patterns to search for
        String[] patterns = {"c1@gmail.com", "c2@gmail.com"};

        // One output document per pattern, created when the first match is found
        Map<String, Document> extractedDocs = new LinkedHashMap<>();

        // Iterate through each page in the master document (page numbers start at 1)
        for (int i = 1; i <= masterDoc.getPages().size(); i++) {
            Page page = masterDoc.getPages().get_Item(i);

            // Extract the page text once and reuse it for every pattern
            TextAbsorber textAbsorber = new TextAbsorber();
            page.accept(textAbsorber);
            String pageText = textAbsorber.getText();

            // A page can match several patterns, so check all of them
            for (String pattern : patterns) {
                if (pageText.contains(pattern)) {
                    extractedDocs.computeIfAbsent(pattern, p -> new Document())
                            .getPages().add(page);
                }
            }
        }

        // Save the pages found for each pattern to that pattern's own PDF file
        for (int p = 0; p < patterns.length; p++) {
            Document extractedDoc = extractedDocs.get(patterns[p]);
            if (extractedDoc != null) {
                extractedDoc.save("path/to/extracted_" + (p + 1) + ".pdf");
            }
        }
    }
}

Performance Considerations

  • Batch Processing: If the master document is large and you have many patterns to search for, extract the text of each page once and test every pattern against it, rather than scanning the whole document once per pattern. Processing the document in batches or with multiple threads may improve performance further.
  • Text Search Optimization: For hundreds of fixed strings, a multi-pattern matching algorithm such as Aho-Corasick may be faster than checking the patterns one by one.

This approach allows you to efficiently extract and merge pages based on text patterns, ensuring that you can handle large documents and multiple search patterns effectively.
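
To illustrate the point about many patterns, here is a minimal sketch that assumes the text of every page has already been extracted into a list of strings (for example with TextAbsorber, as in the sample above). The class PatternPageIndex and its index method are hypothetical helpers, not part of Aspose.PDF. The sketch reads each page text once and records, for every pattern, the 1-based numbers of the pages that contain it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Aspose.PDF: maps each pattern to the
// 1-based numbers of the pages whose (already extracted) text contains it.
final class PatternPageIndex {

    static Map<String, List<Integer>> index(List<String> pageTexts, List<String> patterns) {
        // Keep the patterns in their original order; duplicates collapse into one key
        Map<String, List<Integer>> hits = new LinkedHashMap<>();
        for (String pattern : patterns) {
            hits.putIfAbsent(pattern, new ArrayList<>());
        }
        // Visit every page text once and test all patterns against it
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            for (Map.Entry<String, List<Integer>> entry : hits.entrySet()) {
                if (text.contains(entry.getKey())) {
                    entry.getValue().add(i + 1);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in page texts; in practice these would come from the PDF
        List<String> pageTexts = List.of(
                "Statement for c1@gmail.com",
                "Statement for c2@gmail.com",
                "Summary: c1@gmail.com and c2@gmail.com");
        List<String> patterns = List.of("c1@gmail.com", "c2@gmail.com", "c3@gmail.com");
        System.out.println(index(pageTexts, patterns));
    }
}
```

The resulting page numbers can then be used to copy the matching pages into one output document per pattern. A pattern with no matches keeps an empty list, so it is easy to skip when saving.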

For more detailed information on PDF page manipulation, you can refer to the Aspose documentation on PDF page manipulation techniques [1].

Sources:
[1]: PDF Page Manipulation | Aspose.PDF Java PDF Processing API

Thanks for the quick reply!

Is it possible to search the text only in header or footer?

@profiler
I’ll try to investigate. Could you please attach the document so I can check how it could be done?

Header - Footer Example.pdf (22.3 KB)


@profiler
Thanks, I’ll write to you as soon as I finish investigating.

@profiler
The first solution I can suggest is a regex search with TextFragmentAbsorber, using a regex for the target email address:

var input = InputFolder + "Header_Footer.pdf";
var output = OutputFolder + "Header_Footer_out.pdf";

Aspose.Pdf.Document pdfDocument = new Document(input);

// Regex search for a specific email address
var textFragmentAbsorber = new TextFragmentAbsorber("C2@dynamosoftware.com");
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
// Rectangle covering the header area (lower-left x, lower-left y, upper-right x, upper-right y)
textFragmentAbsorber.TextSearchOptions.Rectangle = new Rectangle(0, 700, 600, 1000);
pdfDocument.Pages.Accept(textFragmentAbsorber);

// Collect each matching page once, even if it contains several matches
List<Page> pages = new List<Page>();
HashSet<int> seenPageNumbers = new HashSet<int>();
foreach (var fragment in textFragmentAbsorber.TextFragments)
{
    if (seenPageNumbers.Add(fragment.Page.Number))
    {
        pages.Add(fragment.Page);
    }
}

// Copy the matching pages into a new document
var new_doc = new Document();
foreach (var page in pages)
{
    new_doc.Pages.Add(page);
}
// In your case, merge into another document
new_doc.Save(output);

Header_Footer_out.pdf (41.5 KB)

Thanks for the reply!

How can I calculate the correct boundaries of the header and the footer of a PDF page dynamically using page.PageInfo?

@profiler
I think you can try something like the following:

// Initial height of the search rectangle
double height_delta = 100;

var rectangleAbsorber = new TextFragmentAbsorber("veli@dynamosoftware.com");
rectangleAbsorber.TextSearchOptions = new TextSearchOptions(true);

// testPage is assumed to be the page being measured, e.g. pdfDocument.Pages[1].
// Stop once the rectangle would exceed the page height, so the loop
// cannot run forever when the text is not found.
while (rectangleAbsorber.TextFragments.Count == 0
       && height_delta <= testPage.PageInfo.Height)
{
    // Rectangle at the top of the page: full page width, height = height_delta
    rectangleAbsorber.TextSearchOptions.Rectangle =
        new Rectangle(0, testPage.PageInfo.Height - height_delta,
        testPage.PageInfo.Width, testPage.PageInfo.Height);

    pdfDocument.Pages.Accept(rectangleAbsorber);
    if (rectangleAbsorber.TextFragments.Count == 0)
    {
        // Nothing found yet, so double the rectangle height
        height_delta += height_delta;
    }
}
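
The geometry of those growing rectangles can be sketched on its own, without any PDF library. The class HeaderFooterBands below is a hypothetical helper, not part of Aspose.PDF. It assumes the default PDF coordinate system (origin at the bottom-left corner of the page) and returns the rectangles as {llx, lly, urx, ury} arrays, starting at a given height and doubling it until the band would exceed the page height. The footer variant, also an assumption, mirrors the same idea at the bottom of the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF. Each band is {llx, lly, urx, ury}
// with the origin at the bottom-left corner of the page.
final class HeaderFooterBands {

    // Bands anchored at the top edge of the page, growing downwards
    static List<double[]> headerBands(double pageWidth, double pageHeight, double startDelta) {
        requirePositive(startDelta);
        List<double[]> bands = new ArrayList<>();
        for (double delta = startDelta; delta <= pageHeight; delta *= 2) {
            bands.add(new double[] {0, pageHeight - delta, pageWidth, pageHeight});
        }
        return bands;
    }

    // Bands anchored at the bottom edge of the page, growing upwards
    static List<double[]> footerBands(double pageWidth, double pageHeight, double startDelta) {
        requirePositive(startDelta);
        List<double[]> bands = new ArrayList<>();
        for (double delta = startDelta; delta <= pageHeight; delta *= 2) {
            bands.add(new double[] {0, 0, pageWidth, delta});
        }
        return bands;
    }

    // A non-positive start height would never grow, so reject it up front
    private static void requirePositive(double startDelta) {
        if (startDelta <= 0) {
            throw new IllegalArgumentException("startDelta must be positive");
        }
    }

    public static void main(String[] args) {
        // Example page size of 595 x 842 (illustrative values only)
        for (double[] band : headerBands(595, 842, 100)) {
            System.out.println("header band from y=" + band[1] + " to y=" + band[3]);
        }
        for (double[] band : footerBands(595, 842, 100)) {
            System.out.println("footer band from y=" + band[1] + " to y=" + band[3]);
        }
    }
}
```

For a page height of 842 and a start height of 100, the band heights tried are 100, 200, 400 and 800. Each band could be passed to the absorber's search rectangle in turn, and the first band that yields a match gives an estimate of the header or footer height.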

Thanks! What is the best approach to search text on multiple lines? For example, find Jimmy Page on page 3 and Mick Jagger on page 5.

Here is a sample document:

masterDocMultipleLines.pdf (44.0 KB)

@profiler Could you please specify whether you want to find them simultaneously or as separate queries?
In general, I suppose TextFragmentAbsorber with a regex as input and TextSearchOptions(true) covers most cases.
Or perhaps you mean something more specific?
I’ll provide a code snippet shortly.

Thanks for the reply!

I’m currently migrating an existing functionality we have (using another 3rd party component) to Aspose. What we need is to search for text on multiple lines, as I already mentioned. Here is my Unit test code:

public void SplitOnMultipleLines()
{   
    var data = new DataTable();
    data.Columns.Add(new DataColumn("Full name", typeof(string)));

    var r = data.Rows.Add();
    r[0] = "Adele";

    r = data.Rows.Add();
    r[0] = "Bono";

    r = data.Rows.Add();
    r[0] = "Madonna";

    r = data.Rows.Add();
    r[0] = "Jimmy Page";

    r = data.Rows.Add();
    r[0] = "Mick Jagger";

    var master = GetFileFromEmbeddedResources("masterDocMultipleLines.pdf");

    var pagesCount = 0;

    foreach (DataRow dr in data.Rows)
    {
        var pattern = (string)dr[0];

        if (Find(master, pattern))
            pagesCount++;
    }

    Assert.AreEqual(5, pagesCount);
}

public static bool Find(string masterDocumentLocation, string text)
{
    var masterDocument = new Aspose.Pdf.Document(masterDocumentLocation);

    foreach (Page page in masterDocument.Pages)
    {
        if (FindTextInHeader(page, text)) return true;
    }

    return false;
}

public static bool FindTextInHeader(Page page, string text)
{
    double pageWidth = page.PageInfo.Width;
    double pageHeight = page.PageInfo.Height;
    double delta = 100;

    // Create a TextFragmentAbsorber to search for the text within this region
    var rectangleAbsorber = new TextFragmentAbsorber(text);
    rectangleAbsorber.TextSearchOptions = new TextSearchOptions(true);

    while (rectangleAbsorber.TextFragments.Count == 0 && delta <= 800)
    {
        //here we create rectangle at the top of page with page's width and height = delta
        rectangleAbsorber.TextSearchOptions.Rectangle =
            new Rectangle(0, page.PageInfo.Height - delta, pageWidth, pageHeight);

        page.Accept(rectangleAbsorber);
        if (rectangleAbsorber.TextFragments.Count > 0)
            return true;
        else
        {
            //if we haven't found any fragments we increase rectangle height
            delta += delta;
        }
    }

    return false;
}

Regards,
Velislav

@profiler

Update on header searching: you can use rectangleAbsorber.Visit(testPage); instead of doc.Pages.Accept(rectangleAbsorber). That way you work with only one page, which is enough to calculate the header size, so performance improves a bit.

As for multiline, the code will look something like the following:

var input_multiline = InputFolder + "masterDocMultipleLines.pdf";
Aspose.Pdf.Document pdfDocument = new Document(input_multiline);
// Regex: "Jimmy", 1+ whitespace characters, 0+ newlines, 0+ whitespace characters, "Page",
// OR the same construction for "Mick Jagger"
var textFragmentAbsorber =
    new TextFragmentAbsorber(new Regex(@"Jimmy\s+\n*\s*Page|Mick\s+\n*\s*Jagger"));
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
pdfDocument.Pages.Accept(textFragmentAbsorber);

I skipped the part that collects the pages and moves them to a new document, but it’s the same as before.
Header_Footer_out.pdf (25.6 KB)
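
For a list of names such as the one in the unit test, the regex can be built from each name instead of being written by hand. The sketch below uses java.util.regex rather than the .NET Regex class from the snippet above, and the class MultilineNameSearch is a hypothetical helper, not part of Aspose.PDF. It joins the words of a name with \s+, which already matches spaces, tabs and line breaks, and quotes each word so that special characters are treated literally. Whether the text extracted from the PDF actually contains a line break between the words is an assumption about the document.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF: builds a regex that finds a
// full name even when its words are separated by line breaks.
final class MultilineNameSearch {

    // "Jimmy Page" becomes \QJimmy\E\s+\QPage\E
    static Pattern namePattern(String fullName) {
        String[] words = fullName.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s+"); // any run of whitespace, including newlines
            }
            regex.append(Pattern.quote(words[i])); // treat the word literally
        }
        return Pattern.compile(regex.toString());
    }

    static boolean containsName(String text, String fullName) {
        return namePattern(fullName).matcher(text).find();
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a page, with the name split over two lines
        String pageText = "Guitar: Jimmy\nPage\nVocals: Mick \n Jagger";
        System.out.println(containsName(pageText, "Jimmy Page"));
        System.out.println(containsName(pageText, "Mick Jagger"));
        System.out.println(containsName(pageText, "Bono"));
    }
}
```

The string returned by namePattern(name).pattern() could serve as a starting point for the regex passed to the absorber, although .NET and Java regex syntax differ in places (for example, the \Q...\E quoting is not available in .NET, where Regex.Escape would be the counterpart).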

Thanks for the quick reply! Works for me.
