Split PDF by phrase and name each file with another word found on it

Panayotap · November 29, 2023, 7:23am

How can I split a PDF wherever a given phrase occurs and name each resulting PDF after another word found on the same page, programmatically, using C# and Aspose.Pdf?

asad.ali · November 29, 2023, 6:07pm

@Panayotap

You can split a PDF document by pages. The API has no direct feature for splitting a document by its content, such as text inside pages. Also, given the structure of the PDF file format, splitting and rearranging content would be a complex feature. Nevertheless, could you please share a sample PDF along with the expected output for our reference?

Panayotap · November 30, 2023, 10:44am

Whenever I find the specific phrase “τρέχουσα ληξιπρόθεσμη δόση” (“current overdue installment”), I want to split the PDF into a separate file. If the phrase occurs N times, I want to split this large PDF into N PDF files. Each file should be named after a value found in it, as in “ΑΡΙΘΜ. ΛΟΓΑΡΙΑΣΜΟΥ ΔΑΝΕΙΟΥ: 4044080780” (“loan account number”). This number, which is the loan number, changes from page to page. I want to name each file with the loan number followed by the date, which appears in the name of the source PDF file. I hope I have explained this clearly. I have uploaded a sample of my PDF file.
IRKT00_Report_10062021_00001.pdf (154.7 KB)

asad.ali · November 30, 2023, 5:00pm

@Panayotap

Please check the below code snippet with the attached output PDFs and let us know if this helps:

// Requires: using Aspose.Pdf; using Aspose.Pdf.Text; using System.Text.RegularExpressions;
private static void SplitPDF(string dataDir)
{
    // Load the source PDF document
    Document pdfDocument = new Document(dataDir + "IRKT00_Report_10062021_00001.pdf");

    // Find every loan-number label; TextSearchOptions(true) enables regular-expression search
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"ΔΑΝΕΙΟΥ:\s+(\d+)", new TextSearchOptions(true));
    pdfDocument.Pages.Accept(textFragmentAbsorber);

    // Each match produces one single-page output document
    foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
    {
        // Get the loan number from the matched text
        string loanNumber = GetLoanNumber(textFragment);

        // Copy the page that contains the match into a new PDF document
        Document newPdfDocument = new Document();
        newPdfDocument.Pages.Add(pdfDocument.Pages[textFragment.Page.Number]);

        // Save the new PDF document under the loan number.
        // Note: a loan number matched on more than one page overwrites the earlier file.
        string outputFileName = $"{loanNumber}.pdf";
        newPdfDocument.Save(dataDir + outputFileName);
    }
}

static string GetLoanNumber(TextFragment text)
{
    // The matched text looks like "ΔΑΝΕΙΟΥ: 4044080780"; keep only the digits after the label.
    // Adjust this logic if your documents format the loan number differently.
    return Regex.Match(text.Text, @"\d+").Value;
}

SplitPDF.zip (387.1 KB)
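
The snippet above names each file with the loan number only, while the original request also asked for the date taken from the source file name. A minimal sketch of that step follows. It uses only the .NET base class library; OutputNaming and BuildOutputFileName are hypothetical names, not part of Aspose.Pdf. It assumes the date is the eight-digit run delimited by underscores in the source file name (as in IRKT00_Report_10062021_00001.pdf), and the underscore separator in the result is also an assumption.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose.Pdf.
public static class OutputNaming
{
    // Builds "<loanNumber>_<date>.pdf" from the source PDF file name.
    // Assumption: the date is an eight-digit run preceded by an underscore
    // and followed by an underscore or the end of the name.
    public static string BuildOutputFileName(string sourcePdfPath, string loanNumber)
    {
        string sourceName = Path.GetFileNameWithoutExtension(sourcePdfPath);
        Match dateMatch = Regex.Match(sourceName, @"(?<=_)\d{8}(?=_|$)");

        // Fall back to the loan number alone when no date is found.
        return dateMatch.Success
            ? $"{loanNumber}_{dateMatch.Value}.pdf"
            : $"{loanNumber}.pdf";
    }
}
```

Inside the loop, the outputFileName assignment could then call OutputNaming.BuildOutputFileName with the source file name and loanNumber instead of formatting the loan number directly.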

Panayotap · November 30, 2023, 8:05pm

Thank you! It works. I extended the code so that it also searches for the other phrase. Thanks again!

asad.ali · December 1, 2023, 12:40am

@Panayotap

It's nice to know that you were able to achieve your requirements. Please keep using our API and feel free to let us know if you need further assistance.

Panayotap · December 1, 2023, 7:43am

Sure, I will. Thanks again!