Extract Text between two strings

rajeevkrmathur · July 9, 2015, 8:27am

Hi,

I wanted to check whether there is a way to extract the text between two strings on a page using Aspose.Pdf for Java. I have a scenario where I need to extract the text between two strings, and if I do not find the end string, I need to count a certain number of characters and retrieve those instead.

Regards,

tilal.ahmad · July 10, 2015, 9:09am

Hi Rajeev,

Thanks for your query. To search for text between two strings, you can use a regular expression in Aspose Libraries. Below is a sample Java code snippet that demonstrates this. You can modify it according to your needs.

Sample Java Code:

Open the document using the Aspose.PDF library.
Create a TextFragmentAbsorber object to search for text between two strings using a regular expression.
Accept the absorber for the desired pages of the document.
Extract and display the found text fragments.

// Open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("Input.pdf");

// Create a TextFragmentAbsorber that searches with a regular expression;
// TextSearchOptions(true) tells the absorber to treat the phrase as a regex
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("(?<=starting word)(.*?)(?=ending word)", new com.aspose.pdf.TextSearchOptions(true));

// Accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the Text fragments
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection) {
    System.out.print(textFragment.getText());
}
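To illustrate what the lookbehind/lookahead pattern above returns, here is a minimal sketch using the JDK's own java.util.regex rather than Aspose.Pdf. The class and method names (LookaroundDemo, firstMatch) are made up for this illustration, and it is only an assumption that Aspose's regex search behaves the same way as the JDK engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Aspose.Pdf.
class LookaroundDemo {

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String regex = "(?<=starting word)(.*?)(?=ending word)";

        // The delimiters themselves are excluded; only the text between them is returned.
        System.out.println("[" + firstMatch(regex, "starting word MIDDLE ending word") + "]");

        // In java.util.regex, '.' does not match a line break unless Pattern.DOTALL is set,
        // so a start and end on different lines do not match here.
        System.out.println(firstMatch(regex, "starting word one\ntwo ending word"));
    }
}
```

In the JDK engine the lookarounds are zero-width, so the returned text keeps any spaces next to the delimiters.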

Please feel free to contact us for any further assistance.

Best Regards,

rajeevkrmathur · July 10, 2015, 10:44am

Hi Tilal,

Thanks for your response. I tried executing the code and it works fine when the start and end are on the same page. In the scenario I am testing, the start and end will be on separate pages.

Do I have to use multiple TextFragmentAbsorber objects, or is there a way to have a TextFragmentAbsorber accepted across multiple pages?

Regards,

codewarior · July 13, 2015, 4:01am

Hi Rajeev,

Thanks for the acknowledgement.

In the code above, a single instance of TextFragmentAbsorber iterates through all the pages of the PDF file when the pdfDocument.getPages().accept(…) method is used. However, can you please share the source PDF file and the code snippet you are currently using, so that we can test the scenario in our environment? We are sorry for this inconvenience.

rajeevkrmathur · August 27, 2015, 9:03am

Hi,

I have attached the file where I am facing the issue. My TextFragmentAbsorber is as below, where the “AsopTestPdf.java:150” text is on the first page and “fileSize :- 135316” is on the second page. In this scenario it does not print anything.

com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("(?<=AsopTestPdf.main(AsopTestPdf.java:150))(.*?)(?=fileSize :- 135316)",new TextSearchOptions(true));
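A side note on the pattern above: the literal parentheses and dots in the start string are regex metacharacters. The sketch below uses the JDK's java.util.regex to show the effect and one way to escape the delimiters with Pattern.quote. The names EscapeDemo and betweenPattern are hypothetical, and it is an assumption that Aspose's regex search treats these characters the same way; this thread does not confirm whether escaping plays any part in the reported behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Aspose.Pdf.
class EscapeDemo {

    // Builds a "text between start and end" pattern with both delimiters
    // quoted, so characters such as '(' ')' and '.' are matched literally.
    static String betweenPattern(String start, String end) {
        return "(?<=" + Pattern.quote(start) + ")(.*?)(?=" + Pattern.quote(end) + ")";
    }

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "AsopTestPdf.main(AsopTestPdf.java:150) hello fileSize :- 135316";

        // Unescaped: the inner parentheses form a group instead of matching "(" and ")".
        String raw = "(?<=AsopTestPdf.main(AsopTestPdf.java:150))(.*?)(?=fileSize :- 135316)";
        System.out.println(firstMatch(raw, text));

        // Quoted delimiters are matched character for character.
        String quoted = betweenPattern("AsopTestPdf.main(AsopTestPdf.java:150)", "fileSize :- 135316");
        System.out.println("[" + firstMatch(quoted, text) + "]");
    }
}
```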

Regards,

tilal.ahmad · August 28, 2015, 8:41am

Hi Rajeev,

Thanks for sharing the additional information. I am afraid Aspose.Pdf does not currently support searching for text that spans two pages, so I have logged a ticket, PDFNEWJAVA-35095, in our issue tracking system for further investigation and rectification. We will notify you as soon as it is resolved.
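Until then, one possible interim approach is to work on the plain text of the whole document instead of on text fragments. The sketch below is plain Java and only covers the string handling, including the fallback from the original question of taking a fixed number of characters when the end string is missing. The names BetweenMarkers and extract are hypothetical. How the combined text is obtained is an assumption: it could, for example, come from a com.aspose.pdf.TextAbsorber accepted on pdfDocument.getPages() and read with getText(), but that route has not been tested in this thread, and the extracted text may contain line breaks or spacing that differ from what is visible on the page.

```java
// Hypothetical helper class, not part of Aspose.Pdf.
class BetweenMarkers {

    // Returns the text between the first occurrence of start and the next
    // occurrence of end. If end is not found, returns up to fallbackLength
    // characters after start. Returns null when start itself is not found.
    static String extract(String text, String start, String end, int fallbackLength) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();

        int to = text.indexOf(end, from);
        if (to < 0) {
            // End marker missing: take a bounded number of characters instead.
            to = Math.min(text.length(), from + Math.max(0, fallbackLength));
        }
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // Stand-in for the text of two pages joined together.
        String allText = "page one START first part\npage two second part END trailer";
        System.out.println(extract(allText, "START", "END", 10));
        System.out.println(extract("xx START abcdefgh", "START", "END", 3));
    }
}
```

Because this works on a plain string, it returns text only and loses the position and formatting details that a TextFragment carries.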

We are sorry for the inconvenience caused.

Best Regards,