Extract Text between two strings

Hi,


Is there a way to extract the text between two strings on a page using Aspose.Pdf for Java? In my scenario I need to extract the text between a start string and an end string, and if the end string is not found, I need to retrieve a fixed number of characters after the start string instead.

Regards,

Hi Rajeev,


Thanks for your inquiry. You need to use a regular expression to search for the text between two strings. Please check the sample code snippet below; you may modify it as per your needs.

// open the document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("Input.pdf");

// create a TextFragmentAbsorber that finds the text between the two phrases;
// TextSearchOptions(true) enables regular-expression search
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("(?<=starting word)(.*?)(?=ending word)", new com.aspose.pdf.TextSearchOptions(true));

// accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// loop through the text fragments and print each match
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    System.out.print(textFragment.getText());
}
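The snippet above does not cover the case where the end string is missing. A minimal sketch of that fallback in plain Java is given below. It assumes the page text is already available as a String; the class name BetweenStrings, the method extract and the parameter maxChars are hypothetical names used only for illustration and are not part of Aspose.Pdf.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Pdf: works on text that has already been extracted.
class BetweenStrings {

    // Returns the text between start and end.
    // If end is not found, returns up to maxChars characters following start.
    // Returns null if start itself is not found.
    static String extract(String text, String start, String end, int maxChars) {
        // Pattern.quote treats both markers as literals, so characters such as ( ) . lose their regex meaning.
        // DOTALL lets the match run across line breaks.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end), Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            return matcher.group(1);
        }
        // Fallback: end marker missing, so take a fixed number of characters after start.
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = Math.min(text.length(), from + maxChars);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "Invoice No: 12345 Date: 2015-01-01";
        System.out.println(extract(text, "Invoice No: ", " Date", 10)); // end marker present
        System.out.println(extract(text, "Date: ", " Total", 4));       // end marker missing
    }
}
```

With the sample text, the first call returns the text between the two markers and the second call falls back to the four characters after "Date: ".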


Please feel free to contact us for any further assistance.


Best Regards,

Hi Tilal,

Thanks for your response. I tried executing the code and it works fine when the start and end strings are on the same page. In the scenario I am testing, the start and end strings are on separate pages.

Do I have to use multiple TextFragmentAbsorber objects, or is there a way to have a single TextFragmentAbsorber search across multiple pages?

Regards,

Hi Rajeev,


Thanks for the acknowledgement.

In the code above, a single instance of TextFragmentAbsorber is used to iterate through all the pages of the PDF file when calling the pdfDocument.getPages().accept(…) method. However, can you please share the source PDF file and the code snippet you are currently using, so that we can test the scenario in our environment? We are sorry for the inconvenience.

Hi,


I have attached the file where I am facing the issue. My TextFragmentAbsorber is as below, where the text “AsopTestPdf.java:150” is on the first page and “fileSize :- 135316” is on the second page. In this scenario it does not print anything.

com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("(?<=AsopTestPdf.main(AsopTestPdf.java:150))(.*?)(?=fileSize :- 135316)",new TextSearchOptions(true));

Regards,

Hi Rajeev,


Thanks for sharing the additional information. I am afraid Aspose.Pdf does not currently support searching for text that spans two pages, so I have logged a ticket, PDFNEWJAVA-35095, in our issue tracking system for further investigation and rectification. We will notify you as soon as it is resolved.
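Until then, one possible workaround is to collect the text of each page yourself, join it into one string, and run the search on that string. The sketch below shows only the joining and searching step in plain Java. It assumes the per-page text has already been extracted (for example with one com.aspose.pdf.TextAbsorber per page, which is not shown here); the class CrossPageSearch and the method between are hypothetical names. Note also that in standard Java regex syntax the parentheses and dots in your start string are special characters, so the sketch escapes both markers with Pattern.quote.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Pdf: searches across already extracted page texts.
class CrossPageSearch {

    // Joins the page texts with a newline and returns the text between start and end,
    // or null if the pair is not found.
    static String between(List<String> pageTexts, String start, String end) {
        String allText = String.join("\n", pageTexts);
        // Both markers are quoted so that ( ) . are matched literally;
        // DOTALL lets the match cross line and page breaks.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end), Pattern.DOTALL);
        Matcher matcher = pattern.matcher(allText);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Stand-in page texts; real input would come from the PDF.
        List<String> pages = Arrays.asList(
                "at AsopTestPdf.main(AsopTestPdf.java:150)\nline A",
                "line B\nfileSize :- 135316");
        System.out.println(between(pages,
                "AsopTestPdf.main(AsopTestPdf.java:150)", "fileSize :- 135316"));
    }
}
```

This approach returns plain text only, so the position and formatting information that TextFragment objects carry is lost.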

We are sorry for the inconvenience caused.

Best Regards,