Free Support Forum - aspose.com

Extracting text between two strings in two different pages paragraph by paragraph

Hi,
We need to extract the text between the strings "5.3 Preclinical safety data" and "6. PHARMACEUTICAL PARTICULARS" in the attached document. We are using aspose.pdf-11.0.0.jar with the following regular expression in our code:
TextFragmentAbsorber tfa = new TextFragmentAbsorber("Preclinical safety data.*6\\.", new TextSearchOptions(true));
pdfDocument.getPages().accept(tfa);

This expression works only if the two strings are on the same page. How can I match the text regardless of whether the strings are on the same page or span multiple pages?
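
One possible approach, as a minimal sketch rather than a confirmed Aspose solution: extract the text of each page first, join the page texts into one string, and run the regular expression on the joined string with `java.util.regex` in DOTALL mode, so that `.` also matches line breaks. The class `CrossPageMatchSketch` and the method `extractBetween` are hypothetical names used for illustration. The sketch assumes the per-page text is already available as a list of strings, and it does not show how to obtain it from the PDF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of the Aspose API.
class CrossPageMatchSketch {

    // Joins the per-page texts and returns the text between the two markers,
    // or null if the markers are not found in that order.
    static String extractBetween(List<String> pageTexts, String startMarker, String endMarker) {
        // Assumption: pageTexts holds the already extracted text of each page, in page order.
        String fullText = String.join("\n", pageTexts);

        // Pattern.quote treats the markers as literals (so "6." does not match "6x").
        // DOTALL lets "." match line breaks, including the inserted page separators.
        Pattern pattern = Pattern.compile(
                Pattern.quote(startMarker) + "(.*?)" + Pattern.quote(endMarker),
                Pattern.DOTALL);

        Matcher matcher = pattern.matcher(fullText);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Placeholder page texts standing in for real extraction output.
        List<String> pages = Arrays.asList(
                "5.3 Preclinical safety data\nFirst paragraph on page one.",
                "Last paragraph on page two.\n6. PHARMACEUTICAL PARTICULARS");

        System.out.println(extractBetween(
                pages, "5.3 Preclinical safety data", "6. PHARMACEUTICAL PARTICULARS"));
    }
}
```

Because the match runs on plain text rather than on page-bound fragments, page numbers and positions of the match are not available from this sketch.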

The second requirement is to extract each paragraph within the matching text separately. For the attached PDF, the following paragraphs should be extracted:
(1) Repeated-dose studies of up to 3-months duration have been conducted in rat and dog. Maximum daily exposures (AUC) at the No Observed Adverse Effect Levels in the 3-month study in rat were 3.6 times and in the 4 week study in dog 9.4 times the AUC in humans after a subcutaneous dose of 30 mg.

-----
and so on, up to
-----

(7) Icatibant did not elicit any cardiac conduction change in vitro (hERG channel) or in vivo in normal dogs or in various dog models (ventricular pacing, physical exertion and coronary ligation) where no associated hemodynamic changes were observed. Icatibant has been shown to aggravate cardiac ischemia in several non-clinical models, although a detrimental effect has not consistently been shown in acute ischemia. Due to species differences in the effect of bradykinin, translation of the results obtained in animals to man is difficult.
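
For the paragraph requirement, a minimal sketch of one way to split the matched text, assuming the extracted text separates paragraphs with at least one blank line. PDF text extraction does not always preserve blank lines, so this assumption needs to be checked against the real output. `ParagraphSplitSketch` and `splitParagraphs` are hypothetical names, not Aspose API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of the Aspose API.
class ParagraphSplitSketch {

    // Splits text into paragraphs at blank lines and joins the wrapped lines
    // inside each paragraph with single spaces.
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();

        // "\\R" matches any line break; two line breaks with optional
        // whitespace between them mark a paragraph boundary.
        for (String block : text.split("\\R\\s*\\R")) {
            String cleaned = block.replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                paragraphs.add(cleaned);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Placeholder text standing in for the matched section.
        String matched = "First paragraph, line one\nline two.\n\nSecond paragraph.";
        List<String> paragraphs = splitParagraphs(matched);
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println("(" + (i + 1) + ") " + paragraphs.get(i));
        }
    }
}
```

If the extracted text has no blank lines between paragraphs, a different boundary rule would be needed, for example one based on line positions or indentation.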

Kindly let us know.
Regards
Sujith Babu

Hi Sujith,

Thanks for contacting support.

I am working on testing the scenario in my environment and will get back to you soon.

Hi Nayyer,

Is there any update on this?
Regards
Sujith Babu

Hi Sujith,


Thanks for your patience.

I have been trying to test the scenario using the code lines you shared earlier, but I am afraid it is not returning the file contents. Can you please share the complete code snippet you are using, so that we can test the scenario in our environment? We are sorry for the inconvenience.

[Java]

// Open a document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/HTMl_to_PDFouput (1).pdf");

// Create a TextFragmentAbsorber to find all instances of the search pattern
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("(?i)Preclinical safety data *6\\.", new com.aspose.pdf.TextSearchOptions(true));

// Set text search option to specify regular expression usage
com.aspose.pdf.TextSearchOptions textSearchOptions = new com.aspose.pdf.TextSearchOptions(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);

// Accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the fragments and print each fragment's text and page number
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    System.out.println("Text :- " + textFragment.getText());
    System.out.println("Page Number:- " + textFragment.getPage().getNumber());
}