Hi,
We need to extract the text between the strings "5.3 Preclinical safety data" and "6. PHARMACEUTICAL PARTICULARS" in the attached document. We are using aspose.pdf-11.0.0.jar with the following regular expression in our code:
TextFragmentAbsorber tfa = new TextFragmentAbsorber("Preclinical safety data.*6\\.", new TextSearchOptions(true));
pdfDocument.getPages().accept(tfa);
This expression works only if both strings are on the same page. How can we match the text regardless of whether the two strings are on the same page or spread across multiple pages?
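To illustrate the behaviour we are after, here is a minimal sketch using plain java.util.regex on the combined text of all pages. It assumes the text of each page has already been extracted into a String; the pageTexts list, the class name CrossPageExtract and the method extractBetween are hypothetical names for illustration, not part of the Aspose API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrossPageExtract {

    // Joins the per-page texts into one string and returns the text found
    // between the first occurrence of 'start' and the next occurrence of 'end'.
    // Returns null when no such span exists.
    // Assumption: pageTexts holds the already extracted text, one entry per page.
    static String extractBetween(List<String> pageTexts, String start, String end) {
        String allText = String.join("\n", pageTexts);

        // Pattern.quote treats both markers literally (so the '.' in "6." is not a wildcard).
        // DOTALL lets '.' match line breaks, and '.*?' stops at the first 'end' marker.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);

        Matcher matcher = pattern.matcher(allText);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample pages: the start marker is on page 1, the end marker on page 2.
        List<String> pages = Arrays.asList(
                "Earlier text\n5.3 Preclinical safety data\nFirst paragraph.",
                "Second paragraph.\n6. PHARMACEUTICAL PARTICULARS\nLater text");

        System.out.println(extractBetween(
                pages,
                "5.3 Preclinical safety data",
                "6. PHARMACEUTICAL PARTICULARS"));
    }
}
```

The markers are matched literally and the match is non-greedy, so a later "6." elsewhere in the document would not extend the result.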
The second requirement is to extract each paragraph within the matched text separately. For the attached PDF, the following paragraphs should be extracted:
(1) Repeated-dose studies of up to 3-months duration have been conducted in rat and dog. Maximum daily exposures (AUC) at the No Observed Adverse Effect Levels in the 3-month study in rat were 3.6 times and in the 4 week study in dog 9.4 times the AUC in humans after a subcutaneous dose of 30 mg.
-----
and so on, up to
-----
(7) Icatibant did not elicit any cardiac conduction change in vitro (hERG channel) or in vivo in normal dogs or in various dog models (ventricular pacing, physical exertion and coronary ligation) where no associated hemodynamic changes were observed. Icatibant has been shown to aggravate cardiac ischemia in several non-clinical models, although a detrimental effect has not consistently been shown in acute ischemia. Due to species differences in the effect of bradykinin, translation of the results obtained in animals to man is difficult.
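As a rough sketch of the paragraph split we have in mind, the following assumes that the extracted section text separates paragraphs with at least one blank line, which may not hold for every PDF. ParagraphSplitter and splitParagraphs are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphSplitter {

    // Splits the section text into paragraphs.
    // Assumption: paragraphs are separated by one or more blank lines.
    // Line breaks inside a paragraph are collapsed into single spaces.
    static List<String> splitParagraphs(String sectionText) {
        List<String> paragraphs = new ArrayList<>();

        // "\\R\\s*\\R" matches a line break, optional whitespace, and another line break.
        for (String block : sectionText.trim().split("\\R\\s*\\R")) {
            String paragraph = block.replaceAll("\\s+", " ").trim();
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample text with a wrapped first paragraph.
        String section = "First paragraph, wrapped\nover two lines.\n\nSecond paragraph.";

        List<String> result = splitParagraphs(section);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("(" + (i + 1) + ") " + result.get(i));
        }
    }
}
```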
Kindly let us know
Regards
Sujith Babu