Extracting text between two strings in two different pages paragraph by paragraph

SujithBabuAG · March 9, 2016, 4:04am

Hi,

We need to extract the text between the strings "5.3 Preclinical safety data" and "6. PHARMACEUTICAL PARTICULARS" in the attached document. We are using aspose.pdf-11.0.0.jar with the following regular expression in our code:

TextFragmentAbsorber tfa = new TextFragmentAbsorber("Preclinical safety data.*6\\.", new TextSearchOptions(true));

pdfDocument.getPages().accept(tfa);

This expression works only if the two strings are on the same page. How can I match the text regardless of whether the strings are on the same page or span multiple pages?

The second requirement is to extract each paragraph within the matching text separately. For the attached PDF, the following paragraphs should be extracted:

(1) Repeated-dose studies of up to 3-months duration have been conducted in rat and dog. Maximum daily exposures (AUC) at the No Observed Adverse Effect Levels in the 3-month study in rat were 3.6 times and in the 4 week study in dog 9.4 times the AUC in humans after a subcutaneous dose of 30 mg.

-----

and so on, up to

---------

(7) Icatibant did not elicit any cardiac conduction change in vitro (hERG channel) or in vivo in normal dogs or in various dog models (ventricular pacing, physical exertion and coronary ligation) where no associated hemodynamic changes were observed. Icatibant has been shown to aggravate cardiac ischemia in several non-clinical models, although a detrimental effect has not consistently been shown in acute ischemia. Due to species differences in the effect of bradykinin, translation of the results obtained in animals to man is difficult.

Kindly let us know.

Regards

Sujith Babu

codewarior · March 10, 2016, 7:25am

Hi Sujith,

Thanks for contacting support.

I am working on testing the scenario in my environment and will get back to you soon.

SujithBabuAG · March 14, 2016, 12:33am

Hi Nayyer,

Is there any update on this?

Regards

Sujith Babu

codewarior · March 15, 2016, 9:07am

Hi Sujith,

Thanks for your patience.

I have been trying to test the scenario using the code lines you shared earlier, but I am afraid they are not returning the file contents. Could you please share the complete code snippet you are using, so that we can test the scenario in our environment? We are sorry for the inconvenience. This is the code I used:

[Java]

// Open a document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/HTML_to_PDFoutput (1).pdf");

// Create a TextFragmentAbsorber to find all instances of the search phrase
// (the second argument enables regular-expression matching)
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber(
    "(?i)Preclinical safety data *6\\.",
    new com.aspose.pdf.TextSearchOptions(true)
);

// Set text search option to specify regular expression usage
textFragmentAbsorber.setTextSearchOptions(new com.aspose.pdf.TextSearchOptions(true));

// Accept the absorber for the pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments as a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the fragments
for (com.aspose.pdf.TextFragment textFragment : textFragmentCollection) {
    System.out.println("Text: " + textFragment.getText());
    System.out.println("Page Number: " + textFragment.getPage().getNumber());
}
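
For reference, here is a minimal sketch of one way to approach both requirements from the original question. It does not call Aspose at all. It assumes the text of each page has already been extracted into a list of strings, one per page (for example with a text absorber run page by page), and that paragraphs in the extracted text are separated by blank lines. The class name SectionExtractor and both method names are hypothetical. The idea is to join the page texts first, so that a plain java.util.regex pattern with DOTALL can match across page boundaries, and then split the matched section on blank lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF.
final class SectionExtractor {

    // DOTALL lets '.' match line breaks, so the match can cross the points
    // where the page texts were joined. The reluctant (.*?) stops at the
    // first occurrence of the closing heading.
    private static final Pattern SECTION = Pattern.compile(
            "5\\.3\\s+Preclinical safety data(.*?)6\\.\\s+PHARMACEUTICAL PARTICULARS",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Joins the per-page texts and returns the section body, or null if not found. */
    static String extractSection(List<String> pageTexts) {
        String fullText = String.join("\n", pageTexts);
        Matcher m = SECTION.matcher(fullText);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Splits a section into paragraphs, assuming blank lines separate them. */
    static List<String> splitParagraphs(String section) {
        List<String> paragraphs = new ArrayList<>();
        for (String block : section.split("\\R\\s*\\R")) {
            // Collapse line wraps inside a paragraph into single spaces.
            String paragraph = block.replaceAll("\\s+", " ").trim();
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }
}
```

Two limitations of this sketch: because pages are joined with a single line break, a paragraph that ends exactly at the bottom of a page is merged with the first paragraph of the next page, and any page headers or footers present in the extracted text end up inside the match. Both would need handling specific to the document.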