Aspose PDF - Read existing PDF Paragraph by Paragraph

debashishr · January 24, 2018, 8:59am

First of all, kudos to Aspose for the amazing PDF software. It makes working with PDFs really easy, and the documentation is pretty comprehensive.

I was able to experiment with quite a few of your APIs, and it seems that most of our complicated requirements can be met using Aspose.

However, I need to read the entire PDF paragraph by paragraph and store each paragraph in a String array (each string representing one paragraph), but I have been unable to do so. I searched the documentation extensively but could not find a way. Am I missing something? Could you please guide me to a way to achieve this?

imran.rafique · January 24, 2018, 8:14pm

@debashishr,

Aspose.Pdf for .NET supports retrieving content paragraph by paragraph. This feature was added in the recent version 18.1, and the documentation for it will be added soon. Please try the following code example:

[C#]

Document doc = new Document(dataDir + "input.pdf");
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(doc);

foreach (PageMarkup markup in absorber.PageMarkups)
{
    int i = 1;
    foreach (MarkupSection section in markup.Sections)
    {
        int j = 1;
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            StringBuilder paragraphText = new StringBuilder();

            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment fragment in line)
                {
                    paragraphText.Append(fragment.Text);
                }
                paragraphText.Append("\r\n");
            }
            paragraphText.Append("\r\n");
            Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
            Console.WriteLine(paragraphText.ToString());
            j++;
        }
        i++;
    }
}

debashishr · January 28, 2018, 4:26am

Is there any way to do the same in Java? I tried to find ParagraphAbsorber but could not find it in the Aspose Java libraries.

asad.ali · January 28, 2018, 10:02pm

@debashishr

Thanks for writing back.

I am afraid this feature is not currently available in Aspose.PDF for Java 17.12, which is the latest version of the Java API. However, it is expected to be ported from Aspose.PDF for .NET to the Java API in the upcoming release, Aspose.PDF for Java 18.1. As soon as the feature is available for Java, we will inform you in this forum thread. Moreover, we would appreciate it if you could share the JDK/JRE version you are working with, as this information helps us maintain our API accordingly.

We are sorry for the inconvenience.

debashishr · January 31, 2018, 11:56am

Even the features that are already available are awesome, so you don't need to apologize. It's cool that you are implementing this feature in future versions. We are presently using Java 8 only and can always upgrade if required. Could you give us a rough idea of how soon the next version will be available?

asad.ali · January 31, 2018, 7:13pm

@debashishr

Thanks for your kind feedback.

The next release of Aspose.PDF for Java is expected this week or early next week. We have linked the ticket ID PDFJAVA-35762 with this forum thread, so you will receive a notification once the feature is available in the Java API.

debashishr · February 2, 2018, 2:25pm

Hi Asad,

Can you please specify how to access the jar using Maven/pom.xml?

asad.ali · February 2, 2018, 10:25pm

@debashishr

Thanks for contacting support.

Please visit the “Install Aspose.PDF for Java” article in our API documentation to use the API in your Maven project. Furthermore, we are pleased to inform you that the earlier logged feature request PDFJAVA-35762 has been fulfilled in the latest version, Aspose.PDF for Java 18.1. However, for some reason we were unable to upload the latest version of the API to the Maven repository; for now, you may download and use it from the Downloads section of our website.

In the event of any further queries, please feel free to let us know.

debashishr · February 8, 2018, 3:04pm

Hi Asad,

Since the new version was not available, we used Aspose to convert the PDF to Word and then used the Apache POI libraries to read it paragraph by paragraph. However, the converted Word document does not contain plain text; instead, it creates an object, something like a text box, and the text is displayed inside it. Is it possible to convert the PDF so that the text is rendered as plain text in the Word document?

asad.ali · February 8, 2018, 8:12pm

@debashishr

Thanks for getting back to us.

Would you please share your sample PDF document along with the code snippet you used to convert the PDF into DOC/DOCX format? We will test the scenario in our environment and address it accordingly.

debashishr · February 12, 2018, 12:56pm

Hi Asad,

I used the following code (in line with the example provided above), but only a part of the 1st line got printed.

Document doc = new Document("PDF1.pdf");
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            for (List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                sb.append("\r\n");
            }
            sb.append("\r\n");
            System.out.println(sb);
        }
    }
}

I have enclosed the file too: PDF1.pdf (156.6 KB)
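
The loop above prints each paragraph. To keep the paragraphs in a String array instead, as the original question asks, the same traversal can collect the text into a list and convert it at the end. The following is a minimal sketch in plain Java, not Aspose code: the nested lists of strings are stand-ins for what getParagraphs(), getLines() and TextFragment.getText() would yield, and the class ParagraphCollector and its method toParagraphArray are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphCollector {

    /**
     * Builds one string per paragraph. Each paragraph is a list of lines and
     * each line is a list of fragment texts (assumed stand-ins for the text
     * returned by TextFragment.getText()). Fragments on a line are
     * concatenated; lines are joined with the platform line separator.
     */
    static String[] toParagraphArray(List<List<List<String>>> paragraphs) {
        List<String> result = new ArrayList<>();
        for (List<List<String>> lines : paragraphs) {
            List<String> lineTexts = new ArrayList<>();
            for (List<String> fragments : lines) {
                lineTexts.add(String.join("", fragments));
            }
            result.add(String.join(System.lineSeparator(), lineTexts));
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two paragraphs, the first with two lines.
        List<List<List<String>>> sample = Arrays.asList(
                Arrays.asList(
                        Arrays.asList("Hello, ", "world"),
                        Arrays.asList("second line")),
                Arrays.asList(
                        Arrays.asList("Another paragraph")));

        String[] paragraphs = toParagraphArray(sample);
        for (int i = 0; i < paragraphs.length; i++) {
            System.out.println("Paragraph " + (i + 1) + ":");
            System.out.println(paragraphs[i]);
        }
    }
}
```

In the real loop, the inner string lists would be replaced by the fragment texts read from each MarkupParagraph; the collection logic itself would stay the same.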

asad.ali · February 12, 2018, 9:32pm

@debashishr

Thanks for contacting support.

I have tested the scenario using your code snippet and was unable to notice any issue. As an output, the code extracted all the text from the PDF document. For your kind reference, I have shared a screenshot of the code snippet and its console output as well. extracted_text.png (25.8 KB)

Please make sure that you set the license before using any API feature and that the license is valid. In case you still face any issue, please share your environment details, i.e., JDK version, OS details, application type, etc., so that we can test the scenario again in our environment and address it accordingly.

debashishr · February 13, 2018, 3:18am

Hi Asad,

I have tried it with Windows 10, JDK - 1.8.0_162.
I do not have the license though. Will I need to purchase it to use this feature?
I was hoping that I could do a POC for the entire application and then if everything works fine, give a demo to the managers and seek an approval for the license.
Can you please confirm that without the license I will not be able to use this feature? I shall speak to the project lead accordingly.

asad.ali · February 13, 2018, 9:23am

@debashishr

Thanks for writing back.

Please note that, while using a trial version of the API, you can only process 4 elements of any collection (e.g., Paragraphs, Pages, Annotations, etc.). In case you want to evaluate the API features completely, you may apply for a 30-day temporary license on our website. Using this license, you will be able to use all features of the API without any limitation. In case you need any further assistance, please feel free to contact us.

debashishr · March 26, 2018, 7:00am

Hi Asad,

We need to read an entire paragraph, delete it from the existing PDF, and replace it with completely new text. The new text may be longer or shorter than the original text. However, when I try this using Aspose, the text does not wrap and runs off the page. I am still investigating alternative ways to do this. Could you please help or provide any guidance on this?

asad.ali · March 26, 2018, 1:26pm

@debashishr

Thanks for contacting support.

Would you please share the sample PDF document along with the sample code snippet with which you are trying to replace the paragraph? We will test the scenario in our environment and address it accordingly.

debashishr · April 8, 2018, 7:13pm

Hi Asad,

So, as you are aware, we are creating an application that reads a PDF paragraph by paragraph, sends each paragraph to an existing application that modifies the text, and then writes the new paragraph back to the PDF.

Hence, we created a POC in which we replaced a few words with another set of words (ReplaceWord_Working). It worked fine after setTextReplaceOptions on TextAbsorber.

But we needed to send the entire paragraph to the application, hence we requested the feature you provided in the new version. Once we got that, we tried to replace the entire paragraph. When we tried it line by line (LineByLine), it did not work, since we needed to work with the entire paragraph rather than individual lines. We also noticed that the text was not wrapping, and we could not find any way to configure ParagraphAbsorber for that.

So then we tried the following (FindReplacePara):

1. Read the entire paragraph on the 1st iteration.
2. Replace only the 1st line with the entire paragraph and set the other lines to “” (empty string).

But the text did not wrap, especially since we used ParagraphAbsorber rather than TextAbsorber.

Finally, we tried to combine the 3rd and 1st approaches:

1. Read the entire paragraph.
2. Replace the 1st text fragment with a position marker (“Paragraph1”).
3. Use TextAbsorber to search for Paragraph1 and replace it with the entire paragraph.

Surprisingly, TextAbsorber did not wrap the text this time.

I have attached the files for your perusal. Please let me know if there is any way that we could make it work.

Also, we would need a feature whereby text inside a table cell wraps within the cell and the cell resizes accordingly.

Attempts.zip (1.8 MB)

asad.ali · April 8, 2018, 10:07pm

@debashishr

Thanks for sharing sample documents and code snippet(s).

We have tested the scenario in our environment and managed to replicate the issues you mentioned. We have logged them in our issue tracking system as follows. We will look further into the details and keep you updated on the status of their correction. Please spare us a little time.

PDFJAVA-35762 - Text is going out of Page Boundaries after converting it to upper-case
PDFJAVA-37627 - Text did not wrap up while replacing using ParagraphAbsorber

We have noticed that the output you shared for the above scenario differs from the one we generated using your code snippet in our environment. Please compare both output files (the one shared by you and the one generated in our environment): there is a “D.” at the start of the PDF document you shared. It is quite possible that you used a different source document for it.

Please share respective input document with us, in order to test the third scenario again. We will log the ticket after observing the issue in our environment.

We are sorry for the inconvenience.
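
While the wrapping issues above are being investigated, one possible stopgap is to break the replacement text into lines yourself before writing it back. The following is a minimal plain-Java sketch of greedy word wrapping, assuming a fixed character budget per line as a rough stand-in for the real line width (an accurate version would need font metrics). The class SimpleWordWrap and its wrap method are hypothetical helpers, not part of Aspose.PDF.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleWordWrap {

    /**
     * Greedily packs whitespace-separated words into lines of at most
     * maxChars characters. A single word longer than maxChars is kept
     * whole on its own line rather than being split.
     */
    static List<String> wrap(String text, int maxChars) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be at least 1");
        }
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            // Start a new line if adding this word (plus a space) would overflow.
            if (current.length() > 0
                    && current.length() + 1 + word.length() > maxChars) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical replacement text and line budget.
        String replacement = "This replacement paragraph is longer than the original text it replaces.";
        for (String line : wrap(replacement, 30)) {
            System.out.println(line);
        }
    }
}
```

Each returned line could then be written back as a separate line of the paragraph; whether a character budget is accurate enough for a given layout would need to be checked against the actual PDF.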

yogesh300890 · March 14, 2019, 1:11pm

Hi, I am trying to extract paragraph text using Aspose.PDF for Java, but it's not working properly. I am using a trial license.

POM :

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>18.9</version>
</dependency>

Code :
public static void getParagraphTextFromPDF(String filePath) {
    Document doc = new Document(filePath);
    ParagraphAbsorber pa = new ParagraphAbsorber();
    pa.visit(doc);
    for (PageMarkup pm : pa.getPageMarkups()) {
        for (MarkupSection ms : pm.getSections()) {
            for (MarkupParagraph mp : ms.getParagraphs()) {
                StringBuilder sb = new StringBuilder();
                for (List<TextFragment> tflist : mp.getLines()) {
                    for (TextFragment tf : tflist) {
                        sb.append(tf.getText());
                    }
                    sb.append("/r/n");
                }
                sb.append("/r/n");
                System.out.println(sb);
            }
        }
    }
}

Input file : Sample.pdf (154.3 KB)

Output :
[SOUTHERN COMPANY LOGO]/r/n/r/n
SOUTHERN COMPANY GENERATION/r/n/r/n
Semiannual Period./r/n/r/n
withheld, but in no event shall such consent be withheld if the following/r/n/r/n

CONFIDENTIAL MATERIAL HAS BEEN/r/n/r/n

asad.ali · March 14, 2019, 8:03pm

@yogesh300890

Thanks for contacting support.

We have tested the scenario in our environment using Aspose.PDF for Java 19.2 and noticed that the API was able to extract all paragraphs from your PDF document. Please check the attached output generated at our end.
Paragraphs.zip (8.2 KB)

Please make sure to use the latest version of the API with a valid license. If you are not using a valid license or do not have one, please consider applying for a free 30-day temporary license. If you still face any issue, please feel free to let us know.
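
A note on the /r/n text visible in the posted output: in Java, the string "/r/n" is four ordinary characters, because forward slashes are not escape characters, whereas "\r\n" is the two-character carriage return plus line feed sequence that the C# example earlier in the thread appends. The following plain-Java sketch shows the difference; the class LineBreakDemo and its method joinWithCrlf are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class LineBreakDemo {

    /** Joins lines with a real CRLF sequence (backslash escapes). */
    static String joinWithCrlf(List<String> lines) {
        return String.join("\r\n", lines);
    }

    public static void main(String[] args) {
        // Forward slashes are plain characters, so this is literal text.
        System.out.println("/r/n".length()); // 4 characters: '/', 'r', '/', 'n'
        // Backslash escapes produce the control characters CR and LF.
        System.out.println("\r\n".length()); // 2 characters
        System.out.println(joinWithCrlf(Arrays.asList("first line", "second line")));
    }
}
```

This only affects how the extracted text is formatted; it is separate from the question of how many paragraphs the trial version returns.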