Need advice on reading text from a PDF

supam · March 2, 2017, 6:20am

Hello Team,

We are evaluating the product before purchasing. We are using Aspose.Pdf to read the content of a PDF and display it in a web form.

Business requirement: we will have PDFs with some standard content. We need to read each PDF and map its content to a web page.

For example, suppose I have a PDF form called “EmployeeRegistration.pdf”

with contents like this:

Name: Jim Glyn

Age: 28

Description:

Some paragraph Some paragraph

What we want is to read all of these field values (Name, Age, Description) and map them to the matching UI fields.

Waiting for your reply.

Suprith

amsuprith@gmail.com

asad.ali · March 3, 2017, 5:39am

Hi Suprith,

Thanks for using our API.

To read field values from a PDF form, you need to iterate over all fields in the form's field collection and get the value for the specific field name. Please check the “Get Values from all fields of a PDF Document” section in our API documentation. You may also read the value of a particular field in the PDF form by specifying the field name; see the “Get value from individual field” section in the API documentation for that requirement. If you have any further queries, please feel free to let us know.

Best Regards,

supam · March 4, 2017, 2:07am

Hi Team,

Thanks for the swift response :).

This works for PDF forms. However, I am talking about a plain PDF that looks like a form but has no form elements in it. How do we deal with this case? We need to identify the values entered for particular fields.

Note: We will know the field names ahead of time, as the templates are specified by us. So the challenge is how to extract the right content entered for each field.

I have attached a sample PDF for reference. If this gets sorted out, we will use Aspose for our PDF operations.

Regards,

Suprith

asad.ali · March 6, 2017, 5:31am

Hi Suprith,

Thanks for sharing the sample document. I am looking into how to extract particular text from the document and will get back to you shortly.

Best Regards,

supam · March 7, 2017, 7:55am

Thanks, waiting for your reply.

asad.ali · March 7, 2017, 3:06pm

Hi Suprith,

Thanks for your patience. Aspose.Pdf offers various features for extracting text from a PDF file; you may check the “Extract Text from PDF” section in the documentation. However, since your PDF file contains only plain text, there is a workaround for extracting the text that follows a specific string.

Please check the following code snippet, which I have used to extract the values of some fields, i.e. Name, Number and Value (USD). I extracted all the text from the PDF using TextAbsorber and performed some string operations to get the required values.

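using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

Document pdfDocument = new Document(dataDir + "SamplePDF.pdf");

// Collect the text of every page in the document.
TextAbsorber ta = new TextAbsorber();
pdfDocument.Pages.Accept(ta);

// Split on both line-ending styles so that no stray '\r' is left on a line.
string extractedtext = ta.Text;
string[] contents = extractedtext.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

foreach (string s in contents)
{
    // Test for the same label text that is removed, so the check and the replacement agree.
    if (s.Contains("Name*"))
        Console.WriteLine(s.Replace("Name*", "").Trim());

    if (s.Contains("number"))
        Console.WriteLine(s.Replace("number", "").Trim());

    if (s.Contains("Value (USD)*"))
        Console.WriteLine(s.Replace("Value (USD)*", "").Trim());
}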

You can use the above code snippet to extract the value of a particular field. If you need any further assistance, please feel free to let us know.

Best Regards,

supam · March 8, 2017, 2:37am

Thanks for the reply. This works for single-line text, but what about the field
"Additional details*", which contains multiple lines of text? Thanks in advance.

asad.ali · March 8, 2017, 11:48am

Hi Suprith,

Thanks for your inquiry. Please check the following code snippet, which I have used to extract the value of Additional Details from the sample PDF.

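Document pdfDocument = new Document(dataDir + "SamplePDF.pdf");

// Extract text from the first page only (page indexes start at 1).
TextAbsorber ta = new TextAbsorber();
pdfDocument.Pages[1].Accept(ta);

string extractedtext = ta.Text;
const string label = "Additional Details*";

// Everything after the label is treated as the field value.
int labelIndex = extractedtext.IndexOf(label);
if (labelIndex >= 0)
{
    // Note: Replace("1", "") removes every "1" character in the value; adjust this for your own documents.
    string additionaldetails = extractedtext.Substring(labelIndex + label.Length).Replace("1", "").Trim();
    Console.WriteLine("Additional Details: " + additionaldetails);
}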

You may use the above code to extract multi-line text, or modify it as per your requirements. If you need any further assistance, please feel free to contact us.
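
Since the field names are known ahead of time, the two snippets above could be generalized: locate every known label in the extracted text and take whatever lies between a label and the next one as its value, which also covers multi-line fields. The following is only a sketch under assumptions: `FieldExtractor`, `Extract` and the sample labels are hypothetical names and are not part of Aspose.Pdf, and the hard-coded string stands in for the text returned by `ta.Text`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FieldExtractor
{
    // Hypothetical helper: maps each known label to the text that follows it,
    // up to the start of the next label found in the text (or the end of the text).
    public static Dictionary<string, string> Extract(string text, IEnumerable<string> labels)
    {
        // Find the first occurrence of each label; labels missing from the text are skipped.
        var hits = labels
            .Select(label => new { Label = label, Index = text.IndexOf(label, StringComparison.Ordinal) })
            .Where(hit => hit.Index >= 0)
            .OrderBy(hit => hit.Index)
            .ToList();

        var values = new Dictionary<string, string>();
        for (int i = 0; i < hits.Count; i++)
        {
            int start = hits[i].Index + hits[i].Label.Length;
            int end = i + 1 < hits.Count ? hits[i + 1].Index : text.Length;
            end = Math.Max(start, end); // guard against overlapping labels
            values[hits[i].Label] = text.Substring(start, end - start).Trim();
        }
        return values;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for ta.Text; the labels come from the template, which is known in advance.
        string text = "Name: Jim Glyn\r\nAge: 28\r\nDescription:\r\nSome paragraph\r\nSome paragraph";
        var fields = FieldExtractor.Extract(text, new[] { "Name:", "Age:", "Description:" });

        Console.WriteLine(fields["Name:"]);        // Jim Glyn
        Console.WriteLine(fields["Age:"]);         // 28
        Console.WriteLine(fields["Description:"]); // both paragraph lines
    }
}
```

Only the first occurrence of each label is used, and a label that also appears inside ordinary text (for example "Age:" inside "Page:") would be matched there, so labels should be chosen to be as distinctive as the template allows.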

Best Regards,

supam · March 30, 2017, 12:46am

Hi Team,

Thanks for your response. We are using the above approach; however, we have the following two questions.

1. When we extract text using TextAbsorber, will it retain the formatting of the text?
2. We have a PDF document of 600 pages. Does the product have any limit on the number of pages?

Thanks,

Suprith

asad.ali · March 30, 2017, 8:55am

Hi Suprith,

Thanks for your inquiry.

supam:

When we extract text using TextAbsorber, will it retain the formatting of the text?

To preserve the formatting of the text, you need to use TextExtractionOptions and pass it to the constructor of the TextAbsorber class. Please check the following code snippet, which uses TextExtractionOptions while extracting text from a PDF.

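Document pdfDocument = new Document(dataDir + "SamplePDF.pdf");

// Pure formatting mode is the setting used here to retain the formatting of the text.
TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);

TextAbsorber ta = new TextAbsorber(textExtOptions);

// Extract text from the first page; the result is available in ta.Text.
pdfDocument.Pages[1].Accept(ta);
string extractedtext = ta.Text;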

supam:

We have a PDF document of 600 pages. Does the product have any limit on the number of pages?

The Aspose.Pdf for .NET API has no limitation on document size or on the number of pages a document contains. You may extract text from a single page of the document as well as from the entire document. For more information, please visit the “Working With Text” section in our API documentation. If you need further assistance, please feel free to let us know.

Best Regards,

supam · April 1, 2017, 12:08am

Brilliant, thanks for the response. We will get back to you if we face any issues.

asad.ali · April 3, 2017, 5:10am

Hi Suprith,

Thanks for your feedback. Please take your time to implement the functionality, and if you run into any issues, please feel free to contact us. We will be more than happy to extend our support.

Best Regards,

abhiso · April 6, 2017, 5:58am

Hi Asad,

We are trying to add rows to an existing table in a PDF file using .NET.

We can read the table using TableAbsorber and get an “AbsorbedTable” object, but it does not let us add rows.

We are looking for a way to go about this.

fahadadeel · April 7, 2017, 3:10am

Hi Suprith,

Thanks for contacting support.

I am afraid the Aspose.Pdf for .NET API does not currently support adding rows to an existing table. However, we have already logged this requirement in our issue tracking system, under the New Features list, as PDFNET-38999. We will investigate this requirement in detail and will keep you updated on the status of its implementation.

We apologize for the inconvenience.

Best Regards,

supam · June 14, 2017, 1:03am

Hi Team,

Could you please let us know a way to lock a PDF against manipulation by external processes?

We are looking for a way in Aspose to mark the PDF as not editable, even by Aspose itself for that matter.

Regards,
Suprith

codewarior · June 14, 2017, 3:01am

Hi Amsuprith,

Thanks for contacting support.

To accomplish your requirements, please try following the instructions in the “Set Privileges, Encrypt and Decrypt PDF File” documentation topic.

Besides this, you may consider using the Viewer API of our sister company, GroupDocs. It displays PDF and other documents in such a way that users cannot modify the file being displayed (they cannot even copy its contents).