Can Aspose extract the data inside a given PDF file and edit it?

weibanban · August 3, 2015, 11:49pm

We are evaluating a requirement to open an existing PDF file that comes from another system. The characteristics of the PDF file are:

Multiple pages
The content may include statements, free text, and tables.

We would like to know:

Whether the text content is editable.
Whether the tables can be edited and filled in with new values.

codewarior · August 4, 2015, 8:32am

Hi Jeff,

Thanks for your interest in our APIs.

Aspose.Pdf for .NET can extract and update the contents of a PDF file, including the contents of tables in an existing PDF file. For more information, please see the Aspose.Pdf for .NET documentation.

Please try the following code snippet to update the contents of a table in an existing PDF file.

[C#]

using Aspose.Pdf;
using Aspose.Pdf.Text;

// inFile and outFile are the input and output PDF paths
Document pdfDocument = new Document(inFile);

// Create a TableAbsorber object to find tables
TableAbsorber absorber = new TableAbsorber();

// Visit the first page with the absorber (page indexes start at 1)
absorber.Visit(pdfDocument.Pages[1]);

// Get the first text fragment in the first cell of the first row of the first table on the page
TextFragment fragment = absorber.TableList[0].RowList[0].CellList[0].TextFragments[1];

// Change the text of that fragment
fragment.Text = "hi world";

// Save the updated document
pdfDocument.Save(outFile);

weibanban · August 4, 2015, 9:38pm

OK, thanks for the quick feedback.

My further question is: can Aspose convert the given PDF file to an HTML page? The purpose would be to give the user more freedom to edit the page and then save it back to a PDF file.

codewarior · August 5, 2015, 4:10am

Hi Jeff,

Please see the documentation topic "Convert PDF File into HTML Format" for the required information.
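
As a rough sketch of what that conversion looks like, a PDF can be saved as HTML by passing an HtmlSaveOptions instance to Document.Save. The file names below are placeholders rather than files from this thread, and the default options are assumed to be good enough for a first try.

```csharp
using Aspose.Pdf;

class PdfToHtmlSketch
{
    static void Main()
    {
        // "input.pdf" and "output.html" are placeholder paths (assumptions).
        Document pdfDocument = new Document("input.pdf");

        // Default HTML conversion options; adjust them if the output layout needs tuning.
        HtmlSaveOptions htmlOptions = new HtmlSaveOptions();

        // Write the HTML rendition of the PDF.
        pdfDocument.Save("output.html", htmlOptions);
    }
}
```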

weibanban · August 7, 2015, 3:43am

Hi

Thanks. I read the documents you recommended and successfully converted the PDF file to Word and HTML formats.

However, I found that if the given PDF file is simple, the converted Word document looks good. But if the original PDF file contains tables mixed with text, the conversion produces many misaligned elements, which makes the converted Word document hard to read.

Is this a known issue?

tilal.ahmad · August 9, 2015, 11:50pm

Hi Jeff,

Thanks for your feedback. I am afraid there is no such known issue; it seems to be a document-specific issue. We would appreciate it if you could share a problematic document here. We will look into it and guide you accordingly.

Moreover, you may try the Textbox recognition mode instead of Flow to preserve the original look of the document.
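
A minimal sketch of switching the recognition mode, assuming the conversion goes through DocSaveOptions and using placeholder file names:

```csharp
using Aspose.Pdf;

class PdfToWordTextboxSketch
{
    static void Main()
    {
        // Placeholder paths (assumptions); substitute the problematic document.
        Document pdfDocument = new Document("input.pdf");

        DocSaveOptions saveOptions = new DocSaveOptions();

        // Textbox mode places text in positioned text boxes, which tends to keep
        // the original look; Flow mode aims for a more editable document whose
        // layout may differ from the source PDF.
        saveOptions.Mode = DocSaveOptions.RecognitionMode.Textbox;
        saveOptions.Format = DocSaveOptions.DocFormat.DocX;

        pdfDocument.Save("output.docx", saveOptions);
    }
}
```

The trade-off is editability: text held in text boxes is usually harder to reflow in Word than text produced by Flow mode.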

We are sorry for the inconvenience caused.

Best Regards,

weibanban · August 11, 2015, 3:03am

Thanks

Suppose I have the attached PDF file. I am required to do the following:

1. Extract the content of the “INTERPRETATION” table on the first page. The existing content is “The patient was in normal sinus rhythm throughout.”

2. Change the existing content to something else, for example “ABC”.

Is this possible using Aspose?

Jeff

tilal.ahmad · August 12, 2015, 12:14am

Hi Jeff,

Thanks for your inquiry. We have a class, TableAbsorber, to search and edit existing simple tables. However, it is malfunctioning at the moment. I have logged a ticket, PDFNEWNET-39178, in our issue tracking system for investigation and rectification. We will notify you as soon as it is resolved.

We are sorry for the inconvenience caused.

Best Regards,

weibanban · August 12, 2015, 10:07pm

Dear Sirs

Maybe I can provide more information. The PDF file I uploaded is version 1.3 (Acrobat 4.x). PDF files with a higher version work well.

Does Aspose support this version of PDF file?

weibanban · August 13, 2015, 2:58am

Dear Sirs

Continuing from my previous question: suppose I want to extract the “Narrative Summary” on the 2nd page of the attached PDF file. It contains two paragraphs. How can I extract this text as two paragraphs?

I tried the following code snippet to extract the text:

TextFragmentAbsorber tfa = new TextFragmentAbsorber();
pdfDocument.Pages[2].Accept(tfa);

TextFragmentCollection tfc = tfa.TextFragments;
foreach (TextFragment tf in tfc)
{
    result += tf.Text + "\r\n";
}

But the result is broken into 14 strings. Is there a better way for me to meet these requirements?

1. Get the content right after the title “Narrative Summary” and before the end of the page.

2. The content should be in two paragraphs.

tilal.ahmad · August 14, 2015, 2:11am

Hi Jeff,

Thanks for your feedback. I am afraid the PDF version is not the cause of the issue. I upgraded the file to version 1.7 and tested again, but still without success. We will share our findings and an ETA as soon as our product team completes the investigation.

Best Regards,

tilal.ahmad · August 14, 2015, 3:16am

Hi Jeff,

Thanks for your inquiry. In this case you need to use a regular expression to search the text. Please check the following code snippet, which gets the text after "Narrative Summary". Hopefully it will help you accomplish the task.

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Open the document
Document pdfDocument1 = new Document(@"Sample.pdf");

// Create a TextFragmentAbsorber that matches everything after "Narrative Summary"
TextFragmentAbsorber textFragmentAbsorber1 = new TextFragmentAbsorber(@"(?<=Narrative Summary)((.|\n)*)");

// Set the text search option to enable regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber1.TextSearchOptions = textSearchOptions;

// Accept the absorber for the second page only
pdfDocument1.Pages[2].Accept(textFragmentAbsorber1);

// Get the extracted text fragments
TextFragmentCollection textFragmentCollection1 = textFragmentAbsorber1.TextFragments;

// Loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection1)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
}
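
To see what the pattern does independently of Aspose.Pdf, the same lookbehind can be tried on a plain string with .NET's Regex class. This is only a sketch: the class and method names are made up for illustration, the sample text is invented, and splitting paragraphs on blank lines assumes the extracted text keeps a blank line between paragraphs, which may not hold for every PDF.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NarrativeSummaryDemo
{
    // Returns everything after the first occurrence of the heading, trimmed.
    // Returns an empty string when the heading is not present.
    public static string TextAfterHeading(string pageText, string heading)
    {
        // (?<=...) is a lookbehind: the match starts right after the heading,
        // and (.|\n)* then consumes the rest of the text, line breaks included.
        string pattern = "(?<=" + Regex.Escape(heading) + ")((.|\\n)*)";
        Match match = Regex.Match(pageText, pattern);
        return match.Success ? match.Value.Trim() : string.Empty;
    }

    // Splits text into paragraphs wherever two or more line breaks occur in a row
    // (assumption: paragraphs are separated by blank lines).
    public static string[] SplitParagraphs(string text)
    {
        return Regex.Split(text.Trim(), @"(?:\r?\n[ \t]*){2,}");
    }

    public static void Main()
    {
        // Invented sample text standing in for the page content.
        string pageText =
            "Report\nNarrative Summary\nFirst paragraph,\nsecond line.\n\nSecond paragraph.";

        string body = TextAfterHeading(pageText, "Narrative Summary");
        foreach (string paragraph in SplitParagraphs(body))
        {
            Console.WriteLine("Paragraph: {0}", paragraph);
        }
    }
}
```

Because the lookbehind only asserts that the heading precedes the match, the heading itself is not part of the returned text.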

Please feel free to contact us for any further assistance.

Best Regards,

weibanban · August 18, 2015, 12:54am

Dear Sirs

Thanks very much for your feedback. Yes, it does work!

I have one more question: do you have any UI control that can render a PDF document? Ideally it would work in an environment where Adobe Reader has not been installed, and it would accept a PDF stream rather than a PDF file path.

We often need to create temporary PDF files because many controls can only accept an existing physical PDF file rather than a PDF stream. To work around this, we have to create the file from the stream and delete it afterwards. Ideally, the control would render the PDF directly from the given stream.

tilal.ahmad · August 18, 2015, 10:58pm

Hi Jeff,

Thanks for your feedback. It is good to know that the suggested code worked for you.

Moreover, please note that Aspose.Pdf for .NET is a class library and does not provide any GUI. It enables your .NET applications to read, write and manipulate PDF documents without using Adobe Acrobat. However, you may check the GroupDocs.Viewer for .NET library, a product of our sister company. It will help you render and manipulate PDF documents in a GUI.
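
On the stream question: although Aspose.Pdf has no viewer, its Document class can be constructed from a Stream and saved to one, so the library itself does not require temporary files. A minimal sketch, assuming the PDF bytes are already in memory (the class, method and parameter names are illustrative only):

```csharp
using System.IO;
using Aspose.Pdf;

static class PdfStreamSketch
{
    // Loads a PDF from bytes held in memory and returns the re-saved bytes,
    // without writing a temporary file. "pdfBytes" is a hypothetical input.
    public static byte[] RoundTrip(byte[] pdfBytes)
    {
        using (MemoryStream input = new MemoryStream(pdfBytes))
        using (Document pdfDocument = new Document(input))
        using (MemoryStream output = new MemoryStream())
        {
            // Read or modify pdfDocument here as needed.
            pdfDocument.Save(output);
            return output.ToArray();
        }
    }
}
```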

Please feel free to contact us for any further assistance.

Best Regards,

weibanban · August 30, 2015, 11:54pm

Dear Sirs

Do you have any reporting controls for use in web pages?

Best Regards

Jeff

tilal.ahmad · August 31, 2015, 11:40pm

Hi Jeff,

Thanks for your inquiry. I am afraid Aspose.Pdf does not provide any reporting or web control. However, Aspose.Cells provides a web control, GridWeb, which can be used for reporting purposes. Please consider the following documentation and demo. Hopefully they will help you meet your requirements.

Working with Aspose.Cells.GridWeb

Demo:

Aspose.Cells for GridWeb Demo.

Please feel free to contact us for any further assistance.

Best Regards,

aspose.notifier · June 23, 2018, 9:01pm

The issues you have found earlier (filed as PDFNET-39178) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by asad.ali