Can I extract text and images from a PDF? I need to extract images while maintaining their position as indicated in the PDF, meaning the images should be extracted in the same order and placement, with any text following the respective image.
Yes, you can extract both text and images from a PDF along with their positions. The following example shows how to extract text fragments together with their coordinates:
// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void Extract()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open the document
    using (var document = new Aspose.Pdf.Document(dataDir + "SearchAndGetTextFromAll.pdf"))
    {
        // Create a TextFragmentAbsorber with no search phrase so that it collects all text fragments
        var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();

        // Accept the absorber for all the pages
        document.Pages.Accept(textFragmentAbsorber);

        // Get the extracted text fragments
        var textFragmentCollection = textFragmentAbsorber.TextFragments;

        // Loop through the fragments and print each one with its position
        foreach (var textFragment in textFragmentCollection)
        {
            Console.WriteLine("Text : {0} ", textFragment.Text);
            Console.WriteLine("Position : {0} ", textFragment.Position);
            Console.WriteLine("XIndent : {0} ", textFragment.Position.XIndent);
            Console.WriteLine("YIndent : {0} ", textFragment.Position.YIndent);
        }
    }
}
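Images can be located in a similar way. The following is a minimal sketch rather than an official sample: it assumes the ImagePlacementAbsorber class is used once per page and that the placement rectangle is enough to describe where each image sits. The class name ImagePositionSketch, the method name PrintImagePlacements and the pdfPath parameter are hypothetical and only serve the illustration.

```csharp
using System;
using Aspose.Pdf;

internal static class ImagePositionSketch
{
    // Hypothetical helper: prints the page number, position and size of every image.
    public static void PrintImagePlacements(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            foreach (Page page in document.Pages)
            {
                // A fresh absorber per page keeps the results scoped to that page
                var imageAbsorber = new ImagePlacementAbsorber();
                page.Accept(imageAbsorber);

                foreach (ImagePlacement placement in imageAbsorber.ImagePlacements)
                {
                    // Rectangle is expressed in PDF page coordinates (origin at the lower-left corner)
                    var rect = placement.Rectangle;
                    Console.WriteLine("Page : {0} ", page.Number);
                    Console.WriteLine("Lower-left : ({0}, {1}) ", rect.LLX, rect.LLY);
                    Console.WriteLine("Size : {0} x {1} ", rect.Width, rect.Height);
                }
            }
        }
    }
}
```

The rectangle values are in the same coordinate space as the text fragment positions printed above, so the two sets of results can be compared directly.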
If I extract text and images separately, how can I identify which text belongs to which image?
I need to extract them sequentially so I can print them into Excel.
We need to investigate the feasibility of this, because elements in PDF documents are stored differently from how they appear on the page. There has to be some association between the text and the images in order to achieve this kind of extraction. Could you please share one of your sample files so that we can log an investigation ticket in our issue management system, carry out the analysis, and address it accordingly?
As I previously shared a demo file, you can see a demo PDF and the corresponding output Excel file. If I need to generate a similar Excel file from PDF data, the text and images from the PDF should first be extracted. I can then store that extracted data into my object and use it to generate the Excel file in the required format.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-59057
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
I am trying to extract table data from a PDF, where the PDF contains only 3 columns, but when I fetch the data, it is showing 4 columns instead. There are several issues like this. For reference, I have attached images where you can see that ‘PdfTable’ represents my data in the PDF, and ‘ExtractedOutputUsingAsposeTableMethod’ shows the data I am getting using the table method of Aspose.
You can check the status of the ticket at the bottom of this forum thread where it is attached. Additionally, we will also update you as soon as we make some progress towards ticket resolution.
Are you using the API with a valid license? Please try using a 30-day free temporary license and let us know if you still notice this issue.
The ticket has only recently been logged in our issue management system. It will be prioritized on a first-come, first-served basis as per the free support policies. As soon as we make some progress towards its resolution, we will inform you. Please be patient and spare us some time.
We are afraid that the earlier logged ticket has not yet been resolved. We will inform you via this forum thread once we have definite news about its fix ETA. Please spare us some time.
Is there a way to extract images from a PDF and place them in an Excel file while maintaining their original positions relative to the text? If so, is there a solution for this?
You can extract images from a PDF independently using the API. It gives you their positions as well as their dimensions, which you can use to add them to an Excel file. Please note that the position of objects in a PDF file may differ from their position in the Excel file, because the PDF and Excel file formats are quite different from one another. To add images to an Excel file, you can check the Aspose.Cells API.
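One possible way to relate the independently extracted items is to sort them by page and then by vertical position. The sketch below is only a heuristic under stated assumptions, not the outcome of the investigation ticket: it assumes a single-column layout, relies on PDF coordinates starting at the lower-left corner of the page (so a larger Y value is higher on the page), and treats any text that sorts after an image as following that image. The PageItem class and the ExtractInReadingOrder method are hypothetical names introduced for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Text;

internal static class ReadingOrderSketch
{
    // Hypothetical container for one extracted element (text fragment or image)
    private sealed class PageItem
    {
        public int PageNumber;
        public double Top;    // upper edge of the element in PDF coordinates
        public double Left;   // left edge of the element in PDF coordinates
        public string Kind;   // "Text" or "Image"
        public string Text;   // fragment text, empty for images
    }

    public static void ExtractInReadingOrder(string pdfPath)
    {
        var items = new List<PageItem>();

        using (var document = new Document(pdfPath))
        {
            foreach (Page page in document.Pages)
            {
                // Collect the text fragments of this page
                var textAbsorber = new TextFragmentAbsorber();
                page.Accept(textAbsorber);
                foreach (TextFragment fragment in textAbsorber.TextFragments)
                {
                    items.Add(new PageItem
                    {
                        PageNumber = page.Number,
                        Top = fragment.Rectangle.URY,
                        Left = fragment.Rectangle.LLX,
                        Kind = "Text",
                        Text = fragment.Text
                    });
                }

                // Collect the image placements of this page
                var imageAbsorber = new ImagePlacementAbsorber();
                page.Accept(imageAbsorber);
                foreach (ImagePlacement placement in imageAbsorber.ImagePlacements)
                {
                    items.Add(new PageItem
                    {
                        PageNumber = page.Number,
                        Top = placement.Rectangle.URY,
                        Left = placement.Rectangle.LLX,
                        Kind = "Image",
                        Text = string.Empty
                    });
                }
            }
        }

        // Page first, then top-to-bottom (descending Y), then left-to-right
        var ordered = items
            .OrderBy(i => i.PageNumber)
            .ThenByDescending(i => i.Top)
            .ThenBy(i => i.Left);

        foreach (var item in ordered)
        {
            Console.WriteLine("{0} | page {1} | top {2} | left {3} | {4}",
                item.Kind, item.PageNumber, item.Top, item.Left, item.Text);
        }
    }
}
```

The ordered list can then be walked once to write rows into a spreadsheet. Multi-column pages, rotated content, or text that overlaps an image would need a more careful grouping than this simple sort.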