My requirement is as follows:
I have a PDF that contains text, along with images and tables. I need to create an Excel file based on this PDF, where the data is organized exactly as it appears in the PDF.
Is there a solution provided by Aspose that allows me to extract text, images, and tables from the PDF and replicate them in the Excel file?
Aspose.PDF provides features to extract text as well as images. However, you would need to extract them in a specific way so that the result can be used to generate an Excel file with Aspose.Cells. This particular scenario needs an investigation. Alternatively, you can generate an Excel file directly from the source PDF using the PDF to Excel conversion feature, if that suits you. Please share a sample source PDF and the expected output Excel file so that we can analyze the scenario accordingly.
I have attached a sample source document with the expected output. Please check and let me know if it is possible to fetch data like this using Aspose. The input is a PDF file, and the expected output can be seen in the attached Excel file.
Could you please check the attached Excel file, which we generated in our environment using the code snippet below, and let us know whether it works for you?
// Load the source PDF (dataDir is the folder that holds the sample file)
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(dataDir + "AsposeDemoPDF_00.pdf");
// Configure the PDF to Excel conversion
Aspose.Pdf.ExcelSaveOptions exl = new Aspose.Pdf.ExcelSaveOptions();
//exl.ConversionEngine = ExcelSaveOptions.ConversionEngines.NewEngine;
exl.Format = Aspose.Pdf.ExcelSaveOptions.ExcelFormat.XLSX;
exl.InsertBlankColumnAtFirst = false;
exl.MinimizeTheNumberOfWorksheets = true;
// Save the document as an XLSX workbook
pdf.Save(dataDir + "AsposeDemoPDF_00.xlsx", exl);
I tried that code, and it does convert the PDF to Excel. However, I have two questions.
First, the code is not able to convert the entire PDF; I think it works on only 3 to 5 pages.
Second, if I need to manipulate the converted Excel file, such as removing the header or footer, is there an option to do that as well?
You are facing this limitation because of the trial version. You can use a free 30-day temporary license to remove it. Please check the article below for setting a license before conversion:
We have moved the inquiry to the Purchase forum, which is the right place for such questions. We have raised your concerns internally, and the relevant team will update you shortly. You may receive a reply in the dedicated thread. We apologize for any trouble you may have faced.
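As a minimal sketch of that step, the license is typically applied once, before any document is loaded. The file name "Aspose.PDF.lic" below is only a placeholder for wherever you saved your temporary license file.

```csharp
// Apply the license once at application startup, before creating any Document.
// "Aspose.PDF.lic" is a placeholder name; use the path to your own license file.
var license = new Aspose.Pdf.License();
license.SetLicense("Aspose.PDF.lic");
```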
I tried using a temporary license, and we successfully converted the PDF to Excel. However, it would be very helpful if I could fetch all the PDF data and generate the Excel file according to my requirements. Currently, it directly converts the PDF to Excel, but I want to manually generate the Excel file.
You will need to use Aspose.Cells to create Excel files manually from scratch, as Aspose.PDF is not built to create them. You can extract text, images, and tables from PDF documents using Aspose.PDF and then use Aspose.Cells to create the Excel file as per your needs. Please check the helpful documentation articles below:
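As a rough illustration of how the two libraries could be combined, the sketch below reads tables with TableAbsorber from Aspose.PDF and writes each cell's text into a worksheet with Aspose.Cells. The file names "input.pdf" and "output.xlsx" are placeholders, and the layout choice (all tables stacked on one sheet, separated by a blank row) is just an assumption for the example; adjust it to your required format.

```csharp
using System.Text;

using (var pdf = new Aspose.Pdf.Document("input.pdf")) // placeholder path
{
    var workbook = new Aspose.Cells.Workbook();
    Aspose.Cells.Worksheet sheet = workbook.Worksheets[0];
    int excelRow = 0;

    foreach (Aspose.Pdf.Page page in pdf.Pages)
    {
        // Detect the tables on the current page
        var absorber = new Aspose.Pdf.Text.TableAbsorber();
        absorber.Visit(page);

        foreach (Aspose.Pdf.Text.AbsorbedTable table in absorber.TableList)
        {
            foreach (Aspose.Pdf.Text.AbsorbedRow row in table.RowList)
            {
                int excelCol = 0;
                foreach (Aspose.Pdf.Text.AbsorbedCell cell in row.CellList)
                {
                    // A PDF table cell can hold several text fragments; join them
                    var text = new StringBuilder();
                    foreach (Aspose.Pdf.Text.TextFragment fragment in cell.TextFragments)
                    {
                        text.Append(fragment.Text);
                    }
                    sheet.Cells[excelRow, excelCol].PutValue(text.ToString());
                    excelCol++;
                }
                excelRow++;
            }
            excelRow++; // leave one blank row between tables
        }
    }

    workbook.Save("output.xlsx"); // placeholder path
}
```

Type names are fully qualified here because both libraries define some types with the same short names.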
When converting a PDF to Excel, I sometimes want the conversion to start from a specific PDF page. For example, I may want it to start from page 3 or page 4 instead of page 1. Is that possible?
You will need to split the PDF document by extracting the required pages into a new Document instance and then convert that document into an Excel file. Please check the articles below on extracting PDF pages and creating a new document:
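A minimal sketch of that approach is given here. The file names and the startPage value are placeholders chosen for illustration; note that page numbers in Aspose.PDF start at 1.

```csharp
using Aspose.Pdf;

// Placeholder values for illustration only
string inputPath = "input.pdf";
string outputPath = "output.xlsx";
int startPage = 3; // first page to include (pages are 1-based)

using (var source = new Document(inputPath))
using (var subset = new Document())
{
    // Copy every page from startPage to the last page into the new document
    for (int i = startPage; i <= source.Pages.Count; i++)
    {
        subset.Pages.Add(source.Pages[i]);
    }

    // Convert only the copied pages to Excel
    var options = new ExcelSaveOptions { Format = ExcelSaveOptions.ExcelFormat.XLSX };
    subset.Save(outputPath, options);
}
```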
Can I extract text and images from a PDF? I need to extract images while maintaining their position as indicated in the PDF, meaning the images should be extracted in the same order and placement, with any text following the respective image.
Yes, you can extract both text and images from a PDF together with their positions and coordinates. Please check the example below for more details:
// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void Extract()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();
    // Open the document
    using (var document = new Aspose.Pdf.Document(dataDir + "SearchAndGetTextFromAll.pdf"))
    {
        // Create a TextFragmentAbsorber without a search phrase so that it collects all text
        var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
        // Accept the absorber for all the pages
        document.Pages.Accept(textFragmentAbsorber);
        // Get the extracted text fragments
        var textFragmentCollection = textFragmentAbsorber.TextFragments;
        // Loop through the fragments and print each one with its position
        foreach (var textFragment in textFragmentCollection)
        {
            Console.WriteLine("Text : {0} ", textFragment.Text);
            Console.WriteLine("Position : {0} ", textFragment.Position);
            Console.WriteLine("XIndent : {0} ", textFragment.Position.XIndent);
            Console.WriteLine("YIndent : {0} ", textFragment.Position.YIndent);
        }
    }
}
If I extract text and images separately, how can I identify which text belongs to which image?
I need to extract them sequentially so I can print them into Excel.
We need to investigate the feasibility of this, because elements in PDF documents are not stored in reading order, and there is no built-in association between a piece of text and an image. Some such association would be needed to achieve this kind of extraction. Could you please share one of your sample files so that we can log an investigation ticket in our issue management system, carry out the analysis, and address it accordingly?
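In the meantime, one possible heuristic (not a built-in Aspose feature) is to record the page number and coordinates of every extracted text fragment and image, and then sort the combined list into reading order. The sketch below uses a hypothetical PageItem record and ReadingOrder helper that are not part of any Aspose API. It assumes the coordinates come from TextFragment.Position for text and from the placement rectangle reported by ImagePlacementAbsorber for images, and that the page has a simple single-column layout.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data standing in for values collected during extraction (made up for illustration)
var items = new List<PageItem>
{
    new PageItem("text", "Caption for figure 1", 1, 72, 480),
    new PageItem("image", "figure1.png", 1, 72, 500),
    new PageItem("text", "Heading", 1, 72, 700),
};

foreach (var item in ReadingOrder.Sort(items))
{
    Console.WriteLine($"{item.Kind}: {item.Content}");
}

// Hypothetical container for one extracted element; not an Aspose type
public record PageItem(string Kind, string Content, int Page, double X, double Y);

public static class ReadingOrder
{
    // PDF coordinates start at the bottom-left corner, so a larger Y is higher on the page.
    // Order: page ascending, then top to bottom, then left to right.
    public static List<PageItem> Sort(IEnumerable<PageItem> items) =>
        items.OrderBy(i => i.Page)
             .ThenByDescending(i => i.Y)
             .ThenBy(i => i.X)
             .ToList();
}
```

With the sample data, the heading is listed first, then the image, then its caption. Real documents may need a tolerance when comparing Y values, since items on the same visual line rarely share exactly the same coordinate.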
As I previously shared a demo file, you can see a demo PDF and the corresponding output Excel file. If I need to generate a similar Excel file from PDF data, the text and images from the PDF should first be extracted. I can then store that extracted data into my object and use it to generate the Excel file in the required format.