Need to parse a PDF and convert the PDF to Excel

My requirement is as follows:
I have a PDF that contains text, along with images and tables. I need to create an Excel file based on this PDF, where the data is organized exactly as it appears in the PDF.

Is there a solution provided by Aspose that allows me to extract text, images, and tables from the PDF and replicate them in the Excel file?

@gaurav.k

Aspose.PDF can extract both text and images. However, they need to be extracted in a specific way before you can use them to build an Excel file with Aspose.Cells, and this particular scenario needs investigation on our side. Alternatively, if it suits you, you can generate the Excel file directly from the source PDF using the built-in conversion feature. Please share a sample source PDF and the expected output Excel file so that we can analyze the scenario accordingly.

Dear Asad,

I have attached a sample source document with the expected output. Please check and let me know if it is possible to fetch data like this using Aspose. The input is a PDF file, and the expected output can be seen in the attached Excel file.

AsposeSampleSourceFile.zip (445.3 KB)

@gaurav.k

Could you please check the attached Excel file, which we generated in our environment using the code snippet below, and let us know whether it works for you?

// Load the source PDF
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(dataDir + "AsposeDemoPDF_00.pdf");
Aspose.Pdf.ExcelSaveOptions exl = new Aspose.Pdf.ExcelSaveOptions();

//exl.ConversionEngine = ExcelSaveOptions.ConversionEngines.NewEngine;
// Write XLSX, with no leading blank column and as few worksheets as possible
exl.Format = Aspose.Pdf.ExcelSaveOptions.ExcelFormat.XLSX;
exl.InsertBlankColumnAtFirst = false;
exl.MinimizeTheNumberOfWorksheets = true;
// Convert and save
pdf.Save(dataDir + "AsposeDemoPDF_00.xlsx", exl);

AsposeDemoPDF_00.zip (25.4 KB)

Hello Asad,

I tried that code, and it does convert the PDF to Excel. However, I have two questions.
First, the code does not convert the entire PDF; it seems to process only 3 to 5 pages.
Second, if I need to manipulate the converted Excel file, such as removing the header or footer, is there an option to do that as well?

Thanks and regards
Gaurav

@gaurav.k

You are seeing this limitation because you are using the trial version. You can use a free 30-day temporary license to remove it. Please check the article below on setting a license before conversion:

To manipulate the converted Excel file, for example to remove the header or footer, you will need to use Aspose.Cells. You can post an inquiry in the Aspose.Cells forum category, where you will be assisted accordingly.
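For the license step mentioned above, a minimal sketch could look like the following. The license file name and the helper class name are assumptions for illustration; use the path of the temporary license file you actually receive.

```csharp
public static class LicenseSetup
{
    // Hypothetical helper: call once at application start-up, before any conversion.
    public static void Apply()
    {
        var license = new Aspose.Pdf.License();

        // "Aspose.PDF.lic" is an assumed file name; replace it with your own license path.
        license.SetLicense("Aspose.PDF.lic");
    }
}
```

Once the license is applied, the same conversion code should process the whole document instead of only the first few pages.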

Okay Mr Ali,

I will try with a free temporary license.
If I have any queries, I will reply in the same thread.

A post was split to a new topic: Unable to receive temporary license after applying it

Hello Asad,

I still have not received any update regarding my temporary license.

You split that topic:
A post was split to a new topic: Unable to receive temporary license after applying it

I am waiting for an update.

@gaurav.k

We have moved the inquiry to the Purchase forum, which is the right place for such questions. We have raised your concerns internally, and the relevant team will update you shortly in the dedicated thread. We apologize for any trouble you may have faced.

Hello Asad,

I tried using a temporary license, and we successfully converted the PDF to Excel. However, it would be very helpful if I could fetch all the PDF data and generate the Excel file according to my requirements. Currently, the PDF is converted directly to Excel, but I want to build the Excel file myself from the extracted data.

@gaurav.k

You will need to use Aspose.Cells to create Excel files manually from scratch, as Aspose.PDF is not built for that. You can extract text, images, and tables from PDF documents using Aspose.PDF and then use Aspose.Cells to create the Excel file as per your needs. Please check the documentation articles below:
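As a rough illustration of combining the two libraries, the sketch below reads tables with Aspose.PDF's `TableAbsorber` and writes each cell into a worksheet with Aspose.Cells. The class name, method name, and file paths are assumptions, and the layout (one worksheet, one blank row between tables) is just one possible choice; table detection may also need tuning for a given document.

```csharp
using System.Text;
using Aspose.Pdf.Text;

public static class PdfTablesToExcel
{
    // Hypothetical helper: copies every table found in the PDF into the first worksheet.
    public static void Run(string pdfPath, string xlsxPath)
    {
        using (var pdf = new Aspose.Pdf.Document(pdfPath))
        {
            var workbook = new Aspose.Cells.Workbook();
            var sheet = workbook.Worksheets[0];
            int excelRow = 0;

            foreach (Aspose.Pdf.Page page in pdf.Pages)
            {
                // Detect the tables on this page
                var absorber = new TableAbsorber();
                absorber.Visit(page);

                foreach (AbsorbedTable table in absorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        int excelCol = 0;
                        foreach (AbsorbedCell cell in row.CellList)
                        {
                            // A cell may hold several text fragments; join them
                            var text = new StringBuilder();
                            foreach (TextFragment fragment in cell.TextFragments)
                            {
                                text.Append(fragment.Text);
                            }

                            sheet.Cells[excelRow, excelCol].PutValue(text.ToString());
                            excelCol++;
                        }
                        excelRow++;
                    }
                    excelRow++; // leave one blank row between tables
                }
            }

            // The output format is inferred from the .xlsx extension
            workbook.Save(xlsxPath);
        }
    }
}
```

Because the worksheet is filled cell by cell, this is the place to skip headers or footers, reorder columns, or apply any other custom rule.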

Hello Asad,

When converting a PDF to Excel, I sometimes want the conversion to start from a specific PDF page, for example page 3 or page 4. Is that possible?

The PDF has 20 pages, and I want it to start from page number 4 and continue to the last page.

@gaurav.k

You will need to split the PDF document: extract the required pages into a new Document instance and then convert that instance to an Excel file. Please check the articles below on extracting PDF pages and creating a new document:
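The approach described above can be sketched as follows for the "page 4 to the last page" case. The class name, method name, and parameters are assumptions for illustration; note that page indexes in Aspose.PDF start at 1.

```csharp
public static class PartialPdfToExcel
{
    // Hypothetical helper: converts pages firstPage..last of the source PDF to XLSX.
    public static void Convert(string pdfPath, string xlsxPath, int firstPage)
    {
        using (var source = new Aspose.Pdf.Document(pdfPath))
        using (var subset = new Aspose.Pdf.Document())
        {
            // Copy the wanted pages into a new, initially empty document
            for (int pageNumber = firstPage; pageNumber <= source.Pages.Count; pageNumber++)
            {
                subset.Pages.Add(source.Pages[pageNumber]);
            }

            var options = new Aspose.Pdf.ExcelSaveOptions
            {
                Format = Aspose.Pdf.ExcelSaveOptions.ExcelFormat.XLSX
            };

            // Convert only the copied pages
            subset.Save(xlsxPath, options);
        }
    }
}
```

Calling `PartialPdfToExcel.Convert("input.pdf", "output.xlsx", 4)` on a 20-page document would convert pages 4 through 20.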

Hi Asad,

Can I extract text and images from a PDF? I need to extract images while maintaining their position as indicated in the PDF, meaning the images should be extracted in the same order and placement, with any text following the respective image.

@gaurav.k

Yes, you can extract both text and images from a PDF along with their positions and coordinates. Please check the example below for more details:

// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void Extract()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open the document
    using (var document = new Aspose.Pdf.Document(dataDir + "SearchAndGetTextFromAll.pdf"))
    {

        // Create a TextFragmentAbsorber with no search phrase so that it collects every text fragment
        var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();

        // Accept the absorber for all the pages
        document.Pages.Accept(textFragmentAbsorber);

        // Get the extracted text fragments
        var textFragmentCollection = textFragmentAbsorber.TextFragments;

        // Loop through the fragments
        foreach (var textFragment in textFragmentCollection)
        {
            Console.WriteLine("Text : {0} ", textFragment.Text);
            Console.WriteLine("Position : {0} ", textFragment.Position);
            Console.WriteLine("XIndent : {0} ", textFragment.Position.XIndent);
            Console.WriteLine("YIndent : {0} ", textFragment.Position.YIndent);
        }
    }
}

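However, please note that a position (X, Y) in a PDF document can mean something different from a position in other file formats such as Word and Excel. For example, PDF coordinates are by default measured in points from the bottom-left corner of the page, whereas Excel places content by row and column.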
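Image positions can be read in a similar way. The sketch below uses `ImagePlacementAbsorber` page by page and prints the bounding rectangle of each image; the class name, method name, and file path are assumptions for illustration.

```csharp
using System;

public static class ImagePositions
{
    // Hypothetical helper: lists where each image sits on each page.
    public static void Print(string pdfPath)
    {
        using (var document = new Aspose.Pdf.Document(pdfPath))
        {
            int pageNumber = 1;
            foreach (Aspose.Pdf.Page page in document.Pages)
            {
                // Collect the image placements on this page
                var absorber = new Aspose.Pdf.ImagePlacementAbsorber();
                page.Accept(absorber);

                foreach (Aspose.Pdf.ImagePlacement placement in absorber.ImagePlacements)
                {
                    // Rectangle values are PDF coordinates (lower-left and upper-right corners)
                    Console.WriteLine("Page {0}: LLX={1}, LLY={2}, URX={3}, URY={4}",
                        pageNumber,
                        placement.Rectangle.LLX, placement.Rectangle.LLY,
                        placement.Rectangle.URX, placement.Rectangle.URY);
                }
                pageNumber++;
            }
        }
    }
}
```

The rectangle values use the same coordinate space as the text fragment positions, so the two can be compared directly.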

If I extract text and images separately, how can I identify which text belongs to which image?
I need to extract them sequentially so I can print them into Excel.

@gaurav.k

This is something we need to investigate for feasibility, because elements in PDF documents are stored differently. There needs to be some association between the text and the images in order to achieve this kind of extraction. Could you please share one of your sample files so that we can log an investigation ticket in our issue management system, carry out the analysis, and address it accordingly?
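Until such an association is available, one possible heuristic (not an Aspose feature, and not guaranteed to match the visual layout of every PDF) is to collect the extracted text fragments and images into a single list and sort them by page and position. The sketch below is plain C# 9 with no Aspose dependency; `PageItem` and `ReadingOrder` are hypothetical names, and it assumes PDF-style coordinates where Y grows upward.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one extracted element (text fragment or image); not an Aspose type.
public record PageItem(int Page, double X, double Y, string Kind, string Content);

public static class ReadingOrder
{
    // Orders items by page, then top to bottom (larger Y first), then left to right.
    public static List<PageItem> Sort(IEnumerable<PageItem> items)
    {
        return items
            .OrderBy(item => item.Page)
            .ThenByDescending(item => item.Y)
            .ThenBy(item => item.X)
            .ToList();
    }
}
```

With items sorted this way, text that sits directly below an image on the same page comes right after that image in the list, which is the order needed when writing rows into Excel. Multi-column layouts would need extra handling.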

I previously shared a demo file; it contains a demo PDF and the corresponding output Excel file. To generate a similar Excel file from PDF data, the text and images must first be extracted from the PDF. I can then store the extracted data in my own object and use it to generate the Excel file in the required format.

AsposeSampleSourceFile.zip (445.3 KB)