Need to Parse a PDF and Convert It to Excel

gaurav.k · December 20, 2024, 4:37am

My requirement is as follows:
I have a PDF that contains text, along with images and tables. I need to create an Excel file based on this PDF, where the data is organized exactly as it appears in the PDF.

Is there a solution provided by Aspose that allows me to extract text, images, and tables from the PDF and replicate them in the Excel file?

asad.ali · December 20, 2024, 11:15pm

@gaurav.k

Aspose.PDF provides features to extract text as well as images. However, you would need to extract them in a specific manner that helps you generate the Excel file using Aspose.Cells, and that approach needs investigation. Alternatively, you can generate the Excel file directly from the source PDF using the conversion capability of Aspose.PDF, if that suits you. Please share a sample source PDF and the expected output Excel file so that we can analyze the scenario accordingly.

gaurav.k · December 23, 2024, 5:58am

Dear Asad,

I have attached a sample source document with the expected output. Please check and let me know if it is possible to fetch data like this using Aspose. The input is a PDF file, and the expected output can be seen in the attached Excel file.

AsposeSampleSourceFile.zip (445.3 KB)

asad.ali · December 23, 2024, 4:56pm

@gaurav.k

Could you please check the attached Excel file, which we generated in our environment using the code snippet below, and let us know if it works for you?

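// Load the source PDF (dataDir is the folder that contains the sample file)
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(dataDir + "AsposeDemoPDF_00.pdf");
Aspose.Pdf.ExcelSaveOptions exl = new Aspose.Pdf.ExcelSaveOptions();

//exl.ConversionEngine = ExcelSaveOptions.ConversionEngines.NewEngine;
// Write XLSX, with no leading blank column and as few worksheets as possible
exl.Format = Aspose.Pdf.ExcelSaveOptions.ExcelFormat.XLSX;
exl.InsertBlankColumnAtFirst = false;
exl.MinimizeTheNumberOfWorksheets = true;

// Convert the PDF by saving it with the Excel options
pdf.Save(dataDir + "AsposeDemoPDF_00.xlsx", exl);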

AsposeDemoPDF_00.zip (25.4 KB)

gaurav.k · December 24, 2024, 4:38am

Hello Asad,

I tried that code, and it is working to convert the PDF to Excel.
First, the code is not able to convert the entire PDF; I think it works on only 3 to 5 pages.
Second, if I need to manipulate the converted Excel file, such as removing the header or footer, is there an option to do that as well?

Thanks and regards
Gaurav

asad.ali · December 24, 2024, 3:00pm

@gaurav.k

You are facing this limitation because of the trial version. You can use a free 30-day temporary license to remove it. Please check the article below on setting a license before conversion:

Aspose PDF License|Aspose.PDF for .NET
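For reference, applying a license usually takes two lines that run once, before any Document is created. The following is a minimal sketch, not code from the linked article: the class name LicenseSetup and the license file name are assumptions for illustration, so use the path of the .lic file you actually receive.

```csharp
using System;

public static class LicenseSetup
{
    // Call once at application start, before creating any Aspose.Pdf.Document.
    public static void Apply()
    {
        var license = new Aspose.Pdf.License();

        // Hypothetical file name: replace with the path to your own license file.
        license.SetLicense("Aspose.PDF.NET.lic");

        Console.WriteLine("Aspose.PDF license applied.");
    }
}
```

If the license is not applied, the library keeps running in evaluation mode, which is what causes the page limit described above.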

To manipulate the converted Excel file, for example to remove a header or footer, you will need to use Aspose.Cells. You can create an inquiry in the Aspose.Cells forum category, where you will be assisted accordingly.

gaurav.k · January 2, 2025, 5:18am

Okay Mr Ali,

I will try with a free temporary license.
If I have any queries, I will reply in the same thread.

asad.ali · January 2, 2025, 8:26pm

A post was split to a new topic: Unable to receive temporary license after applying it

gaurav.k · January 6, 2025, 6:54am

Hello Asad,

I am still not getting any update regarding my temporary license.

You split that request into a new topic ("Unable to receive temporary license after applying it"), and I am still waiting for an update there.

asad.ali · January 6, 2025, 1:53pm

@gaurav.k

We have moved the inquiry to the Purchase forum, which is the right place for such questions. We have raised your concerns internally, and the relevant team will update you shortly in the dedicated thread. We apologize for any trouble you may have faced.

gaurav.k · January 8, 2025, 9:11am

Hello Asad,

I tried using a temporary license, and we successfully converted the PDF to Excel. However, it would be very helpful if I could fetch all the PDF data and generate the Excel file according to my requirements. Currently, it directly converts the PDF to Excel, but I want to manually generate the Excel file.

asad.ali · January 8, 2025, 1:56pm

@gaurav.k

You will need to use Aspose.Cells to create Excel files manually from scratch, as Aspose.PDF is not built to create them. You can extract text, images, and tables from PDF documents using Aspose.PDF and then use Aspose.Cells to create the Excel file as per your needs. Please check the helpful documentation articles below:
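As a rough illustration of that two-library approach, the sketch below reads tables with Aspose.PDF's TableAbsorber and writes each cell into a worksheet with Aspose.Cells. It is only one possible layout (one sheet, tables stacked vertically with a blank row between them); the class name TablesToExcel and the file paths are assumptions, and both libraries need their own license to avoid evaluation limits.

```csharp
using System.Text;
using Aspose.Pdf.Text;

public static class TablesToExcel
{
    // pdfPath and xlsxPath are caller-supplied; nothing here is specific to the sample file.
    public static void Run(string pdfPath, string xlsxPath)
    {
        var workbook = new Aspose.Cells.Workbook();
        var sheet = workbook.Worksheets[0];
        int excelRow = 0;

        using (var document = new Aspose.Pdf.Document(pdfPath))
        {
            foreach (Aspose.Pdf.Page page in document.Pages)
            {
                // Detect the tables on this page
                var absorber = new TableAbsorber();
                absorber.Visit(page);

                foreach (AbsorbedTable table in absorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        int excelColumn = 0;
                        foreach (AbsorbedCell cell in row.CellList)
                        {
                            // A cell can hold several text fragments; join them
                            var text = new StringBuilder();
                            foreach (TextFragment fragment in cell.TextFragments)
                            {
                                text.Append(fragment.Text);
                            }

                            sheet.Cells[excelRow, excelColumn].PutValue(text.ToString());
                            excelColumn++;
                        }
                        excelRow++;
                    }

                    // Leave one blank row between consecutive tables
                    excelRow++;
                }
            }
        }

        // The .xlsx extension of xlsxPath selects the output format
        workbook.Save(xlsxPath);
    }
}
```

Because the worksheet is built cell by cell, this is the place to skip headers and footers, rename columns, or apply any other layout rule before saving.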

gaurav.k · January 9, 2025, 4:26am

Hello Asad,

As I am converting a PDF to Excel, sometimes I want the conversion to start from a specific PDF page number. For example, sometimes I want it to start from page number 3. Can I start the conversion from PDF page number 4?

gaurav.k · January 9, 2025, 4:27am

The PDF has 20 pages, and I want it to start from page number 4 and continue to the last page.

asad.ali · January 9, 2025, 12:18pm

@gaurav.k

You will need to split the PDF document: extract the required pages into a new Document instance and then convert that document to Excel. Please check the article below on extracting PDF pages and creating a new document:

Move PDF Pages programmatically C#|Aspose.PDF for .NET
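A minimal sketch of that split-then-convert step, assuming you want everything from a given page to the end of the document. The class name PageRangeToExcel, the parameter names, and the file paths are illustrative, not taken from the linked article.

```csharp
public static class PageRangeToExcel
{
    // Copies pages firstPage..last into a new document and converts that document to XLSX.
    public static void Run(string pdfPath, string xlsxPath, int firstPage)
    {
        using (var source = new Aspose.Pdf.Document(pdfPath))
        using (var subset = new Aspose.Pdf.Document())
        {
            // Page numbers in Aspose.PDF start at 1
            for (int pageNumber = firstPage; pageNumber <= source.Pages.Count; pageNumber++)
            {
                subset.Pages.Add(source.Pages[pageNumber]);
            }

            var options = new Aspose.Pdf.ExcelSaveOptions
            {
                Format = Aspose.Pdf.ExcelSaveOptions.ExcelFormat.XLSX
            };

            subset.Save(xlsxPath, options);
        }
    }
}
```

For the 20-page case described above, calling PageRangeToExcel.Run("input.pdf", "output.xlsx", 4) would convert pages 4 through 20.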

gaurav.k · January 15, 2025, 3:39am

Hi Asad,

Can I extract text and images from a PDF? I need to extract images while maintaining their position as indicated in the PDF, meaning the images should be extracted in the same order and placement, with any text following the respective image.

asad.ali · January 15, 2025, 2:20pm

@gaurav.k

Yes, you can extract both text and images from a PDF, along with their positions and coordinates. Please check below for more details:

// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void Extract()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open the document
    using (var document = new Aspose.Pdf.Document(dataDir + "SearchAndGetTextFromAll.pdf"))
    {

        // Create a TextFragmentAbsorber with no search phrase so that it collects every text fragment
        var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();

        // Accept the absorber for all the pages
        document.Pages.Accept(textFragmentAbsorber);

        // Get the extracted text fragments
        var textFragmentCollection = textFragmentAbsorber.TextFragments;

        // Loop through the fragments
        foreach (var textFragment in textFragmentCollection)
        {
            Console.WriteLine("Text : {0} ", textFragment.Text);
            Console.WriteLine("Position : {0} ", textFragment.Position);
            Console.WriteLine("XIndent : {0} ", textFragment.Position.XIndent);
            Console.WriteLine("YIndent : {0} ", textFragment.Position.YIndent);
        }
    }
}

https://docs.aspose.com/pdf/net/working-with-images/

However, please note that positions (X, Y) in PDF documents can mean something different from positions in other file formats such as Word and Excel.

gaurav.k · January 16, 2025, 3:42am

If I extract text and images separately, how can I identify which text belongs to which image?
I need to extract them sequentially so I can print them into Excel.

asad.ali · January 16, 2025, 2:19pm

@gaurav.k

This is something we need to investigate for feasibility, because the elements in PDF documents are stored differently. There must be some association between the text and the images in order to achieve this kind of extraction. Can you please share one of your sample files so that we can log an investigation ticket in our issue management system, carry out the analysis, and address it accordingly?
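In the meantime, one possible workaround (a heuristic, not a confirmed Aspose feature) is to collect text fragments and image placements together with their rectangles and then sort everything into reading order: by page, then top to bottom, then left to right. The sketch below shows only the sorting step in plain C#. PageItem and ReadingOrder are hypothetical helper types; the Top and Left values would typically come from the URY and LLX of TextFragment.Rectangle and ImagePlacement.Rectangle. PDF coordinates start at the bottom-left corner of the page, so a larger Y value is higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one extracted element (text fragment or image).
public sealed class PageItem
{
    public int Page;        // 1-based page number
    public double Top;      // upper edge (URY) in PDF points
    public double Left;     // left edge (LLX) in PDF points
    public string Kind;     // "text" or "image"
    public string Content;  // the text itself, or a saved image file name
}

public static class ReadingOrder
{
    // Sorts by page, then top to bottom, then left to right.
    // Items whose Top values round into the same lineTolerance bucket count as one line.
    public static List<PageItem> Sort(IEnumerable<PageItem> items, double lineTolerance = 2.0)
    {
        return items
            .OrderBy(i => i.Page)
            .ThenByDescending(i => Math.Round(i.Top / lineTolerance))
            .ThenBy(i => i.Left)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample values, listed out of order on purpose
        var items = new List<PageItem>
        {
            new PageItem { Page = 1, Top = 650, Left = 50, Kind = "text",  Content = "Caption A" },
            new PageItem { Page = 1, Top = 700, Left = 50, Kind = "image", Content = "img1.png" },
            new PageItem { Page = 1, Top = 720, Left = 50, Kind = "text",  Content = "Heading" }
        };

        foreach (var item in ReadingOrder.Sort(items))
        {
            Console.WriteLine("{0}: {1}", item.Kind, item.Content);
        }
    }
}
```

With the sample values, the heading is listed first, then the image, then the caption below it. This only approximates reading order: multi-column layouts or text that wraps around an image would need extra rules, which is why an association between text and images cannot be guaranteed in general.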

gaurav.k · January 16, 2025, 2:38pm

As I previously shared a demo file, you can see a demo PDF and the corresponding output Excel file. If I need to generate a similar Excel file from PDF data, the text and images from the PDF should first be extracted. I can then store that extracted data into my object and use it to generate the Excel file in the required format.

AsposeSampleSourceFile.zip (445.3 KB)