Need to parse a PDF and convert PDF to Excel

gaurav.k · January 15, 2025, 3:39am

Hi Asad,

Can I extract text and images from a PDF? I need to extract images while maintaining their position as indicated in the PDF, meaning the images should be extracted in the same order and placement, with any text following the respective image.

asad.ali · January 15, 2025, 2:20pm

@gaurav.k

Yes, you can extract both Text and Images from PDF with position and coordinates. Please check below for more details:

// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void Extract()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open the document
    using (var document = new Aspose.Pdf.Document(dataDir + "SearchAndGetTextFromAll.pdf"))
    {

        // Create a TextFragmentAbsorber without a search phrase so that it collects every text fragment
        var textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();

        // Accept the absorber for all the pages
        document.Pages.Accept(textFragmentAbsorber);

        // Get the extracted text fragments
        var textFragmentCollection = textFragmentAbsorber.TextFragments;

        // Loop through the fragments
        foreach (var textFragment in textFragmentCollection)
        {
            Console.WriteLine("Text : {0} ", textFragment.Text);
            Console.WriteLine("Position : {0} ", textFragment.Position);
            Console.WriteLine("XIndent : {0} ", textFragment.Position.XIndent);
            Console.WriteLine("YIndent : {0} ", textFragment.Position.YIndent);
        }
    }
}

https://docs.aspose.com/pdf/net/working-with-images/

However, please note that a position (X,Y) in a PDF document can mean something different from a position in other file formats such as Word and Excel.
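
For the image side, the following is only a minimal sketch, assuming the `ImagePlacementAbsorber` class from Aspose.PDF for .NET. The `ImagePositions` class, the `Dump` method, its two parameters, and the output file names are hypothetical and chosen for illustration. It prints the rectangle of each image on each page and saves the image data to a file.

```csharp
using System;
using System.IO;
using Aspose.Pdf;

public static class ImagePositions
{
    // pdfPath and outputDir are hypothetical parameters for this sketch.
    public static void Dump(string pdfPath, string outputDir)
    {
        using (var document = new Document(pdfPath))
        {
            // Page indexes in Aspose.PDF start at 1
            for (int pageNumber = 1; pageNumber <= document.Pages.Count; pageNumber++)
            {
                // Collect the image placements on this page only,
                // so that each image can be tied to its page number
                var absorber = new ImagePlacementAbsorber();
                document.Pages[pageNumber].Accept(absorber);

                int imageIndex = 0;
                foreach (ImagePlacement placement in absorber.ImagePlacements)
                {
                    imageIndex++;

                    // Rectangle is expressed in PDF page coordinates
                    Console.WriteLine("Page : {0}, Image : {1}", pageNumber, imageIndex);
                    Console.WriteLine("LLX : {0}, LLY : {1}", placement.Rectangle.LLX, placement.Rectangle.LLY);
                    Console.WriteLine("Width : {0}, Height : {1}", placement.Rectangle.Width, placement.Rectangle.Height);

                    // Assumption: Save(Stream) writes JPEG data; verify this for your version
                    string target = Path.Combine(outputDir, $"page{pageNumber}_image{imageIndex}.jpg");
                    using (var stream = File.Create(target))
                    {
                        placement.Image.Save(stream);
                    }
                }
            }
        }
    }
}
```

Processing one page at a time is a design choice here: it keeps the page number available next to each rectangle, which is needed later if the images are to be matched with text from the same page.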

gaurav.k · January 16, 2025, 3:42am

If I extract text and images separately, how can I identify which text belongs to which image?
I need to extract them sequentially so I can print them into Excel.

asad.ali · January 16, 2025, 2:19pm

@gaurav.k

This is something we need to investigate for feasibility, because the elements in PDF documents are stored differently. There has to be some association between the text and the images to achieve this kind of extraction. Can you please share one of your sample files so that we can log an investigation ticket in our issue management system, carry out the analysis, and address it accordingly?

gaurav.k · January 16, 2025, 2:38pm

As I previously shared a demo file, you can see a demo PDF and the corresponding output Excel file. If I need to generate a similar Excel file from PDF data, the text and images from the PDF should first be extracted. I can then store that extracted data into my object and use it to generate the Excel file in the required format.

AsposeSampleSourceFile.zip (445.3 KB)

gaurav.k · January 16, 2025, 3:27pm

If you are available, we can arrange a meeting. I can show you the actual requirements.

asad.ali · January 16, 2025, 7:18pm

@gaurav.k

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-59057

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

gaurav.k · January 17, 2025, 3:25am

How can I track the status of this ticket, PDFNET-59057?

gaurav.k · January 17, 2025, 6:41am

PdfTable.png (76.7 KB)

ExtrectedOutputUsingAsposeTableMethod.png (19.2 KB)

I am trying to extract table data from a PDF, where the PDF contains only 3 columns, but when I fetch the data, it is showing 4 columns instead. There are several issues like this. For reference, I have attached images where you can see that ‘PdfTable’ represents my data in the PDF, and ‘ExtractedOutputUsingAsposeTableMethod’ shows the data I am getting using the table method of Aspose.

asad.ali · January 17, 2025, 6:32pm

@gaurav.k

You can check the status of the ticket at the bottom of this forum thread where it is attached. Additionally, we will also update you as soon as we make some progress towards ticket resolution.

Are you using the API with a valid license? Please try using a 30-day free temporary license and let us know if you still notice this issue.

gaurav.k · January 27, 2025, 4:09am

What’s the status of this ticket, PDFNET-59057?

asad.ali · January 27, 2025, 11:21am

@gaurav.k

The ticket has recently been logged in our issue management system. It will be prioritized on a first-come, first-served basis as per the free support policies. As soon as we make some progress towards its resolution, we will inform you. Please be patient and spare us some time.

We are sorry for the inconvenience.

gaurav.k · March 10, 2025, 10:30am

Any update, Mr. Ali?

asad.ali · March 10, 2025, 4:29pm

@gaurav.k

We are afraid that the earlier logged ticket has not yet been resolved. We will surely inform you via this forum thread once we have definite news about its fix ETA. Please spare us some time.

We are sorry for the inconvenience.

gaurav.k · March 12, 2025, 3:32am

Hello Mr. Ali,

Is there a way to extract images from a PDF and place them in an Excel file while maintaining their original positions relative to the text? If so, is there a solution for this?

asad.ali · March 12, 2025, 4:30pm

@gaurav.k

You can extract images from a PDF independently using the API. It gives you their position as well as their dimensions, which you can use to add them to an Excel file. Please note that the position of objects in a PDF file may differ from their position in the Excel file, because the PDF and Excel file formats are quite different from one another. To add images to an Excel file, you can check the Aspose.Cells API.
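
Once positions are available for both text and images, one possible way to get a sequential order for Excel is to sort everything by page and then from top to bottom. The following is only a sketch of that idea in plain C#. The `PageItem` record, the `ItemKind` enum, and the `ReadingOrder` class are hypothetical helpers and not part of any Aspose API. It assumes Y is the top edge of each element in PDF coordinates, where a larger Y is higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one extracted element; not an Aspose type.
public enum ItemKind { Text, Image }

public sealed record PageItem(ItemKind Kind, int Page, double X, double Y, string Label);

public static class ReadingOrder
{
    // Assumption: the page origin is at the bottom-left, so a larger Y
    // means "higher on the page". Sort by page, then top to bottom,
    // then left to right.
    public static List<PageItem> Sort(IEnumerable<PageItem> items) =>
        items.OrderBy(i => i.Page)
             .ThenByDescending(i => i.Y)
             .ThenBy(i => i.X)
             .ToList();
}

public static class Program
{
    public static void Main()
    {
        // Sample values invented for illustration only
        var items = new List<PageItem>
        {
            new PageItem(ItemKind.Text, 1, 50, 400, "caption under image"),
            new PageItem(ItemKind.Image, 1, 50, 700, "image-1"),
            new PageItem(ItemKind.Text, 2, 50, 750, "first line of page 2"),
        };

        var ordered = ReadingOrder.Sort(items);

        // The index in the sorted list can serve as the target Excel row
        for (int row = 0; row < ordered.Count; row++)
        {
            Console.WriteLine("Row {0} : {1} {2}", row, ordered[row].Kind, ordered[row].Label);
        }
    }
}
```

A plain sort like this treats the page as a single column. Multi-column layouts or tables would need extra grouping, for example by X ranges, before the rows are assigned.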

gaurav.k · March 21, 2025, 4:12am

Hello MR Ali,

My temporary license has expired. I need more time to investigate extracting images and tables. Can you renew my temporary license for some more time?

asad.ali · March 21, 2025, 6:24pm

@gaurav.k

You can post a renewal or extension request in our Purchase Forum and they will assist you there accordingly.

gaurav.k · March 31, 2025, 3:52am

I am trying to convert a PDF to Excel, but the file is not being saved and no error is shown. Below is my code. Please check the issue and correct it.

Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(excelPath);

pdf.Pages.Delete(startPage - 1);

ExcelSaveOptions exl = new ExcelSaveOptions
{
    Format = ExcelSaveOptions.ExcelFormat.XLSX,
    InsertBlankColumnAtFirst = false,
    MinimizeTheNumberOfWorksheets = true
};
pdf.Save(excelPath, exl);
pdf.Dispose();

asad.ali · March 31, 2025, 5:58pm

@gaurav.k

Do you mean that the program hangs? Can you please share the sample PDF?