Is it possible to extract only part of a PDF into an Excel file?

Hi There,

Does Aspose.PDF have the ability to extract part of a PDF into Excel based on a position/point?

Requirements:
Input: the top-left and bottom-right positions/points of the region, plus the page number
Output: an Excel file containing the data between the given coordinates

Cheers, Bhavin

@james.simpson

Thanks for contacting support.

We regret that the required functionality is not available at the moment. However, could you please provide a sample document along with a few more details about your requirements? We will investigate the scenario and share our feedback with you.

Hi @asad.ali,

I want to extract the tabular data that lies between the following two “text” instances:

Top text: “Reference:”
Bottom text: “Total Due To/(From) Person X”

The output should be an Excel file like image.png (7.8 KB)

Input2ForAspose.pdf (43.2 KB)

@james.simpson

Currently, Aspose.PDF can convert a particular region of a PDF page into an image. For your particular scenario, however, we have logged an investigation ticket as PDFNET-48182 in our issue tracking system. We will investigate whether this functionality is feasible and keep you posted on the status of the ticket. Please allow us some time.

We are sorry for the inconvenience.

@james.simpson

We have further investigated the earlier-logged ticket and found a workaround: clear all text that lies outside the target rectangle, then save the document as Excel, so the output contains only the text inside the rectangle. Please see the code snippet below.

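using Aspose.Pdf;
using Aspose.Pdf.Text;

Document document = new Document("Input2ForAspose.pdf");
Page page = document.Pages[1]; // page numbering starts at 1
// Region to keep: full page width, Y from 300 to 560 (llx, lly, urx, ury).
Rectangle pageRect = new Rectangle(0, 300, page.Rect.Width, 560);
// Collect every text fragment on the page.
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
page.Accept(absorber);
foreach (TextFragment fragment in absorber.TextFragments)
{
    // Intersect returns null when the fragment lies entirely outside the region.
    if (fragment.Rectangle.Intersect(pageRect) == null)
    {
        fragment.Text = "";
    }
}
// Save what remains as an Excel workbook.
document.Save("out.xlsx", new ExcelSaveOptions());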
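The rectangle above is hard-coded for the sample file. Since the request was to extract the data between the texts “Reference:” and “Total Due To/(From) Person X”, the bounds could instead be derived from those two markers. The following is only a rough sketch, not a confirmed solution: the class name and the FindFirst helper are hypothetical, and it assumes that each marker is found as a single phrase on page 1 and that the table spans the full page width.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class RegionToExcelSketch // hypothetical name, for illustration only
{
    // Hypothetical helper: returns the first fragment on the page that
    // matches the phrase, or null if the phrase is not found.
    static TextFragment FindFirst(Page page, string phrase)
    {
        TextFragmentAbsorber absorber = new TextFragmentAbsorber(phrase);
        page.Accept(absorber);
        foreach (TextFragment fragment in absorber.TextFragments)
        {
            return fragment;
        }
        return null;
    }

    static void Main()
    {
        Document document = new Document("Input2ForAspose.pdf");
        Page page = document.Pages[1];

        TextFragment top = FindFirst(page, "Reference:");
        TextFragment bottom = FindFirst(page, "Total Due To/(From) Person X");
        if (top == null || bottom == null)
        {
            Console.WriteLine("Marker text not found; nothing exported.");
            return;
        }

        // PDF Y coordinates grow upward, so the top marker has the larger Y.
        // The region runs from the bottom edge of the lower marker to the
        // top edge of the upper marker, so both marker lines are included.
        Rectangle region = new Rectangle(
            0, bottom.Rectangle.LLY, page.Rect.Width, top.Rectangle.URY);

        // Same workaround as above: clear every fragment outside the region.
        TextFragmentAbsorber all = new TextFragmentAbsorber();
        page.Accept(all);
        foreach (TextFragment fragment in all.TextFragments)
        {
            if (fragment.Rectangle.Intersect(region) == null)
            {
                fragment.Text = "";
            }
        }

        document.Save("out.xlsx", new ExcelSaveOptions());
    }
}
```

If the marker lines themselves should not appear in the output, the region could be narrowed to run from bottom.Rectangle.URY to top.Rectangle.LLY instead. If a marker is stored in the PDF as several separate fragments, the phrase search may return nothing, in which case the hard-coded rectangle remains the fallback.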

Also, please note that we found two other issues in the output document:

  1. The second column is merged with the third.
  2. The cell on the first row is shifted left.

To fix these errors, issues PDFNET-48677 and PDFNET-48678 have been created. We will inform you as soon as we have additional updates in this regard.

The issues you have found earlier (filed as PDFNET-48678) have been fixed in Aspose.PDF for .NET 22.7.