I am trying to read data from a table in a PDF document whose columns are separated by tabs rather than drawn as bordered cells. Table_format.jpg (92.0 KB). However, TableAbsorber is unable to read the content of the table. Can someone help me read the data from this table?
@sakalasiva
This capability was not originally provided.
But additional features may have been added to the library since then. If I had the file, I would try setting:
tableAbsorber.TextSearchOptions.Rectangle
// and
tableAbsorber.UseFlowEngine = true;
If this does not work, you can do it yourself: get the text within a given rectangle, then split it into columns by tabs and into rows by the Y values of the found fragments.
Hi, I tried what you suggested; thank you for that. However, I am having trouble limiting the search area to get better results: the rectangle value does not seem to be honored when the document is scanned. My code is below. Can you tell me what I am doing wrong? Thanks in advance.
TableAbsorber tableAbsorber = new TableAbsorber();
tableAbsorber.TextSearchOptions = new TextSearchOptions(new Rectangle(20, 30, 40, 70));
tableAbsorber.TextSearchOptions.LimitToPageBounds = true;
tableAbsorber.UseFlowEngine = true;
tableAbsorber.Visit(((PDFDocument)pDFDocument).PdfReader.Pages[criteria.Page]);
ICollection collection = tableAbsorber.TableList?.Select((Func<AbsorbedTable, IPDFTable>)((x) => new AsposePDFTable
{
    PageNo = criteria.Page,
    PdfTable = x
})).ToList();
@sakalasiva
Instead of
tableAbsorber.TextSearchOptions = new TextSearchOptions(new Rectangle(20, 30, 40, 70));
you can try
tableAbsorber.TextSearchOptions.Rectangle = new Rectangle(20, 30, 40, 70);
But the first form should also work in your version.
Please attach the document you used so that we can check and reproduce this error.
SalesAndTaxes.pdf (278.4 KB)
Hi, I am attaching the sample document for testing. Can you check it and let me know what I can do to get this working? I am interested in reading the description of property column.
TableAbsorber tableAbsorber = new TableAbsorber();
tableAbsorber.TextSearchOptions = new TextSearchOptions(true);
tableAbsorber.TextSearchOptions.Rectangle = new Rectangle(183.05000305175781, 39.596485137939453, 596.1500244140625, 53.190235137939453);
tableAbsorber.TextSearchOptions.LimitToPageBounds = true;
tableAbsorber.UseFlowEngine = true;
tableAbsorber.Visit(((PDFDocument)pDFDocument).PdfReader.Pages[criteria.Page]);
ICollection collection = tableAbsorber.TableList?.Select((Func<AbsorbedTable, IPDFTable>)((x) => new AsposePDFTable
{
    PageNo = criteria.Page,
    PdfTable = x
})).ToList();
@sakalasiva
Thank you for attaching the document. I studied it: it contains no text, and all the letters are drawn as graphics.
image.png (84.0 KB)
Therefore, the classes for working with text do not find anything. You should use a GraphicsAbsorber object instead (you can get a SubpathCollection with it), although the result will be graphics rather than text.
Sorry, I think the PDF file format changed when I printed that specific page. I am uploading the correct file again. Please have a look and help me.
SalesAndTaxes.pdf (110.6 KB)
This document does contain text. I’ll look into it and write to you later.
@sakalasiva
Using the code
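using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the sample file
var doc = new Document(dataDir + "SalesAndTaxes.pdf");
var tfa = new TextFragmentAbsorber();
tfa.Visit(doc.Pages[1]); // page numbering starts at 1
foreach (TextFragment textFragment in tfa.TextFragments)
{
    // skip fragments that contain only whitespace
    if (!string.IsNullOrWhiteSpace(textFragment.Text))
        Console.WriteLine($"{textFragment.Text} at ({textFragment.Position.XIndent},{textFragment.Position.YIndent})");
}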
I got text data as output.
Form ID.docx (13.4 KB)
Note that the full output is only available with a license (without a license, only 4 elements are returned).
By working with the (X, Y) values of the resulting fragments, you can select the lines you need and assemble them as required.
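As a rough illustration of that idea, here is a minimal sketch that groups fragments into rows by their Y value and orders each row by X. It does not call the Aspose API: `Cell`, `RowBuilder`, `GroupIntoRows`, `ColumnText`, and the default tolerance of 2 units are hypothetical names and values chosen for this example, and the X range of a column has to be read off your own document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the text, XIndent and YIndent values of one fragment.
public record Cell(string Text, double X, double Y);

public static class RowBuilder
{
    // Puts cells whose Y differs from the first cell of a row by at most
    // yTolerance into that row. Rows come out top to bottom (in PDF
    // coordinates a larger Y is higher on the page), cells left to right.
    public static List<List<Cell>> GroupIntoRows(IEnumerable<Cell> cells, double yTolerance = 2.0)
    {
        var rows = new List<List<Cell>>();
        foreach (var cell in cells.OrderByDescending(c => c.Y))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            // Compare against the first cell of the row so the row cannot drift downward.
            if (current != null && Math.Abs(current[0].Y - cell.Y) <= yTolerance)
                current.Add(cell);
            else
                rows.Add(new List<Cell> { cell });
        }
        return rows.Select(r => r.OrderBy(c => c.X).ToList()).ToList();
    }

    // Returns, for each row, the text of the cells whose X lies in [minX, maxX).
    public static List<string> ColumnText(IEnumerable<List<Cell>> rows, double minX, double maxX)
    {
        return rows
            .Select(r => string.Join(" ", r.Where(c => c.X >= minX && c.X < maxX).Select(c => c.Text)))
            .ToList();
    }
}
```

Each `Cell` could be filled from `textFragment.Text`, `textFragment.Position.XIndent`, and `textFragment.Position.YIndent` inside the loop shown earlier. The tolerance is a judgment call: too small and one visual row splits in two, too large and neighbouring rows merge.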
Thanks for the reply