Extract Text from PDF File Line by Line and Save Data Values inside SQL Server Database C# .NET

rdaviessci · March 10, 2021, 7:43pm

Hello, I am working on a different issue now, where I need to go through a 2700+ page PDF and extract the text from it line by line. Each line provides data values that are read and then inserted into a SQL Server database. AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf (4.1 MB)

So I need the easiest way to read through the lines, looking for titles, headers etc. and skipping them, and locating the data values so they can go into the database.
I am attaching the file I will be parsing. I am using VB.Net. Thank you!

awais.hafeez · March 11, 2021, 8:24am

@rdaviessci,

I see the following exception when loading this PDF with the latest version (21.3) of Aspose.Words for .NET on my end:

System.IO.FileLoadException
  HResult=0x80131621
  Message=The file cannot be opened. It might have unsupported format or be corrupted.
  Source=Aspose.Words

Inner Exception 1:
InvalidOperationException: Pdf corrupt.

Inner Exception 2:
OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.

I have logged this problem in our issue tracking system. The ID of this issue is WORDSNET-21967. We will look further into the details of this problem and keep you updated on the status of the fix. Sorry for the inconvenience.

rdaviessci · March 11, 2021, 5:47pm

I am able to open it with version 21.2.0. I also have a lot of memory on my machine. Can you give me an example, or a link to an example, that just reads through the file line by line? When we used to receive these reports as .TXT files, I used to open them in .Net with StreamReader and step through the report with ReadLine. I’d like that same functionality with Aspose.PDF. Thank you!

rdaviessci · March 11, 2021, 8:00pm

Here is a smaller subset of the report. My machine RAM is 32GB so I don’t have trouble with the larger one. TRA___09_03_2020 1-100.pdf (159.3 KB)

awais.hafeez · March 12, 2021, 10:11am

@rdaviessci,

I am afraid Aspose.Words’ PDF to Word conversion module was not designed to process such large PDF files (this one has 2724 pages). On our dev PC, the conversion got stuck after running for 8 minutes and consuming 17.3 GB of RAM.
We don’t have plans to optimize the PDF to Word conversion module for very large PDFs at the moment.
We tested another idea, processing the PDF pages one by one, and it worked really well. This approach requires only 234 MB of RAM and takes 12 minutes to convert all PDF pages. Here is the code that we used:

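using Aspose.Words;
using Aspose.Words.Loading;

var pdfFile = "AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf";
// Load exactly one page per Document so memory use stays low.
var loadOptions = new PdfLoadOptions() { PageIndex = 0, PageCount = 1 };

for (var i = 0; i < 2724; i++)
{
    // Zero-based index of the page to load on this iteration.
    loadOptions.PageIndex = i;
    var doc = new Document(pdfFile, loadOptions);
    doc.Save($"page_{i:D4}.docx");
}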

Regarding text extraction, you can get string representations of all the Paragraphs using the following code:

Document doc = new Document("source.pdf");

foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ParagraphFormat.StyleIdentifier != StyleIdentifier.Heading1) // or Heading2 or Heading3
    {
        string text = para.ToString(SaveFormat.Text).Trim();
        // process this text and extract data values to store in DB
    }
}
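Since you mentioned stepping through the old .TXT reports with StreamReader and ReadLine, here is a minimal sketch of the same pattern applied to text that has already been extracted. It uses only the .NET base class library. The class name ReportLineReader, the method ReadDataRows, and the rule that a data line starts with a digit are assumptions made for illustration; they are not part of Aspose.Words and will likely need adjusting to the real report layout.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class ReportLineReader
{
    // Assumption: data lines start with a digit, while titles and
    // column headers do not. Change this rule to match the real report.
    private static readonly Regex DataLine = new Regex(@"^\d");

    // Steps through the text one line at a time, like StreamReader.ReadLine,
    // and yields the whitespace-separated fields of each data line.
    public static IEnumerable<string[]> ReadDataRows(string extractedText)
    {
        using (var reader = new StringReader(extractedText))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                line = line.Trim();

                // Skip blank lines and anything that does not look like data.
                if (line.Length == 0 || !DataLine.IsMatch(line))
                    continue;

                yield return Regex.Split(line, @"\s+");
            }
        }
    }
}
```

In the paragraph loop above, the trimmed text of each paragraph could be passed to ReadDataRows, and each returned array could then be mapped to the parameters of your INSERT statement.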

aspose.notifier · April 11, 2021, 7:21am

The issue you reported earlier (filed as WORDSNET-21967) has been fixed in the Aspose.Words for .NET 21.4 update and the Aspose.Words for Java 21.4 update.

alexey.noskov · January 20, 2024, 6:13am

A post was split to a new topic: Convert PDF to TXT using Aspose