Hello, I am working on a different issue now, where I need to go through a 2700+ page PDF and extract the text from it line by line. Each line provides data values that are read and then inserted into a SQL Server database. AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf (4.1 MB)
So I need the easiest way to read through the lines, looking for titles, headers etc. and skipping them, and locating the data values so they can go into the database.
I am attaching the file I will be parsing. I am using VB.Net. Thank you!
@rdaviessci,
I see the following exception when loading this PDF with the latest 21.3 version of Aspose.Words for .NET on my end:
System.IO.FileLoadException
HResult=0x80131621
Message=The file cannot be opened. It might have unsupported format or be corrupted.
Source=Aspose.Words
Inner Exception 1:
InvalidOperationException: Pdf corrupt.
Inner Exception 2:
OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
I have logged this problem in our issue tracking system. The ID of this issue is WORDSNET-21967. We will further look into the details of this problem and will keep you updated on the status of correction. Sorry for the inconvenience.
I am able to open it with version 21.2.0. I also have a lot of memory on my machine. Can you give me an example, or a link to an example, that just reads through the file line by line? When we used to receive these reports as .TXT files, I used to open them in .Net with StreamReader and step through the report with ReadLine. I’d like that same functionality with Aspose.PDF. Thank you!
Here is a smaller subset of the report. My machine RAM is 32GB so I don’t have trouble with the larger one. TRA___09_03_2020 1-100.pdf (159.3 KB)
@rdaviessci,
- I am afraid Aspose.Words' PDF to Word conversion module was not designed to process PDF files as large as this one, which has 2724 pages. On our dev PC, the conversion got stuck after running for 8 minutes and consuming 17.3 GB of RAM.
- We don't have plans to optimize the PDF to Word conversion module for very large PDFs at the moment.
- We tested another idea, processing the PDF pages one by one, and it worked really well. This approach requires only 234 MB of RAM and takes 12 minutes to convert all PDF pages. Here is the code that we used:
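    var pdfFile = "AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf";
    // Load a single page per iteration to keep memory usage low.
    var loadOptions = new PdfLoadOptions() { PageIndex = 0, PageCount = 1 };
    for (var i = 0; i < 2724; i++)
    {
        // Select the i-th page (PageIndex is zero-based).
        loadOptions.PageIndex = i;
        var doc = new Document(pdfFile, loadOptions);
        // Save each page as its own DOCX file: page_0000.docx, page_0001.docx, ...
        doc.Save($"page_{i:D4}.docx");
    }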
Regarding text extraction, you can get string representations of all the paragraphs using the following code:
    Document doc = new Document("source.pdf");
    foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
    {
        if (para.ParagraphFormat.StyleIdentifier != StyleIdentifier.Heading1) // or Heading2 or Heading3
        {
            string text = para.ToString(SaveFormat.Text).Trim();
            // process this text and extract data values to store in DB
        }
    }
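For the StreamReader-style workflow asked about earlier, the page-by-page loading and the text extraction can be combined. The following is only a minimal sketch, not a tested solution: the `PdfLineReader` class and its `ReadLines` method are hypothetical names introduced for illustration, and the page count is passed in by the caller (2724 for the full report, as in the loop above). Whether each returned line corresponds to exactly one report row depends on how the PDF import reconstructs paragraphs, so that should be checked against the smaller sample file first. The same calls should be usable from VB.Net.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.Words;
using Aspose.Words.Loading;

// Hypothetical helper: the class and method names are illustrative only.
static class PdfLineReader
{
    // Yields the text of a PDF one line at a time, loading a single page per
    // iteration so that memory usage stays low. The page count is supplied
    // by the caller (an assumption of this sketch).
    public static IEnumerable<string> ReadLines(string pdfFile, int pageCount)
    {
        for (var i = 0; i < pageCount; i++)
        {
            // PageIndex is zero-based; PageCount = 1 loads only that page.
            var loadOptions = new PdfLoadOptions { PageIndex = i, PageCount = 1 };
            var doc = new Document(pdfFile, loadOptions);

            // Convert the single-page document to plain text.
            string pageText = doc.ToString(SaveFormat.Text);

            // Step through the page text the same way StreamReader.ReadLine would.
            using (var reader = new StringReader(pageText))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (line.Length > 0)
                        yield return line; // skip blank lines
                }
            }
        }
    }

    static void Main()
    {
        var pdfFile = "AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf";
        foreach (var line in ReadLines(pdfFile, 2724))
        {
            // Decide here whether the line is a title/header or a data row,
            // then parse the values and insert them into the database.
            Console.WriteLine(line);
        }
    }
}
```

Because `ReadLines` is an iterator, only one page is held in memory at a time, and the header-skipping and database logic can stay in the consuming loop, much like the earlier ReadLine-based code for the .TXT reports.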
The issue you found earlier (filed as WORDSNET-21967) has been fixed in the Aspose.Words for .NET 21.4 and Aspose.Words for Java 21.4 updates.