@rdaviessci,
- I am afraid Aspose.Words’ PDF to Word conversion module was not designed to process PDF files as large as this one, which has 2724 pages. On our dev PC, the conversion got stuck after running for 8 minutes and consuming 17.3 GB of RAM.
- We don’t have plans to optimize the PDF to Word conversion module for very large PDFs at the moment.
- We tested another approach, processing the PDF pages one by one, and it worked really well. This approach requires only 234 MB of RAM and takes 12 minutes to convert all PDF pages. Here is the code that we used:
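using Aspose.Words;
using Aspose.Words.Loading;

var pdfFile = "AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf";
// Load a single page per Document instance to keep memory usage low.
var loadOptions = new PdfLoadOptions() { PageIndex = 0, PageCount = 1 };
for (var i = 0; i < 2724; i++)
{
    // Advance to the i-th page (zero-based) on each iteration.
    loadOptions.PageIndex = i;
    var doc = new Document(pdfFile, loadOptions);
    doc.Save($"page_{i:D4}.docx");
}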
Regarding text extraction, you can get string representations of all the Paragraphs by using the following code:
using Aspose.Words;

Document doc = new Document("source.pdf");
// Walk every paragraph in the document, including nested ones.
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    // Skip heading paragraphs and keep body text only.
    if (para.ParagraphFormat.StyleIdentifier != StyleIdentifier.Heading1) // or Heading2 or Heading3
    {
        string text = para.ToString(SaveFormat.Text).Trim();
        // process this text and extract data values to store in DB
    }
}
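For a PDF of this size, loading the whole file just to read its paragraphs would probably run into the same memory problem as the full conversion. A minimal sketch that combines the two snippets above, loading one page at a time and collecting the paragraph text from it, could look like the following. It is only an illustration and has not been run against your file: the `totalPages` constant and the `extracted` list are names introduced here, and a paragraph that continues across a page break will most likely come out as two separate strings.

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Loading;

// Assumption: same file and page count as in the conversion loop above.
var pdfFile = "AT68AH71___Section_99_Factor_Report_by_TRA___09_03_2020.pdf";
const int totalPages = 2724;

var loadOptions = new PdfLoadOptions() { PageIndex = 0, PageCount = 1 };
var extracted = new List<string>(); // hypothetical holder for the extracted text

for (var i = 0; i < totalPages; i++)
{
    // Load only page i so that memory usage stays bounded.
    loadOptions.PageIndex = i;
    var doc = new Document(pdfFile, loadOptions);

    foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
    {
        // Skip top-level headings; extend the check for Heading2/Heading3 if needed.
        if (para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1)
            continue;

        string text = para.ToString(SaveFormat.Text).Trim();
        if (text.Length > 0)
            extracted.Add(text); // replace with your own DB processing
    }
}
```

Collecting the strings in a list keeps the sketch short. If the extracted text itself becomes large, processing each page's paragraphs inside the loop and discarding them afterwards should keep memory usage flat.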