Unable to extract text from large PDF files

abdulkadirsabirbohari · July 11, 2023, 8:32am

I have a large PDF file of around 92k pages and am not able to extract the text from it; I get an out-of-memory exception. Please look into this. I tried the following code.
Code 1:

`using (Document pdfDocument = new Document(pdfPath))
{
	foreach (Page pdfPage in pdfDocument.Pages)
	{
		using (MemoryStream textStream = new MemoryStream())
		{
			// Create text device
			TextDevice textDevice = new TextDevice(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));

			// Convert a particular page and save text to the stream
			textDevice.Process(pdfPage, textStream);
			textStream.Close();
			pdfText += $"{Encoding.Unicode.GetString(textStream.ToArray())} ";
		}

		pdfPage.Dispose();
	}
}
`

Code 2:

// Open document
using (Document pdfDocument = new Document(pdfPath))
{
	// Create TextAbsorber object to extract text
	TextAbsorber textAbsorber = new TextAbsorber();

	// Accept the absorber for all the pages
	pdfDocument.Pages.Accept(textAbsorber);

	// Get the extracted text
	return textAbsorber.Text;
}

I am not able to attach the PDF here.

sergei.shibanov · July 11, 2023, 9:31am

@abdulkadirsabirbohari
To reproduce the issue, I need the file with which this happens. You can upload it to the cloud and attach the link here.

abdulkadirsabirbohari · July 11, 2023, 11:44am

You can download it from this link.

sergei.shibanov · July 11, 2023, 2:32pm

@abdulkadirsabirbohari
Thank you for attaching the file.
I downloaded and unpacked it. When viewing it in Adobe Acrobat I got error messages (Error.png (39.0 KB); page 37291 is one example). I checked the file in Adobe Preflight and it showed a similar error.
Adobe Preflight.jpg (196.3 KB)

So the cause is that the file itself is damaged.

abdulkadirsabirbohari · July 17, 2023, 5:25am

If possible, please take the same example and try it; I can view the PDF in Adobe Acrobat. Please provide a working solution.

sergei.shibanov · July 17, 2023, 3:04pm

When you go to some pages (for example, to 37291), an error message is displayed. You can try processing page by page and then concatenate the results for those pages whose processing was successful.

abdulkadirsabirbohari · July 18, 2023, 7:00am

Can you share a code sample for this, so I can try it out on my end?

sergei.shibanov · July 18, 2023, 10:05am

@abdulkadirsabirbohari
For the first code snippet, use:
Page pdfPage = pdfDocument.Pages[i]
inside a for loop.

For the second code snippet, use:
pdfDocument.Pages[i].Accept(textAbsorber); // where i is the for loop variable
instead of:
pdfDocument.Pages.Accept(textAbsorber);
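
A minimal sketch of the page-by-page idea, in case it helps. The class `PageByPageExtraction`, the method `ExtractPages` and its `extractPage` delegate are hypothetical names made up for this illustration, not part of Aspose.PDF. It assumes that a damaged page fails with a catchable exception, which has not been verified against the file from this thread. Each page's text is written to a `TextWriter` as soon as it is extracted, instead of being appended to one growing string, and the numbers of the failed pages are returned so they can be checked later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PageByPageExtraction
{
    // Calls extractPage for pages 1..pageCount (the Aspose.PDF page collection is 1-based),
    // writes each successful result to output followed by a space,
    // and returns the numbers of the pages whose extraction threw.
    public static List<int> ExtractPages(int pageCount, Func<int, string> extractPage, TextWriter output)
    {
        var failedPages = new List<int>();
        for (int i = 1; i <= pageCount; i++)
        {
            try
            {
                string pageText = extractPage(i);
                output.Write(pageText);
                output.Write(' ');
            }
            catch (Exception)
            {
                // Skip the page that could not be processed and remember its number.
                failedPages.Add(i);
            }
        }
        return failedPages;
    }
}

// Possible wiring with Aspose.PDF (untested assumption, "output.txt" is a placeholder):
// using (var pdfDocument = new Aspose.Pdf.Document(pdfPath))
// using (var writer = new StreamWriter("output.txt"))
// {
//     List<int> failed = PageByPageExtraction.ExtractPages(pdfDocument.Pages.Count, i =>
//     {
//         // A fresh absorber per page, so text from earlier pages is not kept around.
//         var textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
//         pdfDocument.Pages[i].Accept(textAbsorber);
//         return textAbsorber.Text;
//     }, writer);
// }
```

Writing to a file rather than returning one string is a deliberate choice here: with around 92k pages, the concatenated text alone may be large enough to contribute to the out-of-memory exception.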