Unable to read the entire content from a PDF file

prnksheela · September 25, 2024, 4:49am

Extract text from PDF C#|Aspose.PDF for .NET: I am attempting to read the content of a PDF file, but extraction always returns only the first line instead of the full content. I have tried the code samples from that page, and all of them yield the same result. A sample document is attached; the issue is reproducible for all documents.

Please help us read the full content of files up to 400 MB as quickly as possible.
Sabatier-1995.pdf (57.5 KB)

sergei.shibanov · September 25, 2024, 3:23pm

@prnksheela
With library version 24.09, the text from the attached document is extracted completely in my environment. Perhaps you are using an earlier version of the library; it had very strict restrictions when used without a license. These have been relaxed in the latest versions, but you still will not be able to read the full contents of a 400 MB file without a license.
You can get a free 30-day temporary license to evaluate the product without any limitations. Temporary License - Purchase - aspose.com

prnksheela · September 27, 2024, 6:17am

We have a Professional license from Aspose for PDF, and we are using version 23.12.

Please also provide us with the sample code that you have tried to extract the full content from the attached PDF.

sergei.shibanov · September 27, 2024, 1:23pm

@prnksheela

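using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the PDF (defined elsewhere).
var pdfDocument = new Document(dataDir + "Sabatier-1995.pdf");
// Collect the text of every page with a single absorber.
var textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
string extractedText = textAbsorber.Text;

// Save the extracted text next to the source document.
using (var tw = new StreamWriter(dataDir + "extracted-text.txt"))
{
    tw.WriteLine(extractedText);
}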

In fact, this is the code already given, simply better formatted.
With version 23.12 of the library and the document you attached, it also worked for me. Still, it is better to use the latest version of the library.

Or perhaps I am missing something; just in case, I am attaching the result file.
extracted-text.zip (276 Bytes)

prnksheela · October 3, 2024, 9:50am

Even when I use the above code, I am unable to read the content fully. However, after adding explicit code to invoke the license, it worked for me:

Aspose.Pdf.License license = new Aspose.Pdf.License();
license.SetLicense("Aspose.Pdf.lic");

Sample code used:
Aspose.Pdf.License license = new Aspose.Pdf.License();
license.SetLicense("Aspose.Pdf.lic");
using (var pdfDocument = new Aspose.Pdf.Document(file))
{
    var textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
    pdfDocument.Pages.Accept(textAbsorber);
    string extractedText = textAbsorber.Text;
}

sergei.shibanov · October 3, 2024, 1:56pm

@prnksheela
Glad you solved the issue.