Fastest method to get just the text from a document

I need to get all the text content from a PDF. I currently do it like this:

Document doc = new Document(path);
string text = doc.GetText();

This works, but for documents with a lot of large images, new Document() is really slow.

I have a 2MB PDF that takes 8 seconds for just “new Document()” to finish:

https://cdn.filestar.com/uploads/217ae964-ebd9-4d92-a799-ea728b2880a4/upload.zip

Is there a way to load the document faster, for example by ignoring images?

@nielsbosma I am afraid there is no way to make PDF document loading faster using Aspose.Words. Please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When loading a PDF, Aspose.Words needs to convert the fixed-page document structure into the flow Document Object Model, which is quite a resource-consuming operation. If you need to deal with PDF documents, you can consider using Aspose.PDF, which is designed to work with PDF documents.
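
As a rough sketch of the Aspose.PDF route, assuming the Aspose.PDF for .NET package is referenced: text can be collected with a TextAbsorber applied to all pages. The input path and the class name below are placeholders for illustration, and whether this loads faster than Aspose.Words for a given image-heavy file is something to measure rather than assume.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractPdfText
{
    static void Main(string[] args)
    {
        // Hypothetical input path; replace with the real PDF location.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        // Note: this is Aspose.Pdf.Document, not Aspose.Words.Document.
        using (Document pdfDocument = new Document(path))
        {
            // TextAbsorber gathers the text of every page it visits.
            TextAbsorber absorber = new TextAbsorber();
            pdfDocument.Pages.Accept(absorber);

            string text = absorber.Text;
            Console.WriteLine(text);
        }
    }
}
```

If a project references both Aspose.Words and Aspose.PDF, the Document type needs to be qualified with its namespace to avoid an ambiguous reference.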

Also, I would suggest using the following code to get the text of the document:

string text = doc.ToString(SaveFormat.Text);

OK, thanks. I will use Aspose.PDF for PDF documents instead.