Parallel TextFragmentAbsorber throwing an exception

bvk · January 20, 2020, 11:18pm

Hi all,

I am trying to process multiple documents concurrently and recently found that TextFragmentAbsorber throws an exception when used in parallel on medium-to-large documents (over ~300 KB). Example code with sample files attached:

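using System.Threading.Tasks;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var documents = new[]
{
	@"Sample1.pdf",
	@"Sample2.pdf"
};

// Each iteration creates its own Document and absorber; nothing is shared between threads.
Parallel.ForEach(documents, filePath =>
{
	var document = new Document(filePath);
	var textFragmentAbsorber = new TextFragmentAbsorber();
	foreach (var page in document.Pages)
	{
		page.Accept(textFragmentAbsorber);
	}
});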

Thanks!
Pdf_Lock_Documents.zip (634.3 KB)

asad.ali · January 21, 2020, 11:17am

@bvk

Thanks for contacting support.

Please note that the Aspose.PDF API supports multi-threading as long as each document is accessed by only one thread at a time. In other words, a single document should be processed in one thread only.

The Parallel.ForEach method executes multiple iterations at the same time on different processors or processor cores, which may open the possibility of synchronization problems. That is why we do not recommend this approach for processing multiple PDF files. Please try a one-thread-per-PDF-document approach, and if you still face any issue, please feel free to let us know.

bvk · January 21, 2020, 6:53pm

@asad.ali,

Is there a way you recommend to use Aspose.Pdf in a multi-threaded manner? I have tried a couple of other ways of multithreading and still run into the same exception.

Sample code using tasks:

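using System.Collections.Generic;
using System.Threading.Tasks;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var documents = new[]
{
	@"Sample1.pdf",
	@"Sample2.pdf"
};

// One task per document; each task owns its Document and absorber.
var tasks = new List<Task>();
foreach (var filePath in documents)
{
	var task = new Task(() =>
	{
		var document = new Document(filePath);
		var textFragmentAbsorber = new TextFragmentAbsorber();
		foreach (var page in document.Pages)
		{
			page.Accept(textFragmentAbsorber);
		}
	});
	task.Start();
	tasks.Add(task);
}

// Block until every document has been processed.
Task.WaitAll(tasks.ToArray());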

Sample code using threads:

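using System.Collections.Generic;
using System.Threading;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var documents = new[]
{
	@"Sample1.pdf",
	@"Sample2.pdf"
};

// One dedicated thread per document; each thread owns its Document and absorber.
var threads = new List<Thread>();
foreach (var filePath in documents)
{
	var thread = new Thread(() =>
	{
		var document = new Document(filePath);
		var textFragmentAbsorber = new TextFragmentAbsorber();
		foreach (var page in document.Pages)
		{
			page.Accept(textFragmentAbsorber);
		}
	});

	thread.Start();
	threads.Add(thread);
}

// Wait for all threads to finish.
foreach (var thread in threads)
{
	thread.Join();
}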

Thanks!

asad.ali · January 22, 2020, 12:47pm

@bvk

We will surely prepare a recommended example of using the API in a multi-threaded environment and add the respective articles to the API documentation as well. The corresponding task has been logged as PDFNET-47604 in our issue tracking system, and we will let you know as soon as it is closed. Please spare us some time.

We are sorry for the inconvenience.

bvk · January 22, 2020, 3:55pm

@asad.ali,

Upon a little further investigation, this seems to happen specifically with documents that contain text in a right-to-left language, such as the Hebrew documents I attached. Multithreading works on documents that are entirely left-to-right, but with PDFs that contain right-to-left languages (e.g. Hebrew, Arabic) any parallelization attempt throws an exception.

asad.ali · January 22, 2020, 5:27pm

@bvk

We have updated the issue details according to your latest comments and will surely inform you as soon as some updates are available. Please spare us some time.

bvk · March 27, 2020, 4:21pm

@asad.ali,

I was wondering if there has been any update on this, specifically on parallelizing right-to-left formatted documents?

asad.ali · March 27, 2020, 7:43pm

@bvk

The investigation is almost complete. A serious flaw was found in the processing of right-to-left text. The only workaround we can recommend for now is to avoid processing two or more documents with a large amount of right-to-left text in parallel.

The final solution requires rewriting a large volume of code, and it may take several months. We will share further news with you as soon as we have some. Please give us some time.
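
As a rough illustration of that workaround (a sketch only, not an official Aspose sample), the extraction step could be funneled through a single lock so that no two documents are processed at the same moment, even when the surrounding code runs in parallel. The class name `SerializedExtraction`, the `ExtractionGate` field, and the `extractText` delegate are hypothetical names introduced here; whether serializing in this way actually avoids the exception has not been verified in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SerializedExtraction
{
    // Hypothetical process-wide gate: only one thread may run extraction at a time.
    private static readonly object ExtractionGate = new object();

    // Runs extractText once per path. The loop is parallel, but the lock
    // ensures the extraction bodies never overlap.
    public static void ProcessAll(IEnumerable<string> filePaths, Action<string> extractText)
    {
        Parallel.ForEach(filePaths, filePath =>
        {
            // Work that does not involve text extraction could run here, outside the lock.
            lock (ExtractionGate)
            {
                extractText(filePath);
            }
        });
    }
}
```

The delegate passed as `extractText` would hold the body of the original loop: create the `Document` from `filePath`, create a `TextFragmentAbsorber`, and call `page.Accept` for each page. This gives up the parallel speed-up for the extraction step, so it is mainly useful when other per-document work can still run concurrently.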

We are sorry for the inconvenience.