Free Support Forum - aspose.com

Aspose.PDF Parallelization issue on extracting text

Hello.

I'm opening this ticket on behalf of a corporate customer I work for, who holds a license for Aspose.PDF NuGet version 20.3.

There seems to be an issue when extracting text in parallel from documents that contain right-to-left (Hebrew) text. I also found some related topics under ticket PDFNET-47604.
I searched the release notes to see whether this had been fixed but didn't find anything.

I'm using the Task Parallel Library to build a pipeline in which text is extracted from PDF documents in parallel. Degrees of parallelism from 2 up to half the processor count all cause issues.

The method that calls Aspose to extract text is included below, followed by the exception.

The method does not cause issues in a single-threaded environment.
I checked whether the problem could come from how we handle parallelism, and it's safe to say that each parallel call receives the correct stream, positioned at the start (position 0).
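
One way to separate a pipeline problem from a library problem is to keep the pipeline parallel but let only one extraction call run at a time. The following is a minimal sketch of that check, not a confirmed fix: `SerializedExtraction` and `Serialize` are hypothetical names, and the idea that the failure comes from state shared inside the library is an untested assumption.

```csharp
using System;
using System.IO;

public static class SerializedExtraction
{
    // Single process-wide gate; every wrapped extractor shares it.
    private static readonly object Gate = new object();

    // Wraps an extractor so that at most one call executes at any moment.
    // The surrounding pipeline stays parallel; only the wrapped call is serialized.
    public static Func<Stream, string> Serialize(Func<Stream, string> extract)
    {
        if (extract == null) throw new ArgumentNullException(nameof(extract));

        return stream =>
        {
            lock (Gate)
            {
                return extract(stream);
            }
        };
    }
}
```

Passing `SerializedExtraction.Serialize(GetTextFromStream)` to the pipeline instead of `GetTextFromStream` keeps everything else unchanged. If the exception disappears under the same load, that points at concurrent calls into the library and away from stream handling in the pipeline.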

Attached is a file that triggers the issue. To reproduce, it can be read in parallel multiple times (under different names).
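
The reproduction described above can be sketched roughly as follows. This is an illustrative harness and not the customer's actual pipeline: `ParallelExtractionRepro`, `Run`, `copies`, and `degree` are hypothetical names. The extractor is passed in as a delegate so the harness itself has no dependency on Aspose.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ParallelExtractionRepro
{
    // Calls `extract` `copies` times in parallel, each call on its own
    // MemoryStream over the same bytes, and returns the exceptions thrown.
    public static ConcurrentBag<Exception> Run(
        byte[] pdfBytes, Func<Stream, string> extract, int copies, int degree)
    {
        var failures = new ConcurrentBag<Exception>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };

        Parallel.For(0, copies, options, i =>
        {
            // A fresh read-only stream per call, starting at position 0,
            // so no stream instance is shared between threads.
            using (var stream = new MemoryStream(pdfBytes, writable: false))
            {
                try
                {
                    extract(stream);
                }
                catch (Exception ex)
                {
                    failures.Add(ex);
                }
            }
        });

        return failures;
    }
}
```

A call such as `ParallelExtractionRepro.Run(File.ReadAllBytes("menu.pdf"), GetTextFromStream, copies: 32, degree: Math.Max(2, Environment.ProcessorCount / 2))` would exercise the method below with the attached file; the value 32 is arbitrary. Because every call gets its own stream, any exceptions collected cannot be attributed to shared stream state.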

Method
private string GetTextFromStream(Stream stream)
{
    using (Document pdfDocument = new Document(stream))
    {
        TextAbsorber textAbsorber = new TextAbsorber();
        // Accept the absorber for all the pages
        pdfDocument.Pages.Accept(textAbsorber);

        return textAbsorber.Text;
    }
}

Exception
The exception I'm receiving is the following:
System.IndexOutOfRangeException: Index was outside the bounds of the array.
at #=zQTThJ8SKC3ertqDnRus0uU8CWygoEl$dOw==.#=zuAI1aQWzuSG$JnUezQ==(String #=zQLozSMs=, Boolean #=z986EvMzS5RWhU3Ec8g==)
at #=zvUvDSaHMGFaxsrY5mkR01u15_BJTh9iTiA==.#=z_mejqdA=(String #=zkdyn77I=, Boolean #=z8IirF1pNGfzaHWTBGA==)
at #=z$qK8V0DhfSurhk$CxIrcBgEmQWuJp0vTNwUoftmyFAWOI2FRJg==.#=zPjCAc_8MXKIGmdwFzQ==(String #=zjuFyRwxEgKBvTcVDLQ==, LanguageTransformation #=zEK1H7UL8p5uW)
at #=z$qK8V0DhfSurhk$CxIrcBgEmQWuJp0vTNwUoftmyFAWOI2FRJg==.#=zPjCAc_8MXKIGmdwFzQ==(String #=zjuFyRwxEgKBvTcVDLQ==)
at #=zUunrrTVVeMMximvxJcKC5vAS3CH$EY7jj0VHcen1DNGLI4ZakQ==…ctor(String #=z3jrBwwQ=, #=zFxXZ3O4= #=zz8TQ5zs=, #=zFxXZ3O4= #=z8fq75y4pZ8Al, #=zaMPLq29GnytDlsw_a9$mM3EPQcm3WXNKcgDBpRxvWBPjxKHRT6SfLyk= #=zyVU99qNHunrf, #=zNnZKUM1IXHHjum$VoZx63wPG2DKpbVJS_6ifnKU= #=zoF0XDTs=, #=zTqXlOYlaxvwI3cvKbLLdxEuAppkpVLsBSRdnnimBnJ377Kdds7Ry2no= #=z9sIS6CR$1s6vC$b1Ww==, Double #=zJaRzr9M=, Double #=z6WAGitM=, #=zxClYwO_U2PJlynuaEcNw59TFtHZ2nGl8nBeR7DuS5MkQaHklKg== #=zL9U5X9I=, #=zSqzusIrgJAILlW0lx8UB_IPtm_AUpItEMF2J_3hc8H97fYp4z_Bjajc= #=zpI2X$HA=)
at #=z3AXh6KtNIg7GyS66nJhaeAtBVrFyLOstuZb$15TlByUnbvLiIWwczUs=.#=znb_go$KohrGB(String #=z3jrBwwQ=, #=zFxXZ3O4= #=zz8TQ5zs=, #=zFxXZ3O4= #=z8fq75y4pZ8Al, #=zaMPLq29GnytDlsw_a9$mM3EPQcm3WXNKcgDBpRxvWBPjxKHRT6SfLyk= #=zyVU99qNHunrf, #=zNnZKUM1IXHHjum$VoZx63wPG2DKpbVJS_6ifnKU= #=zoF0XDTs=, #=zTqXlOYlaxvwI3cvKbLLdxEuAppkpVLsBSRdnnimBnJ377Kdds7Ry2no= #=z9sIS6CR$1s6vC$b1Ww==, Double #=zJaRzr9M=, Double #=z6WAGitM=, #=zxClYwO_U2PJlynuaEcNw59TFtHZ2nGl8nBeR7DuS5MkQaHklKg== #=zL9U5X9I=, #=zSqzusIrgJAILlW0lx8UB_IPtm_AUpItEMF2J_3hc8H97fYp4z_Bjajc= #=zpI2X$HA=)
at #=zaMPLq29GnytDlsw_a9$mM3EPQcm3WXNKcgDBpRxvWBPjxKHRT6SfLyk=.#=zRZeS4f9MT8fB(Int32 #=zj_tKyxE=, Int32 #=zBtn42LQWv3R4, Operator #=zaD_aPjs=, #=zxClYwO_U2PJlynuaEcNw59TFtHZ2nGl8nBeR7DuS5MkQaHklKg== #=zL9U5X9I=)
at #=zaMPLq29GnytDlsw_a9$mM3EPQcm3WXNKcgDBpRxvWBPjxKHRT6SfLyk=.#=zEKC7wSJZc9nC(#=zFxXZ3O4= #=zz8TQ5zs=)
at #=zaMPLq29GnytDlsw_a9$mM3EPQcm3WXNKcgDBpRxvWBPjxKHRT6SfLyk=.#=zixpehgs=(Int32 #=zj_tKyxE=, Operator #=zaD_aPjs=)
at #=zaMPLq29GnytDlsw_a9$mM3EPQcm3WXNKcgDBpRxvWBPjxKHRT6SfLyk=.#=zGrsLAi0=()
at #=zmrx0$ubGDtAlE70HNdTzzaL4T0FlDXH02CTSyUnU6lMLqU1kg5hRvu2YDioE.#=zr4P8$RiCp60V(BaseOperatorCollection #=zPDa5OLA=, Resources #=zoF0XDTs=, Page #=zVCc1ATo=)
at #=zmrx0$ubGDtAlE70HNdTzzaL4T0FlDXH02CTSyUnU6lMLqU1kg5hRvu2YDioE.#=zr4P8$RiCp60V(BaseOperatorCollection #=zPDa5OLA=, Resources #=zoF0XDTs=)
at #=zmrx0$ubGDtAlE70HNdTzzaL4T0FlDXH02CTSyUnU6lMLqU1kg5hRvu2YDioE.#=z3eucivM=()
at #=zmrx0$ubGDtAlE70HNdTzzaL4T0FlDXH02CTSyUnU6lMLqU1kg5hRvu2YDioE…ctor(Page #=zVCc1ATo=, TextSearchOptions #=zBOGcYXTDcraD)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.PageCollection.Accept(TextAbsorber visitor)

File that causes issues:
menu.pdf (241.3 KB)

@steph0125

We tested the scenario with the latest version, Aspose.PDF for .NET 22.4, and could not reproduce the reported issue. Please try Aspose.PDF for .NET 22.4.