Cannot find text with Hebrew regex

erdeiga · February 16, 2024, 9:08am

Hi Team!

There is an issue when searching for Hebrew text in a PDF with a regex. I have a C# Regex object with a Hebrew pattern, and when I use it with the TextFragmentAbsorber it doesn’t find anything, even though the attached PDF file contains the text that should be matched.

When I extract the Page text with the TextAbsorber the searched word contains some spaces in the output text and I don’t know why.

PDF file: test-dlp.pdf (46.4 KB)

Regex pattern: סודי

Sample Project: sample-project.zip (2.0 KB)

Aspose.PDF: 24.1.0

If I change the pattern (reverse and add spaces) then there is a match. Edited pattern: י ס וד
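A rough sketch of how that manual workaround could be generated from the original pattern instead of typed by hand. This is only an assumption about what the extracted text looks like: it handles the word in its normal (logical) order or fully reversed, with optional whitespace between characters, and the exact character order the library produces is not known here. `RtlPatternSketch` and `BuildTolerantRegex` are hypothetical names, not part of Aspose.PDF, and the code uses only the standard .NET `Regex` class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not an Aspose.PDF API.
public static class RtlPatternSketch
{
    // Builds a regex that matches `word` either in logical order or
    // fully reversed, allowing optional whitespace between characters.
    // Assumes a plain literal word without combining marks (niqqud).
    public static Regex BuildTolerantRegex(string word)
    {
        string logical = JoinWithOptionalSpaces(word);
        string reversed = JoinWithOptionalSpaces(Enumerable.Reverse(word));
        return new Regex("(?:" + logical + ")|(?:" + reversed + ")");
    }

    // Escapes each character and joins them with \s* so that stray
    // spaces in the extracted text do not break the match.
    private static string JoinWithOptionalSpaces(IEnumerable<char> chars)
    {
        return string.Join(@"\s*", chars.Select(c => Regex.Escape(c.ToString())));
    }

    public static void Main()
    {
        string word = "סודי";
        Regex regex = BuildTolerantRegex(word);

        // Build a reversed, space-separated sample programmatically
        // to avoid bidi display confusion in the source file.
        string reversedWithSpaces = string.Join(" ", Enumerable.Reverse(word));

        Console.WriteLine(regex.IsMatch(word));               // True
        Console.WriteLine(regex.IsMatch(reversedWithSpaces)); // True
    }
}
```

The resulting Regex could then be used in place of the hand-edited pattern. Whether it matches a particular PDF depends on how the text is actually stored and extracted, so it is a stopgap rather than a fix.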

Why doesn’t the original pattern match, and why are there extra spaces in the extracted text?

andriy.andrukhovski · February 16, 2024, 9:52am

@erdeiga
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56566

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

zpopswat · July 7, 2025, 1:29pm

Hi Team,

Is there any ETA for this fix? Thank you!

sergei.shibanov · July 30, 2025, 8:58am

@zpopswat
We’ve investigated the issue and found that it requires significant changes to several components related to right-to-left text handling. We’ll continue to investigate and track the issue internally, but a fix won’t be available anytime soon as we’re prioritizing paid support and can’t provide an ETA.