TextAbsorber adds extra spaces when extracting from a French PDF

Telsin.Brower · September 11, 2017, 10:47pm

We are using version 17.8.
Here’s the code to extract text from the attached pdf:

    public IList<string> GetPdfPagesAsStrings(byte[] bytes)
    {
        var resultPagesContent = new List<string>();
        if (bytes == null || bytes.Length == 0)
        {
            return resultPagesContent;
        }
        using (var stream = new MemoryStream(bytes))
        {
            InitializeConverter(); // this just sets the Aspose.Pdf license

            // Open document
            var pdfDocument = new Document(stream);

            // Aspose.Pdf page numbers are 1-based
            for (var pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
            {
                var textAbsorber = new TextAbsorber();
                textAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
                textAbsorber.TextSearchOptions.LimitToPageBounds = true;
                var page = pdfDocument.Pages[pageNumber];
                page.Accept(textAbsorber);
                var pageText = textAbsorber.Text;

                resultPagesContent.Add(pageText);
            }
            return resultPagesContent;
        }
    }

Text from page 1 - notice the space after the accent in “obste´ tricales”:
Morbidite´ maternelle grave par causes\r\nobste´ tricales directes en Afrique de l’Ouest :\r\nincidence et le´ talite´\r\nA. Prual,1 M.-H. Bouvier-Colle,2 L. de Bernis,3 G. Bre´art4 et le groupe MOMA5\r\nLes donne´es sur la morbidite´ maternelle permettent d’e´valuer le nombre des femmes susceptibles de re´clamer des\r\nsoins obste´tricaux essentiels, et d’organiser, de surveiller et d’e´valuer les programmes de maternite´ sans risque. Le\r\npre´sent article propose des de´finitions ope´rationnelles de la morbidite´ maternelle grave et fournit des donne´es sur la\r\nfre´quence de cette morbidite´ telle qu’elle ressort d’une enqueˆte en population portant sur une cohorte de\r\n20 326 femmes enceintes dans six pays d’Afrique occidentale. La meˆmeme´thodologie et les meˆmes questionnaires\r\nont e´te´ utilise´ s dans toutes les zones. Chaque femme enceinte a e´te´ en contact a quatre reprises avec l’e´ quipe\r\nd’enqueˆte : au moment de son enroˆ lement dans l’e´tude, entre la 32e et la 36e semaines d’ame´norrh e´ e, pendant\r\nl’accouchement et 60 jours apres l’accouchement. Des causes obste´tricales directes de morbidite´ grave ont e´te´\r\nobserve´es chez 1215 femmes (6,17 cas pour 100 naissances vivantes). Ce rapport varie sensiblement selon les zones,\r\nde 3,01 % a Bamako a 9,05 % a Saint-Louis. Les principales causes obste´tricales directes de morbidite´ maternelle\r\ngrave sont les suivantes : he´morragie (3,05 pour 100 naissances vivantes) ; dystocie (2,05 %), dont 23 cas de rupture\r\nute´rine (0,12 %) ; hypertension gravidique (0,64 %), dont 38 cas d’e´clampsie (0,19 %) et infection (0,09 %). Les\r\nautres causes obste´tricales directes repre´sentent 12,2 % des cas. Les taux de le´talite´ sont trese´leve´s pour l’infection\r\n(33,3 %), la rupture ute´rine (30,4 %) et l’e´clampsie (18,4 %) ; la le´talite´ lie´e a l’he´ morragie oscille entre 1,9 %\r\n(he´morragie de l’ante-partum et du per-partum) et 3,7 % (de´collement pre´mature´ du placenta). C’est ainsi que 3 a\r\n9 % des femmes enceintes, au moins, requierent des soins obste´tricaux essentiels. Les taux de le´ talite´e´leve´s,\r\nenregistre´s pour plusieurs complications, refletent la mauvaise qualite´ des soins obste´tricaux.\r\nArticle publie´ en anglais dans Bulletin of the World Health Organization, 2000, 78 (5) : 593-602.\r\nIntroduction\r\nL’initiative pour la maternite´ sans risque, lance´e en\r\n1987, avait pour objectif de re´duire de moitie´la\r\nmortalite´ maternelle avant l’an 2000 (1, 2).Une\r\nde´cennie plus tard, les estimations de la mortalite´\r\nmaternelle en Afrique subsaharienne ne faisaient\r\napparaıˆtre aucune ame´lioration alors qu’un groupe\r\nd’experts avait estime´ en 1985 que 88-98 % des de´ces\r\nmaternels pouvaient eˆtre e´vite´s, meˆme dans les\r\nconditions qui caracte´risaient alors la plupart des pays\r\nen de´veloppement (3, 4).\r\nEn Afrique de l’Ouest, certaines donne´es\r\nestimaient a 1020 pour 100 000 naissances vivantes\r\nle nombre des de´ces maternels, soit 38 fois plus que\r\ndans des re´gions plus de´ veloppe´es (3).Bien que leur\r\nutilite´ soit de plus en plus reconnue, les donne´es\r\nconcernant l’incidence et les caracte´ristiques de la\r\nmorbidite´a l’origine de ces taux e´leve´s sont\r\nextreˆmement rares (5-9).Les donne ´es sur la\r\nmorbidite´ sont d’une importance capitale pour les\r\nde´cideurs et les planificateurs sanitaires, qui doivent\r\nsavoir combien de femmes requierent des soins\r\nobste´tricaux essentiels.De plus, en raison de la\r\ncomplexite´ des mesures a effectuer, les taux de\r\nmortalite´ maternelle ne permettent guere d’e´valuer la\r\nre´ussite des programmes (10, 11).Les donne´es sur la\r\nmorbidite´ maternelle sont suppose´es eˆtre de meil-\r\nleurs indicateurs pour la conception, la surveillance, le\r\nsuivi et l’e´valuation des programmes de maternite´\r\nsans risque.Elles proviennent le plus souvent\r\nd’e´tudes hospitalieres et sont, non pas prospectives\r\nmais re´trospectives (5, 12).Il y a peu encore, les\r\nestimations des sche´mas de morbidite´ maternelle\r\ne´taient fonde´es sur une petite e´tude en population\r\nre´alise´e dans une communaute´ rurale en Inde (13).\r\nCes dernieres anne´es, plusieurs e´tudes ont e´te´\r\nconc¸ues pour e´valuer la fiabilite´ des donne´es relatives\r\na la morbidite´ maternelle obtenues au moyen\r\nd’entretiens avec les femmes et les sages-femmes\r\ndans la communaute´(14-17).Il apparaıˆt maintenant\r\nque les re´sultats obtenus a partir des donne´es issues\r\nd’entretiens ne sont pas suffisamment valables\r\n1 Conseiller, Direction de la Planification, de la Coope´ration et de la\r\nStatistique, Ministere de la Sante´ et des Affaires Sociales (Mauritanie).\r\n2 Directeur de Recherches, Institut National de la Sante´etde\r\nla Recherche Me´dicale (INSERM), Unite´ 149, Recherches\r\ne´pide´miologiques en sante´pe´rinatale et sante´ des femmes,\r\n123, bd de Port-Royal, 75014 Paris (France) (me´l. : bouvier-colle\r\n@cochin.inserm.fr). (Correspondance)\r\n3 Ancien Conseiller, Direction de la Sante´ maternelle et infantile,\r\nMiniste`re de la Sante´ et des Affaires sociales, Dakar (Se´ne´gal).\r\n4 Professeur des Universite´s-Praticien Hospitalier, Universite´ Paris VI\r\net Directeur de l’Unite´ 149, Institut National de la Sante´etdela\r\nRecherche Me´dicale (INSERM), Paris (France).\r\n5 Voir p. 136, Remerciements.\r\nRe´f.:99-0351\r\n129Bulletin de l’Organisation mondiale de la Sante´ # Organisation mondiale de la Sante´, 2000\r\nRecueil d’articles No 3, 2000
French.pdf (253.0 KB)

We also need each segment’s StartCharIndex and EndCharIndex to represent a running index over all characters on the page, rather than an index that restarts at each line break. Is this possible? Here’s a snippet of that code.
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
textFragmentAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
textFragmentAbsorber.TextSearchOptions.LimitToPageBounds = true;
var page = pdfDocument.Pages[pageNumber];
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;

This has the same extra-space problem: for example, instead of a single segment with Text=“obste´tricales”, the word is split into 2 segments with Text=“obste” and “tricales”.

So, I am looking for 2 things:

1. Words with an accent should be treated as a single word.
2. Segment StartCharIndex and EndCharIndex should be a running index of the text on the page instead of per line break.

thanks,

imran.rafique · September 12, 2017, 3:51am

@Telsin.Brower,

If you open your PDF in Acrobat and copy this word from the PDF document into Notepad, you will find a white space inside the word. Nevertheless, we have logged an investigation under ticket ID PDFNET-43334 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any available updates.

Please try the following code example:

[C#]

string dataDir = @"C:\Pdf\test291\";
Document pdfDocument = new Document(dataDir + "French.pdf");

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
textFragmentAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
textFragmentAbsorber.TextSearchOptions.LimitToPageBounds = true;

var page = pdfDocument.Pages[1];
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (TextFragment fragment in textFragmentCollection)
{
    foreach (TextSegment segment in fragment.Segments)
    {
        // print both indices; adding the two ints would print their sum
        Console.WriteLine(segment.StartCharIndex + " - " + segment.EndCharIndex);
    }
}

Telsin.Brower · September 12, 2017, 4:51pm

RE: Item 1. A couple things I’ve tried:

1. Opened the document in the browser and did a find for “obste´tricales” (no space); it found 30 occurrences. I then copied the word from the page in the browser to Notepad and I see no space.
2. Opened the document in Acrobat Reader and did a find for “obste´tricales” (no space); it found 0 occurrences. Did it again with “obste´ tricales” (with space, copied from Notepad) and also found 0 occurrences.

Seems to me the behavior I want is number 1. When viewing the document in either the browser or Acrobat reader, my eye tells me there is no space.

RE: Item 2. Don’t understand your point here. I have that code - I just didn’t include the entire code swath in my original report. Here’s what I’m trying to illustrate with this code.

    public void DumpFragments(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            InitializeConverter(); // init license

            // Open document
            var pdfDocument = new Document(stream);

            // get all the words on the page
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
            textFragmentAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
            textFragmentAbsorber.TextSearchOptions.LimitToPageBounds = true;
            var page = pdfDocument.Pages[1];
            page.Accept(textFragmentAbsorber);
            var textFragmentCollection = textFragmentAbsorber.TextFragments;

            var sb = new StringBuilder();
            var fragmentCount = 0;

            foreach (TextFragment fragment in textFragmentCollection)
            {
                fragmentCount++;
                var segmentCount = 0;
                foreach (TextSegment segment in fragment.Segments)
                {
                    segmentCount++;
                    sb.AppendFormat(@"fragment: {3} segment:{4} StartIndex:{0} EndIndex:{1} Text:{2}", segment.StartCharIndex, segment.EndCharIndex, segment.Text, fragmentCount, segmentCount).AppendLine();
                }
            }

            // emit the collected dump so the output can be inspected
            Console.Write(sb.ToString());
        }
    }

Snippet of output:
fragment: 1 segment:1 StartIndex:0 EndIndex:8 Text:Morbidite\r\n
fragment: 1 segment:2 StartIndex:0 EndIndex:0 Text:´\r\n
fragment: 2 segment:1 StartIndex:2 EndIndex:11 Text:maternelle\r\n
fragment: 3 segment:1 StartIndex:13 EndIndex:17 Text:grave\r\n
fragment: 4 segment:1 StartIndex:19 EndIndex:21 Text:par\r\n
fragment: 5 segment:1 StartIndex:23 EndIndex:28 Text:causes\r\n
fragment: 6 segment:1 StartIndex:0 EndIndex:4 Text:obste\r\n
fragment: 6 segment:2 StartIndex:0 EndIndex:0 Text:´\r\n
fragment: 7 segment:1 StartIndex:2 EndIndex:9 Text:tricales\r\n
fragment: 8 segment:1 StartIndex:11 EndIndex:18 Text:directes\r\n
fragment: 9 segment:1 StartIndex:20 EndIndex:21 Text:en\r\n

Notice that the StartIndex of fragment 1, segment 1 is 0. Why? What does this mean?
The StartIndex of fragment 2, segment 1 is 2. I expected 10.
The StartIndex of fragment 3, segment 1 is 13. I expected 21.
And so on…
So what I mean by a running index is that the first character on the page has index 0 and the last character on the page has index Page.Text.Length-1.

imran.rafique · September 13, 2017, 12:40am

@Telsin.Brower,

We have logged details under the same ticket ID PDFNET-43334 in our issue tracking system.

Telsin.Brower:

RE: Item 2. Don’t understand your point here. I have that code - I just didn’t include the entire code swath in my original report. Here’s what I’m trying to illustrate with this code.
public void DumpFragments(byte[] bytes)

We have logged an investigation under the ticket ID PDFNET-43341 in our issue tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.

lawray · February 20, 2021, 7:08am

Support team, could you please update the status of the following issues?
PDFNET-43334
PDFNET-43341

asad.ali · February 21, 2021, 6:16pm

@lawray

The situation described in the above ticket is not a bug. segment.StartCharIndex and segment.EndCharIndex do not give the position of the segment in the line. They give the index of the start and end of the current (logical) segment within the text-showing operator in the contents of the PDF page.
For example, the operator that shows the text of the logical word “grave” actually looks like this:

[(´)-396.1(maternelle)-250.1(grave)-250.5(par)-250.2(causes)]TJ

So the segment ‘grave’ starts at position 13 in the operator text, taking into account that values like ‘-396.1’ each represent a space “character” in the text.
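
The arithmetic behind that position can be reproduced with a minimal sketch. It is not Aspose.Pdf code: the class and method names are hypothetical, it assumes the array contains only simple literal strings (no escaped parentheses, no hex strings), and it assumes every numeric adjustment between strings counts as exactly one space, which is the rule stated above.

```csharp
using System;
using System.Collections.Generic;

public static class TjOperatorIndexSketch
{
    // Hypothetical helper, not an Aspose.Pdf API.
    // Assumptions: only simple literal strings "(...)" without escaped
    // parentheses, and every numeric adjustment counts as one space.
    public static List<(string Text, int Start, int End)> IndexSegments(string tjArray)
    {
        var segments = new List<(string Text, int Start, int End)>();
        var position = 0; // running index inside the operator text
        var i = 0;

        while (i < tjArray.Length)
        {
            var c = tjArray[i];
            if (c == '(')
            {
                var close = tjArray.IndexOf(')', i + 1);
                if (close < 0)
                {
                    break; // unbalanced input, stop parsing
                }
                var text = tjArray.Substring(i + 1, close - i - 1);
                if (text.Length > 0)
                {
                    segments.Add((text, position, position + text.Length - 1));
                    position += text.Length;
                }
                i = close + 1;
            }
            else if (c == '-' || c == '.' || char.IsDigit(c))
            {
                // Skip the whole number and count it as a single space.
                while (i < tjArray.Length &&
                       (tjArray[i] == '-' || tjArray[i] == '.' || char.IsDigit(tjArray[i])))
                {
                    i++;
                }
                position++;
            }
            else
            {
                i++; // brackets, operator name, whitespace
            }
        }
        return segments;
    }
}
```

Passing the operator quoted above to IndexSegments gives ´ at 0 to 0, maternelle at 2 to 11, grave at 13 to 17, par at 19 to 21 and causes at 23 to 28, which matches the StartIndex and EndIndex values in the dump posted earlier in the thread.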

We have no special function to get the character position in a text line, because the PDF specification does not describe text in terms such as “line” or “word”. However, we are considering the possibility of implementing such a function.
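
In the meantime, one possible workaround is to derive running positions from the extracted page text itself rather than from the segments. The sketch below assumes the page text is already available as a string (for example the value of TextAbsorber.Text shown earlier in the thread) and applies the same \S+ pattern used in the thread. RunningIndexSketch and IndexWords are hypothetical names, and the result is plain string offsets, not TextFragment objects.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RunningIndexSketch
{
    // Hypothetical helper, not an Aspose.Pdf API.
    // Returns every run of non-whitespace characters with its 0-based
    // start and end offset in the whole page text.
    public static List<(string Word, int Start, int End)> IndexWords(string pageText)
    {
        var words = new List<(string Word, int Start, int End)>();
        foreach (Match match in Regex.Matches(pageText, @"\S+"))
        {
            words.Add((match.Value, match.Index, match.Index + match.Length - 1));
        }
        return words;
    }
}
```

Because the offsets come from a single string, they keep counting across \r\n instead of restarting. The words are still split wherever the extracted text contains a space, so “obste´ tricales” remains two entries.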

We have investigated and found that Adobe Acrobat also returns the text as “obste´ tricales”. We will try to find a solution but, regretfully, we cannot share an ETA for the ticket resolution at the moment. As soon as additional updates are available, we will let you know.
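
Part of the output can be cleaned up after extraction. The sketch below is a partial workaround, not a fix for the ticket: it assumes, as in the sample above, that each accent is extracted as a separate spacing character directly after its base letter, maps that character to the matching Unicode combining mark, and lets NFC normalization compose the pair. The class name and the mapping table are hypothetical. It does not remove the spurious space in “obste´ tricales”, because the text alone cannot distinguish that space from a real word boundary such as the one in “morbidite´ maternelle”.

```csharp
using System.Collections.Generic;
using System.Text;

public static class AccentComposeSketch
{
    // Spacing accent -> combining mark. Assumed mapping for this sample.
    private static readonly Dictionary<char, char> CombiningMarks = new Dictionary<char, char>
    {
        { '\u00B4', '\u0301' }, // acute
        { '\u0060', '\u0300' }, // grave
        { '\u02C6', '\u0302' }, // circumflex
        { '\u00B8', '\u0327' }, // cedilla
        { '\u00A8', '\u0308' }, // diaeresis
    };

    public static string Compose(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (var i = 0; i < text.Length; i++)
        {
            var c = text[i];
            if (i > 0 && char.IsLetter(text[i - 1]) && CombiningMarks.TryGetValue(c, out var mark))
            {
                // A dotless i (U+0131) carries the accent as a plain i.
                if (sb[sb.Length - 1] == '\u0131')
                {
                    sb[sb.Length - 1] = 'i';
                }
                sb.Append(mark);
            }
            else
            {
                sb.Append(c);
            }
        }
        // NFC turns base letter + combining mark into one precomposed character.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this mapping, AccentComposeSketch.Compose turns “obste´tricales” into “obstétricales” and “conc¸ues” into “conçues”. An accent that does not directly follow a letter is left unchanged.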

We apologize for the inconvenience.