Sentences are broken mid-word into multiple TextFragments

mjanulaitis · March 26, 2019, 10:38pm

I am scanning a document only to find some sentences broken into multiple TextFragments which totally messes up my parsing. I need each title, sub-title and paragraph to be contained in a single TextFragment. Is this possible?

asad.ali · March 26, 2019, 10:43pm

@mjanulaitis

The TextFragmentAbsorber class always returns a collection of text fragments that mirrors how the text was added to the PDF document. However, if you want to extract the complete text of the PDF document as a single string, you can use TextAbsorber as follows:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(dataDir + "input.pdf");
Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
string extractedtext = textAbsorber.Text;

mjanulaitis · March 26, 2019, 11:18pm

I have no interest in pulling all the text. I have words that are broken in the middle. How could they have been added like that to the original document? Is there a way to prove that the user added a sentence broken in the middle of a word? I’m having a difficult time accepting that as an answer.

asad.ali · March 26, 2019, 11:21pm

@mjanulaitis

We apologize for the inconvenience.

Would you please share your sample PDF document with us, along with details of the words you want to extract or replace using the Aspose.PDF API? We will test the scenario in our environment and address it accordingly.

mjanulaitis · March 27, 2019, 12:37am

Here is the file. Search for ‘or the vendor is a controller or a processor subject’. When scanning the document the word ‘subject’ is broken between fragments as ‘…subjec’ then ‘t…’

Edited1010175Guidance on international data trans_71478527.pdf (82.2 KB)

asad.ali · March 27, 2019, 9:54am

@mjanulaitis

We have tested the scenario using Aspose.PDF for .NET 19.3 and the following code snippet. We were unable to reproduce the broken word, i.e. ‘subject’. For your reference, we have also attached a screenshot of the console output.

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Search for the phrase reported as broken; 'true' enables regular-expression matching.
var textFragmentAbsorber = new TextFragmentAbsorber("or the vendor is a controller or a processor subject");
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

// Load the shared document and run the absorber over all pages.
Document pdfDocument = new Document(dataDir + "Edited1010175Guidance on international data trans_71478527.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber);

// Print every segment of every matching fragment.
var textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment segment in textFragment.Segments)
    {
        Console.WriteLine(segment.Text);
    }
}

FoundSentence.png (1001 Bytes)

Would you please try the latest version of the API, i.e. Aspose.PDF for .NET 19.3? If the issue still persists, please share your complete sample code snippet with us. We will test the scenario in our environment and address it accordingly.
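
For readers whose parsing needs whole lines rather than raw fragments, one possible workaround is to stitch adjacent fragments back together by position. The following is only a minimal sketch, not an Aspose API: the Frag record and MergeLines helper are hypothetical, and the sketch assumes you have already copied each fragment's text, left edge, baseline, and width out of the absorber results (for example from TextFragment.Text, Position and Rectangle). The tolerance values are guesses that would need tuning per document. It requires C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for values read from each text fragment; not an Aspose type.
record Frag(string Text, double X, double Y, double Width);

static class FragmentMerger
{
    // Groups fragments that share a baseline (within yTol) and joins them left to right.
    // A space is inserted only when the horizontal gap between neighbours exceeds gapTol.
    public static List<string> MergeLines(IEnumerable<Frag> frags, double yTol = 1.0, double gapTol = 1.0)
    {
        // Bucket fragments into lines, top of the page first (PDF y grows upward).
        var lines = new List<List<Frag>>();
        foreach (var f in frags.OrderByDescending(f => f.Y))
        {
            var last = lines.LastOrDefault();
            if (last != null && Math.Abs(last[0].Y - f.Y) <= yTol)
                last.Add(f);
            else
                lines.Add(new List<Frag> { f });
        }

        // Within each line, concatenate fragments in reading order.
        var result = new List<string>();
        foreach (var line in lines)
        {
            var sb = new StringBuilder();
            Frag prev = null;
            foreach (var f in line.OrderBy(f => f.X))
            {
                if (prev != null && f.X - (prev.X + prev.Width) > gapTol)
                    sb.Append(' ');
                sb.Append(f.Text);
                prev = f;
            }
            result.Add(sb.ToString());
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // Made-up coordinates imitating the reported split: "...subjec" followed by "t...".
        var frags = new[]
        {
            new Frag("t to the GDPR", 172, 700, 60),
            new Frag("or a processor subjec", 72, 700, 100),
            new Frag("Next paragraph", 72, 680, 80),
        };

        foreach (var line in FragmentMerger.MergeLines(frags))
            Console.WriteLine(line);
        // prints:
        // or a processor subject to the GDPR
        // Next paragraph
    }
}
```

This merges only within a line; grouping lines into paragraphs or titles would need extra heuristics (font size, line spacing) that depend on the document.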