How to extract every word of text and its position coordinates

Hello,

I’m testing Aspose.Pdf for .NET (VB.NET).
I’m trying to extract every word together with its position coordinates.
I have the following code and two questions:
#########################################################
Dim pdfDocument As Aspose.Pdf.Document
Dim license As Aspose.Pdf.License = New Aspose.Pdf.License()
license.SetLicense("Aspose.Pdf.lic")
license.Embedded = True
pdfDocument = New Aspose.Pdf.Document("c:\test.pdf")
' Pages are 1-based
For pageNo As Integer = 1 To pdfDocument.Pages.Count
    ' No search pattern: the absorber collects all text on the page
    Dim textFragmentAbsorber As New Aspose.Pdf.Text.TextFragmentAbsorber()
    pdfDocument.Pages(pageNo).Accept(textFragmentAbsorber)
    Dim textFragmentCollection As Aspose.Pdf.Text.TextFragmentCollection = textFragmentAbsorber.TextFragments
    For Each textFragment As Aspose.Pdf.Text.TextFragment In textFragmentCollection
        For Each textSegment As Aspose.Pdf.Text.TextSegment In textFragment.Segments
            MsgBox("Word=" & textSegment.Text & vbCrLf & "Position=" & textSegment.Position.XIndent & "," & textSegment.Position.YIndent)
        Next textSegment
    Next textFragment
Next pageNo
pdfDocument.Dispose()
#########################################################
But it doesn’t return every word separately. It returns “Hello World” as a single segment (for instance).
1) I would like it to return “Hello” first and then “World”.
2) I need the coordinates of every word, and its width and height too.

Thanks.

Toni Jiménez

Hi Toni,


Thanks for using our products.

I have tested the scenario and am able to reproduce the problem. So that it can be corrected, I have logged it in our issue tracking system as PDFNEWNET-34627. We will investigate this issue in detail and will keep you updated on the status of the correction.

We apologize for the inconvenience.

Hi Toni,


Thanks for your patience.

We have further investigated the issue PDFNEWNET-34627, and I am pleased to share that we have been able to resolve the problem. The fix will be included in the upcoming release, Aspose.Pdf for .NET 7.9.0.

Please note that the following code snippet returns valid physical text segments as they are defined in the PDF page contents.

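Furthermore, in order to extract every word, a regular expression should be used as the search pattern, with regular expression search enabled through TextSearchOptions(true), for example: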
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
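
To illustrate what that pattern matches, here is a minimal sketch that applies the same expression to a plain string with .NET's own Regex class. It does not use Aspose.Pdf at all. WordPatternDemo and FindWords are hypothetical names chosen for this example, and the offsets it reports are character positions within the string, not PDF coordinates.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordPatternDemo
{
    // Hypothetical helper (not part of Aspose.Pdf): returns every run of
    // non-whitespace characters together with its character offset.
    public static List<KeyValuePair<string, int>> FindWords(string text)
    {
        var words = new List<KeyValuePair<string, int>>();
        foreach (Match match in Regex.Matches(text, @"[\S]+"))
        {
            words.Add(new KeyValuePair<string, int>(match.Value, match.Index));
        }
        return words;
    }

    public static void Main()
    {
        // "Hello World" is split at the space into two separate matches.
        foreach (var word in FindWords("Hello World"))
        {
            Console.WriteLine("Text : {0}  Offset : {1}", word.Key, word.Value);
        }
    }
}
```

Because the pattern matches any run of non-whitespace characters, punctuation stays attached to the neighbouring word: "World," would come back as one match.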

The full code sample is given below. Please note that the following code snippet will work correctly with version 7.9.0.


[C#]

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// open document
Document pdfDocument = new Document("c:/pdftest/Arabic_Farsi_Text_in_TableCell.pdf");

// alternative: search for a literal string instead of a pattern
//TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Sample");

// match every run of non-whitespace characters; "true" enables regular expression search
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));

// accept the absorber for the first page (pages are 1-based)
pdfDocument.Pages[1].Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// loop through the fragments and their segments
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        Console.WriteLine("Text : {0} ", textSegment.Text);
        Console.WriteLine("Position : {0} ", textSegment.Position);
        Console.WriteLine("XIndent : {0} ", textSegment.Position.XIndent);
        Console.WriteLine("YIndent : {0} ", textSegment.Position.YIndent);
    }
}

The issues you have found earlier (filed as PDFNEWNET-34627) have been fixed in Aspose.Pdf for .NET 7.9.0.

