Hello,
I’m testing Aspose.Pdf for .NET (VB.NET).
I’m trying to extract every word along with its position coordinates.
I have the following code and 2 questions:
#########################################################
Dim pdfDocument As Aspose.Pdf.Document
Dim license As Aspose.Pdf.License = New Aspose.Pdf.License()
license.SetLicense("Aspose.Pdf.lic")
license.Embedded = True
pdfDocument = New Aspose.Pdf.Document("c:\test.pdf")
For pageNo As Integer = 1 To pdfDocument.Pages.Count
    ' Collect the text fragments of the current page
    Dim textFragmentAbsorber As New Aspose.Pdf.Text.TextFragmentAbsorber()
    pdfDocument.Pages(pageNo).Accept(textFragmentAbsorber)
    Dim textFragmentCollection As Aspose.Pdf.Text.TextFragmentCollection = textFragmentAbsorber.TextFragments
    For Each textFragment As Aspose.Pdf.Text.TextFragment In textFragmentCollection
        For Each textSegment As Aspose.Pdf.Text.TextSegment In textFragment.Segments
            MsgBox("Word=" & textSegment.Text & vbCrLf & "Position=" & textSegment.Position.XIndent & "," & textSegment.Position.YIndent)
        Next textSegment
    Next textFragment
Next pageNo
pdfDocument.Dispose()
#########################################################
But it doesn’t return every word separately. For instance, it returns “Hello World” as a single item.
1) I would like it to return “Hello” first and then “World”.
2) I need the coordinates of every word, and its width and height too.
Thanks.
Toni Jiménez
Hi Toni,
Thanks for using our products.
I have tested the scenario and was able to reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-34627. We will investigate this issue in detail and will keep you updated on the status of a fix.
We apologize for the inconvenience.
Hi Toni,
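Furthermore, to extract every word separately, search with a regular expression that matches each run of non-whitespace characters, and pass true to TextSearchOptions so that the search phrase is treated as a regular expression: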
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
The full code sample is given below. Please note that the following code snippet works correctly with version 7.9.0.
[C#]
// open document
Document pdfDocument = new Document("c:/pdftest/Arabic_Farsi_Text_in_TableCell.pdf");
// to search for a fixed phrase instead, use: new TextFragmentAbsorber("Sample");
// match every run of non-whitespace characters, i.e. every word
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
// accept the absorber for the first page only (page numbering starts at 1)
pdfDocument.Pages[1].Accept(textFragmentAbsorber);
// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// loop through the fragments and their segments
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        Console.WriteLine("Text : {0} ", textSegment.Text);
        Console.WriteLine("Position : {0} ", textSegment.Position);
        Console.WriteLine("XIndent : {0} ", textSegment.Position.XIndent);
        Console.WriteLine("YIndent : {0} ", textSegment.Position.YIndent);
    }
}
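To see what the [\S]+ pattern matches on its own, independently of Aspose.Pdf, here is a minimal sketch that uses only the standard System.Text.RegularExpressions classes. The class name WordSplitDemo and the method SplitWords are hypothetical names chosen for this illustration; they are not part of the Aspose API. It only shows how the pattern splits a line of text and does not reproduce how the absorber locates text on a page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only, not part of Aspose.Pdf.
public static class WordSplitDemo
{
    // Returns every maximal run of non-whitespace characters in the text.
    public static List<string> SplitWords(string text)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(text, @"[\S]+"))
        {
            words.Add(match.Value);
        }
        return words;
    }

    public static void Main()
    {
        // Prints "Hello" and then "World", each on its own line.
        foreach (string word in SplitWords("Hello World"))
        {
            Console.WriteLine(word);
        }
    }
}
```

Because the pattern matches runs of non-whitespace characters, punctuation attached to a word (for example "World!") stays part of that word. A stricter pattern would be needed if punctuation should be separated.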
The issues you have found earlier (filed as PDFNEWNET-34627) have been fixed in Aspose.Pdf for .NET 7.9.0.