Hi all,
We are currently trying to use Aspose.PDF to extract text from a PDF file on a line-by-line, or paragraph-by-paragraph, basis.
However, we discovered that extracting the text fragments sometimes yields multiple fragments for a single line of text.
For example, the line "Op dit moment ondervind ik veel geluidsoverlast van het bedrijf Jimmie’s Pizza gevestigd aan de " results in these text fragments:
- Op dit moment ondervind ik veel geluidsoverlast van het bedrijf
- Jimmie
- ’
- s Pizza
- gevestigd aan de
We have tried multiple approaches to extract text line by line, using both the TextFragmentAbsorber and the ParagraphAbsorber, but both yield the same result.
Is there an alternative method we can use to meet this requirement?
The test code I used with the TextFragmentAbsorber:
using System;
using System.IO;

byte[] pdfFile = File.ReadAllBytes(@"Voorbeeld brief maskeren.pdf");
// Wrap the byte array in a MemoryStream so it can be processed
MemoryStream payloadStream = new MemoryStream(pdfFile);
// Load the document in Aspose
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(payloadStream);
// Retrieve all the text fragments inside the PDF
Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
pdfDocument.Pages.Accept(absorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = absorber.TextFragments;
int lNr = 0;
foreach (Aspose.Pdf.Text.TextFragment textFrag in textFragmentCollection)
{
    // Print each fragment's text with a running index
    string textFragment = textFrag.Text;
    Console.WriteLine("Line nr: " + lNr + " text: " + textFragment);
    lNr++;
}
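One workaround we could imagine, offered only as a sketch and not as a confirmed Aspose feature, is to merge the fragments ourselves by grouping those that share (nearly) the same baseline on a page. The Frag record, the LineMerger class, the tolerance value and the coordinates below are all hypothetical. With Aspose, the X/Y values would presumably be taken from each TextFragment's position, which is an assumption on our side. The sketch also assumes that whitespace between words is preserved inside the fragment text, which we have not verified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up coordinates for the fragments of the example line, deliberately out of order.
var frags = new[]
{
    new Frag("Jimmie", 1, 380, 700),
    new Frag("Op dit moment ondervind ik veel geluidsoverlast van het bedrijf ", 1, 72, 700),
    new Frag("’", 1, 410, 700.4),
    new Frag("s Pizza ", 1, 414, 700),
    new Frag("gevestigd aan de ", 1, 455, 700),
};

foreach (var line in LineMerger.MergeIntoLines(frags))
    Console.WriteLine(line);

// Hypothetical stand-in for a text fragment: its text, page number and
// the X/Y position of its baseline. This is not an Aspose type.
public record Frag(string Text, int Page, double X, double Y);

public static class LineMerger
{
    // Groups fragments on the same page whose Y positions lie within
    // `tolerance` of the first fragment in the group, then joins each
    // group from left to right.
    public static List<string> MergeIntoLines(IEnumerable<Frag> frags, double tolerance = 2.0)
    {
        var lines = new List<List<Frag>>();
        // PDF Y coordinates grow upwards, so sort descending to read top to bottom.
        foreach (var f in frags.OrderBy(f => f.Page).ThenByDescending(f => f.Y))
        {
            var last = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (last != null && last[0].Page == f.Page && Math.Abs(last[0].Y - f.Y) <= tolerance)
                last.Add(f);
            else
                lines.Add(new List<Frag> { f });
        }
        // Concatenate without a separator; spaces must already be in the fragment text.
        return lines
            .Select(l => string.Concat(l.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

If the fragments turn out not to carry their own spacing, the join step would need a gap check between neighbouring fragments instead of plain concatenation.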