Document extraction issue: PDF text is extracted as gibberish

sterlingvdrdevteam · November 22, 2023, 10:41am

Hi,

We are using the Aspose.PDF page extraction approach, but the method below extracts the text as gibberish. Please look into this as a priority issue.

public PageContent Extract(Aspose.Pdf.Page page)
{
    var pageRect = page.GetPageRect(false);
    var size = new Size(pageRect.Width, pageRect.Height);
    var position = new Position(pageRect.LLX, pageRect.LLY);
    var pageRotation = (int)page.Rotate;

    var pageContent = new PageContent(page.Number, size, position, pageRotation);
    page.Rotate = Rotation.None;

    var tables = _tableCellExtractor.ExtractTables(page).ToList();

    var paragraphs = _paragraphExtractor.ExtractParagraphs(page, tables);
    foreach (var paragraph in paragraphs)
    {
        pageContent.AddParagraph(paragraph);
    }

    foreach (var tableCell in tables.Where(t => t.HasText).SelectMany(t => t.Rows.SelectMany(r => r.Cells)))
    {
        pageContent.AddCell(tableCell.CellArea);
    }

    return pageContent;
}

public IEnumerable<Table> ExtractTables(Page page)
{
    var absorber = new Aspose.Pdf.Text.TableAbsorber();
    absorber.Visit(page);
    return absorber.TableList.Select(ExtractTable);
}

private Table ExtractTable(AbsorbedTable table)
{
    var tableRectangle = Rectangle.FromAsposeRectangle(table.Rectangle);
    var rows = table.RowList.Select(ExtractRow).ToList();
    return new Table()
    {
        TableRectangle = tableRectangle,
        Rows = rows
    };
}

public IEnumerable<TextArea> ExtractParagraphs(Aspose.Pdf.Page page, List<Table> tables)
{
    var absorber = new ParagraphAbsorber();
    absorber.Visit(page);
    var markup = absorber.PageMarkups[0];
    var markupParagraphs = markup.Sections.SelectMany(s => s.Paragraphs);

    return markupParagraphs.SelectMany(p => PrepareParagraph(p, tables));
}

private IEnumerable<TextArea> PrepareParagraph(MarkupParagraph markupParagraph, List<Table> tables)
{
    var textAreas = ExtractTextAreas(markupParagraph, tables).ToList();
    ApplyWhiteSpaces(markupParagraph, textAreas);
    return textAreas;
}

We have attached the sample document:
SampleDocs_AsposeIssue.pdf (25.6 KB)

asad.ali · November 22, 2023, 7:27pm

@sterlingvdrdevteam

The code snippet you shared references classes that are undefined or missing. Could you please share a minimal code snippet that reproduces the issue? You can also share a sample console application in .zip format, which would help us replicate the issue in our environment and address it accordingly.

PS: Please make sure that you are using version 23.11 of the API and that all required fonts are installed on the system.

sterlingvdrdevteam · November 24, 2023, 9:24am

Hi,

Thanks for replying. I have created a console application that extracts the text from the document.
You can use the attached sample document and debug the code. You will see that even the first line of the document is extracted incorrectly. The console application is zipped and uploaded to the drive. Please download it from the link.

asad.ali · November 24, 2023, 6:48pm

@sterlingvdrdevteam

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55998

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.