Extract PDF to Text file

sanjaybk · April 13, 2022, 4:05pm

Hi there,
I have attached a PDF page where the paragraphs are laid out in columns. When I try to extract text using Aspose.PDF, it does not read the actual paragraphs. Instead, it reads horizontally across the columns, separating them with tab spaces.

Can you please help me with custom code to extract texts from PDFs when paragraphs are formatted in columns?

I’m looking for the paragraphs below.

44% of Amazon shareholders supported the Comptroller’s proposal for an independent review of the company’s civil rights, equity, diversity, and inclusion policies– a strong outcome for a first-time proposal.

Hilton Worldwide, Qorvo and Lowe’s agreed to disclose workforce diversity reports to make it easier for investors to weigh the companies’ commitments to racial inclusion.

McDonald’s agreed to disclose workforce diversity data and tie executive compensation to improving diversity and to creating an employee culture of inclusion.
etc…

Page10.pdf (93.7 KB)

asad.ali · April 13, 2022, 9:50pm

@sanjaybk

We tried the code snippet below, which extracts paragraphs running over multiple columns, but could not get the expected output. We have therefore logged an issue, PDFNET-51638, in our issue management system to investigate the feasibility further. We will look into the details and keep you posted on its status. Please be patient and give us some time.

We apologize for the inconvenience.

using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the attached PDF
Document pdfDocument = new Document(dataDir + "Page10.pdf");
// Instantiate ParagraphAbsorber and let it analyze the whole document
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(pdfDocument);

foreach (PageMarkup markup in absorber.PageMarkups)
{
    int i = 1; // section counter on the current page
    foreach (MarkupSection section in markup.Sections)
    {
        int j = 1; // paragraph counter within the current section
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            StringBuilder paragraphText = new StringBuilder();
            string ptext = paragraph.Text; // this line gives text blocks
            // Rebuild the paragraph line by line from its text fragments
            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment fragment in line)
                {
                    paragraphText.Append(fragment.Text);
                }
                paragraphText.Append("\r\n");
            }
            paragraphText.Append("\r\n");

            Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
            Console.WriteLine(paragraphText.ToString());

            j++;
        }
        i++;
    }
}
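
As a possible interim idea while the ticket is open, here is a minimal sketch of column-aware reading order. It does not call the Aspose.PDF API: `Fragment`, `ColumnReader` and `columnStarts` are hypothetical names introduced only for illustration. It assumes you can obtain each text fragment's position (for example from `TextFragment.Position`) and that you know roughly where each column starts on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; not an Aspose.PDF type.
public record Fragment(double X, double Y, string Text);

public static class ColumnReader
{
    // Assigns each fragment to a column by its X coordinate, then reads each
    // column top to bottom. PDF Y coordinates grow upward, so Y is sorted descending.
    public static List<string> ReadColumns(IEnumerable<Fragment> fragments, double[] columnStarts)
    {
        double[] starts = columnStarts.OrderBy(s => s).ToArray();
        return fragments
            .GroupBy(f => ColumnIndex(f.X, starts))
            .OrderBy(g => g.Key)
            .Select(g => string.Join(" ",
                g.OrderByDescending(f => f.Y).ThenBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }

    // Index of the last column whose left edge is at or before x
    // (fragments left of the first edge fall into column 0).
    private static int ColumnIndex(double x, double[] starts)
    {
        int index = 0;
        for (int i = 0; i < starts.Length; i++)
        {
            if (x >= starts[i]) index = i;
        }
        return index;
    }
}
```

Each string in the returned list holds the text of one column, left to right. Splitting a column into paragraphs (for example by looking for larger vertical gaps between lines) is left out of this sketch.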

sanjaybk · April 19, 2022, 4:57pm

@asad.ali, Thanks for looking into my request.

May I know if there has been any progress on finding a solution for extracting paragraphs that run over multiple columns?

asad.ali · April 19, 2022, 9:10pm

@sanjaybk

The ticket was logged only recently in our issue management system and, we are afraid, is still pending review. Please note that under the free support policy, it will be analyzed and resolved on a first-come, first-served basis. We will inform you as soon as we have a definite update about its resolution. Please give us some time.

We are sorry for the inconvenience.

sanjaybk · April 29, 2022, 3:42pm

@asad.ali,

Thank you. I will wait. In the meantime, can you please help me extract the table from the attached PDF into a C# DataTable object?

I’m facing the following issues:

It is difficult to identify the first row in order to create the DataTable headers.
In the second row, the object sometimes contains only a few of the cell values; the others are missing.

I’m looking for a C# DataTable object with the columns below:
Column1 - GICS
Column2 - Description
Column3 - Mean
Column4 - Standard Deviation
Column5 - Burn Rate Benchmark*

All other values should then go into DataTable Row objects.

Thank you!

Table_extract_to_DataTable.pdf (148.8 KB)

asad.ali · April 29, 2022, 11:08pm

@sanjaybk

We are checking it and will get back to you shortly.

asad.ali · May 16, 2022, 8:29pm

@sanjaybk

We tried to extract the table values using the code snippet below and found that the API was unable to extract any data from the existing table in the PDF document:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the attached PDF
Document pdfDocument = new Document(dataDir + "Table_extract_to_DataTable.pdf");
foreach (Page page in pdfDocument.Pages)
{
    // Look for tables on the current page
    TableAbsorber absorber = new TableAbsorber();
    absorber.Visit(page);
    foreach (AbsorbedTable table in absorber.TableList)
    {
        foreach (AbsorbedRow row in table.RowList)
        {
            foreach (AbsorbedCell cell in row.CellList)
            {
                // Print the text of each fragment found in the cell
                // (the fragment itself, not a newly created empty TextFragment)
                foreach (TextFragment fragment in cell.TextFragments)
                {
                    Console.WriteLine(fragment.Text);
                }
                Console.WriteLine("Cell");
            }
            Console.WriteLine("Row");
        }
    }
}

Could you please share the code snippet with which you were able to extract some of the values? We will then proceed to assist you accordingly.
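
For the DataTable side of the request, the following is a minimal sketch that is independent of how the cell text is obtained. It assumes the cell values have already been extracted as rows of strings (data rows only, without the header row), and it hard-codes the five column names requested earlier in this thread. `TableBuilder` and `Build` are hypothetical names used only for illustration. Rows that come back with fewer values than columns are padded with `DBNull` rather than rejected, since missing values were one of the reported issues.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class TableBuilder
{
    // Column names taken from the request in this thread.
    private static readonly string[] Headers =
    {
        "GICS", "Description", "Mean", "Standard Deviation", "Burn Rate Benchmark*"
    };

    // Builds a DataTable from rows of already-extracted cell text.
    // Short rows are padded with DBNull; cells beyond the last header are ignored.
    public static DataTable Build(IEnumerable<IReadOnlyList<string>> rows)
    {
        var table = new DataTable();
        foreach (string header in Headers)
        {
            table.Columns.Add(header, typeof(string));
        }

        foreach (IReadOnlyList<string> cells in rows)
        {
            DataRow row = table.NewRow();
            for (int c = 0; c < Headers.Length; c++)
            {
                row[c] = c < cells.Count ? (object)cells[c] : DBNull.Value;
            }
            table.Rows.Add(row);
        }
        return table;
    }
}
```

Fixing the headers up front avoids having to detect the header row in the extracted output, at the cost of tying the code to this particular table layout.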