How can I extract text by text block like it's displayed in Acrobat DC?

Ragnarokkr.Xia · August 23, 2021, 9:45am

I opened a PDF document in Acrobat DC and switched to edit mode, where I saw text blocks surrounded by a light gray border.
How can I extract the text and group it by text block?
Here is a sample, which is simply a piece of TWAIN documentation:
43Page_350K_TEXT.pdf (354.6 KB)
And here is a screenshot taken in edit mode:
image.png (56.3 KB)

asad.ali · August 23, 2021, 5:53pm

@Ragnarokkr.Xia

Could you please share the sample PDF along with a screenshot of the text blocks as shown in Acrobat? We will try to produce the expected output using the API and share our feedback with you.

Ragnarokkr.Xia · August 24, 2021, 2:07am

@asad.ali
I have edited my question; the sample is now included. Thank you for paying attention to my question!

asad.ali · August 24, 2021, 8:38pm

@Ragnarokkr.Xia

We used the code snippet below to extract text from your PDF and found that the API does extract text blocks. However, they are not the same blocks as shown in your image: the API returns smaller chunks of text. Please check it and let us know whether this suits you. Otherwise, we will investigate your requirements further:

using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the path of the folder that contains the PDF
Document pdfDocument = new Document(dataDir + "43Page_350K_TEXT.pdf");
// Instantiate ParagraphAbsorber and let it analyze the whole document
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(pdfDocument);

foreach (PageMarkup markup in absorber.PageMarkups)
{
    int i = 1; // section counter, restarted on every page
    foreach (MarkupSection section in markup.Sections)
    {
        int j = 1; // paragraph counter, restarted in every section

        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            StringBuilder paragraphText = new StringBuilder();
            string ptext = paragraph.Text; // the whole text block as a single string
            // Alternatively, rebuild the block line by line from its fragments
            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment fragment in line)
                {
                    paragraphText.Append(fragment.Text);
                }
                paragraphText.Append("\r\n");
            }
            paragraphText.Append("\r\n");

            Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
            Console.WriteLine(paragraphText.ToString());

            j++;
        }
        i++;
    }
}
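
If the paragraphs returned above turn out to be smaller than the blocks Acrobat shows, one possible post-processing step is to merge paragraphs whose bounding boxes nearly touch. The following is only a minimal sketch under stated assumptions, not part of the Aspose.PDF API: Block, BlockMerger, Merge and maxGap are hypothetical names, the coordinates are made up, and in practice the boxes would have to be filled from whatever coordinates the API reports for each paragraph. The merge is a simple greedy pass, so it may not reproduce the blocks Acrobat draws.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a paragraph's text plus its bounding box.
// PDF coordinates are assumed: origin at bottom-left, so Top > Bottom.
public sealed class Block
{
    public double Left, Bottom, Right, Top;
    public string Text;

    public Block(double left, double bottom, double right, double top, string text)
    {
        Left = left; Bottom = bottom; Right = right; Top = top; Text = text;
    }
}

public static class BlockMerger
{
    // True when the boxes overlap horizontally and the vertical gap
    // between them is at most maxGap (negative gap = they overlap).
    static bool AreNeighbours(Block a, Block b, double maxGap)
    {
        bool overlapX = a.Left <= b.Right && b.Left <= a.Right;
        double gapY = Math.Max(a.Bottom, b.Bottom) - Math.Min(a.Top, b.Top);
        return overlapX && gapY <= maxGap;
    }

    // Greedy merge: walk the paragraphs from the top of the page down and
    // attach each one to the first already-merged block it neighbours.
    public static List<Block> Merge(IEnumerable<Block> paragraphs, double maxGap)
    {
        var sorted = new List<Block>(paragraphs);
        sorted.Sort((a, b) => b.Top.CompareTo(a.Top)); // highest on the page first

        var merged = new List<Block>();
        foreach (Block p in sorted)
        {
            Block target = merged.Find(m => AreNeighbours(m, p, maxGap));
            if (target == null)
            {
                // Start a new block (copy, so the input is left untouched)
                merged.Add(new Block(p.Left, p.Bottom, p.Right, p.Top, p.Text));
                continue;
            }
            // Grow the existing block to cover the paragraph and append its text
            target.Left = Math.Min(target.Left, p.Left);
            target.Bottom = Math.Min(target.Bottom, p.Bottom);
            target.Right = Math.Max(target.Right, p.Right);
            target.Top = Math.Max(target.Top, p.Top);
            target.Text += "\r\n" + p.Text;
        }
        return merged;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample: two stacked paragraphs in one column, one in another
        var paragraphs = new List<Block>
        {
            new Block(50, 700, 200, 720, "Column 1, line 1"),
            new Block(50, 676, 200, 696, "Column 1, line 2"),
            new Block(250, 700, 400, 720, "Column 2, line 1"),
        };

        foreach (Block b in BlockMerger.Merge(paragraphs, 6.0))
        {
            Console.WriteLine("[{0}]", b.Text.Replace("\r\n", " | "));
        }
    }
}
```

With the sample data, the two stacked paragraphs of the first column end up in one block while the second column stays separate, because the boxes do not overlap horizontally. The value of maxGap would need tuning per document.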

Ragnarokkr.Xia · August 25, 2021, 8:24am

@asad.ali
Actually, my requirement is to extract text that is separated into three text blocks, laid out as three columns in a row.
The sample I’m working on is confidential, so I cannot upload it directly in this thread.
Is there a way I can send the sample to you privately?

asad.ali · August 25, 2021, 8:50pm

@Ragnarokkr.Xia

We have sent you a private message, which you can find in your inbox. Please share your sample in reply to that message.

asad.ali · August 26, 2021, 9:49pm

@Ragnarokkr.Xia

Thanks for sharing the sample files and other information in the private message.

Could you please also share the code snippet that you used to extract data from these PDFs, along with screenshots of the results you obtained with the API on your side? We will investigate the scenario further and share our feedback with you.

Ragnarokkr.Xia · August 27, 2021, 5:05am

@asad.ali
Here is the code snippet I used to extract the data. Formatting code in a post is painful, so I just uploaded the code file. I’m not allowed to upload a .cs file directly, so I had to zip it.
code.zip (1.4 KB)
The screenshot of the result will be sent via private message.

Thanks a lot !

asad.ali · August 27, 2021, 6:27pm

@Ragnarokkr.Xia

We have checked the details shared in the private message and tested the scenario in our environment. We were able to observe similar API behavior on our side. An investigation ticket, PDFNET-50451, has been logged in our issue tracking system for further analysis. We will look into the details and keep you posted on the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

Ragnarokkr.Xia · August 29, 2021, 8:22am

@asad.ali
Thank you, I am looking forward to the result.