How can I extract text by text block like it's displayed in Acrobat DC?

I opened a PDF document in Acrobat DC and switched to edit mode, where I saw text blocks surrounded by light gray borders.
How can I extract the text and group it by these text blocks?
Here is a sample, which is simply a piece of TWAIN documentation:
43Page_350K_TEXT.pdf (354.6 KB)
And here is a screenshot taken in edit mode:
image.png (56.3 KB)

@Ragnarokkr.Xia

Could you please share the sample PDF along with a screenshot of the text blocks shown in Adobe Reader? We will try to produce the expected output using the API and share our feedback with you.

@asad.ali
I have edited my question; the sample is now included. Thank you for paying attention to my question!

@Ragnarokkr.Xia

We used the code snippet below to extract text from your PDF and found that the API does extract text blocks. However, they are not the same blocks as shown in the image: the API extracts smaller chunks of text. Please check whether this suits you. Otherwise, we will investigate your requirements further:

using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the path of the folder containing the sample PDF
Document pdfDocument = new Document(dataDir + "43Page_350K_TEXT.pdf");
// Instantiate ParagraphAbsorber and let it analyze every page
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(pdfDocument);

foreach (PageMarkup markup in absorber.PageMarkups)
{
    int i = 1; // section counter, restarted on each page
    foreach (MarkupSection section in markup.Sections)
    {
        int j = 1; // paragraph counter, restarted in each section

        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            StringBuilder paragraphText = new StringBuilder();
            string ptext = paragraph.Text; // this line gives text blocks
            // Alternatively, rebuild the block line by line from its fragments
            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment fragment in line)
                {
                    paragraphText.Append(fragment.Text);
                }
                paragraphText.Append("\r\n");
            }
            paragraphText.Append("\r\n");

            Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
            Console.WriteLine(paragraphText.ToString());

            j++;
        }
        i++;
    }
}
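
If the chunks returned above are smaller than the blocks you expect, one possible workaround is to regroup them by position. The following is only a minimal sketch, not part of the API: the Chunk record, the ColumnGrouper class and the gap threshold are hypothetical names introduced for illustration, and it assumes you have already copied each chunk's text and bounding box (for example from each TextFragment's Rectangle) into plain values. It does not reproduce Acrobat's own block detection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk type: text plus its bounding box in PDF user space
// (origin at bottom-left, so a larger Top means higher on the page).
public record Chunk(string Text, double Left, double Right, double Top);

public static class ColumnGrouper
{
    // Groups chunks into columns. A chunk starts a new column when its left
    // edge lies more than `gap` units to the right of the current column's
    // right edge. The default gap is an assumption and needs tuning per layout.
    public static List<List<Chunk>> GroupIntoColumns(IEnumerable<Chunk> chunks, double gap = 10.0)
    {
        var columns = new List<List<Chunk>>();
        double columnRight = double.NegativeInfinity;

        // Sweep from left to right across the page.
        foreach (var chunk in chunks.OrderBy(c => c.Left))
        {
            if (columns.Count == 0 || chunk.Left > columnRight + gap)
            {
                columns.Add(new List<Chunk>());
                columnRight = chunk.Right;
            }
            else
            {
                columnRight = Math.Max(columnRight, chunk.Right);
            }
            columns[^1].Add(chunk);
        }

        // Within each column, order the chunks from top to bottom.
        return columns
            .Select(col => col.OrderByDescending(c => c.Top).ToList())
            .ToList();
    }
}
```

Calling ColumnGrouper.GroupIntoColumns on the chunks of one row returns one list per column, ordered left to right, with each column's chunks ordered top to bottom. A sweep on the left edge is used because it needs no prior knowledge of the number of columns; it will merge columns whose horizontal spacing is smaller than the chosen gap.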

@asad.ali
Actually, I need to extract text that is separated into three text blocks, arranged as three columns in a row.
The sample I’m working on is confidential, so I cannot upload it directly in this thread.
Is there any way I can send the sample to you privately?

@Ragnarokkr.Xia

We have sent you a private message, which you can check in your inbox. Please share your sample in reply to that private message.

@Ragnarokkr.Xia

Thanks for sharing the sample files and other information in the private message.

Could you please also share the code snippet you used to extract data from these PDFs, along with screenshots of the results obtained using the API on your side? We will investigate the scenario further and share our feedback with you.

@asad.ali
Here is the code snippet I used to extract data. Formatting code in a post is painful, so I just uploaded the code file. I’m not allowed to upload .cs files directly, so I had to zip it.
code.zip (1.4 KB)
The screenshot of the result will be sent via private message.

Thanks a lot!

@Ragnarokkr.Xia

We have checked the details shared in the private message and tested the scenario in our environment. We were able to reproduce similar API behavior on our side. An investigation ticket, PDFNET-50451, has been logged in our issue tracking system for further analysis. We will look into the details and keep you posted on the status of its resolution. Please be patient and give us some time.

We are sorry for the inconvenience.

@asad.ali
Thank you, I am looking forward to the result.
