How to extract separate text strings

KDSDEV · June 1, 2018, 8:36am

On page 2 of this file, 140120170615407771.pdf (2.7 MB), there are three separate yellow rectangles, each containing Japanese words, lined up horizontally near the top of the page.

I’d like to extract those texts as separate objects.
But the following code doesn’t do that.
In the log file 140120170615407771_extracted.zip (9.4 KB), you can find "8th text = ", which is a concatenation of the left and center words: “第三者割当増資新経営体制”.
I want them to be extracted as “新経営体制” and “第三者割当増資”.

How can I get these words separately?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

using Aspose;
using Aspose.Pdf;
using Aspose.Pdf.Text;

namespace pdf
{
    class Program
    {
        static void Main(string[] args)
        {
            License license = new License();
            license.SetLicense("Aspose.Pdf.lic");
            string pdffile = "140120170615407771.pdf";
            Document doc = new Document(pdffile);
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
            textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
            doc.Pages.Accept(textFragmentAbsorber);
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            int i = 0;
            string extracted = "";
            foreach (TextFragment textFragment in textFragmentCollection)
            {
                try
                {
                    extracted += string.Format("{0}th text = {1}\r\n",++i,textFragment.Text);
                }
                catch (Exception ex)
                {
                    StreamWriter stream = new StreamWriter(pdffile.Replace(".pdf", "_log.txt"));
                    stream.WriteLine("[" + DateTime.Now.ToString() + "]");
                    stream.WriteLine("[message]\r\n " + ex.Message);
                    stream.WriteLine("[source]\r\n " + ex.Source);
                    stream.WriteLine("[stacktrace]\r\n" + ex.StackTrace);
                    stream.Close();
                }
            }
            File.WriteAllText(pdffile.Replace(".pdf", "_extracted.txt"), extracted);
        }
    }
}


This is not limited to this file. In other cases I face a similar situation, where words positioned far apart are concatenated into one string by Aspose.PDF, so I would like a general solution.

Thank you.

asad.ali · June 1, 2018, 4:50pm

@KDSSHO

Thanks for contacting support.

The TextFragmentAbsorber extracts text in the same way it was added to the PDF document. Each TextFragment may have been added with several TextSegment objects. Please modify your code snippet as follows to get the desired text separately.

......
foreach (TextFragment textFragment in textFragmentCollection)
{
    try
    {
        // A fragment may consist of several segments; write out each segment's text separately.
        foreach (TextSegment ts in textFragment.Segments)
        {
            extracted += string.Format("{0}th text = {1}\r\n", ++i, ts.Text);
        }
    }
    catch (Exception ex)
    {
        // dataDir is the output directory, assumed to be defined in the elided code above.
        using (StreamWriter stream = new StreamWriter(dataDir + "extractedtext_log.txt"))
        {
            stream.WriteLine("[" + DateTime.Now.ToString() + "]");
            stream.WriteLine("[message]\r\n " + ex.Message);
            stream.WriteLine("[source]\r\n " + ex.Source);
            stream.WriteLine("[stacktrace]\r\n" + ex.StackTrace);
        }
    }
}
.....

extractedtext_log.zip (12.9 KB)

Please also check the 30th and 31st texts in the attached output. The Japanese strings are extracted separately, as per your requirement. If you need any further assistance, please feel free to let us know.
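
For reference, the segment loop above can be combined with the original program into one complete sketch. It uses only the calls already shown in this thread (TextFragmentAbsorber, TextSearchOptions, Pages.Accept, TextFragments, Segments, Text). Collecting the output in a StringBuilder and leaving out the try/catch logging are simplifications made here for brevity, not requirements of the API. The file names are the ones from the original post.

```csharp
using System.IO;
using System.Text;

using Aspose.Pdf;
using Aspose.Pdf.Text;

namespace pdf
{
    class Program
    {
        static void Main(string[] args)
        {
            // License and input file as in the original post.
            License license = new License();
            license.SetLicense("Aspose.Pdf.lic");
            string pdffile = "140120170615407771.pdf";
            Document doc = new Document(pdffile);

            // Same regular-expression search over all pages as in the original post.
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
            textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
            doc.Pages.Accept(textFragmentAbsorber);

            // StringBuilder is a simplification chosen here; the original used string concatenation.
            StringBuilder extracted = new StringBuilder();
            int i = 0;
            foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
            {
                // Write one output line per segment instead of one per fragment.
                foreach (TextSegment ts in textFragment.Segments)
                {
                    extracted.AppendFormat("{0}th text = {1}\r\n", ++i, ts.Text);
                }
            }

            File.WriteAllText(pdffile.Replace(".pdf", "_extracted.txt"), extracted.ToString());
        }
    }
}
```

Because the counter now advances per segment rather than per fragment, the numbering in the output differs from the original log.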

KDSDEV · June 4, 2018, 8:11am

Hi Asad,

Thank you so much. The suggested change worked well in my environment.

asad.ali · June 4, 2018, 6:29pm

@KDSSHO

Thanks for your feedback.

Please keep using our API, and if you need any further assistance, please feel free to let us know by creating a new topic in our forums. We will be happy to assist you.