Extract text with coordinates and rotation

vladimirborisovich · July 24, 2017, 9:57am

Dear Sir,
I first extract the text with code like this:

TextDevice textDevice = new TextDevice();
TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
textDevice.ExtractionOptions = textExtOptions;
textDevice.Process(pdfPage, textStream);

Then I split the text into lines and search for each line in the document again, this time to get its position:

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(SomeLineOfText);
pdfPageLast.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textSegment in textFragmentCollection)
{
    string spt = textSegment.Position.ToString();
}

This takes a lot of time!
Is it possible to obtain a text string and its position simultaneously? Searching for text just to get its position takes a long time. Also, how can I determine its rotation? Neither the text baseline nor its rectangle gives any indication of this much-needed data.

asad.ali · July 24, 2017, 1:44pm

@vladimirborisovich

Thanks for contacting support.

Would you please share your sample document so that we can test the scenario in our environment and respond accordingly?

vladimirborisovich · July 25, 2017, 5:44am

Result_PDF.pdf (520.4 KB)
Input_PDF.zip (329.6 KB)
I am using Aspose.Pdf_11.4.0.msi, for the needs of my organisation only.
150324084925
54806

vladimirborisovich · July 25, 2017, 7:22am

Here is the code I used for text extraction:

    public void ExtractText(string fileName, string TempPath, string PDFname)
    {
        Document pdfDocument = new Document(fileName);
        System.Text.StringBuilder buil = new System.Text.StringBuilder();
        //string to hold extracted text
        string extractedText = "";

        List<string> Text_Pos = new List<string>();
        List<Aspose.Pdf.Rectangle> Text_Rect = new List<Aspose.Pdf.Rectangle>();
        string[] myStr = new string[10];
       
        Page pdfPageLast = pdfDocument.Pages[1];
        int iPage = 1;

        foreach (Page pdfPage in pdfDocument.Pages)
        {
            string txtFile = TempPath + PDFname + "_" + iPage.ToString() + ".TXT";

            using (MemoryStream textStream = new MemoryStream())
            {
                TextDevice textDevice = new TextDevice();
                TextDevice textDevice1 = new TextDevice();
                pdfPageLast = pdfPage;
                TextExtractionOptions textExtOptions = new
                TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
                textDevice.ExtractionOptions = textExtOptions;

                //convert a particular page and save text to the stream
                textDevice.Process(pdfPage, textStream);
                
                //obtaining text from the PDF page, closing the memory stream and splitting the text into lines
                extractedText = Encoding.Unicode.GetString(textStream.ToArray());
                textStream.Close();
                myStr = extractedText.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

                // Searching for the position of each found line of text
                foreach (string Lin in myStr)
                {
                    string TrimLin = Lin.Trim();
                    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(TrimLin);
                    pdfPageLast.Accept(textFragmentAbsorber);
                    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
                    foreach (TextFragment textSegment in textFragmentCollection)
                    {
                        string AcadText = textSegment.Position + textSegment.Text;

                        // exclude repeated lines of the text
                        // 1. Similar text and position
                        if (Text_Pos.Contains(AcadText)) continue;
                        Aspose.Pdf.Rectangle rectA = textSegment.Rectangle;

                        // 2. Repetitive parts of the same text
                        bool her = true;
                        foreach (Aspose.Pdf.Rectangle rectB in Text_Rect)
                        {
                            Aspose.Pdf.Rectangle rectC = rectA.Intersect(rectB); 
                            if (rectC != null) { her = false; break; }
                        }
                        if (her) Text_Rect.Add(rectA); else continue;
                        Text_Pos.Add(AcadText);

                        // Not significant: converting characters from the existing encoding to the required one
                        string ParmValue = Transliteration.TransliterateFromAvevaToRus(textSegment.Text);

                        buil.Append("Text : " + ParmValue + "\n");
                        buil.Append("Position :" + textSegment.Position + "\n"); break;

                    }  // foreach (TextFragment textSegment in textFragmentCollection)
                }  // foreach (string Lin in myStr)

                string ResFilNam = TempPath + PDFname + "_" + iPage.ToString() + ".TXT"; iPage++;
                File.WriteAllText(ResFilNam, buil.ToString());
                buil.Clear();

            }  //  using (MemoryStream textStream = new MemoryStream())
         }  //foreach (Page pdfPage in pdfDocument.Pages)
    }   // public void ExtractText()

asad.ali · July 25, 2017, 2:38pm

@vladimirborisovich

Thanks for sharing sample document(s).

To extract all text together with its position and rotation, you can use TextFragmentAbsorber alone. I have tested the scenario with the latest version of the API (i.e., Aspose.Pdf for .NET 17.7) using TextFragmentAbsorber and did not notice any delay in execution. The required information was extracted within ~1 sec.

Please check the following code snippet, which I used in our environment to test the scenario.

Document pdfDocument = new Document(dataDir + "Input_PDF.pdf");
System.Text.StringBuilder buil = new System.Text.StringBuilder();
List<string> Text_Pos = new List<string>();
List<Aspose.Pdf.Rectangle> Text_Rect = new List<Aspose.Pdf.Rectangle>();
string[] myStr = new string[10];

Page pdfPageLast = pdfDocument.Pages[1];
int iPage = 1;

foreach (Page pdfPage in pdfDocument.Pages)
{
 string txtFile = dataDir + "_" + iPage.ToString() + ".TXT";
 TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
 pdfPage.Accept(textFragmentAbsorber);
 TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
 foreach (TextFragment textSegment in textFragmentCollection)
 {
  string AcadText = textSegment.Position + textSegment.Text;
  // exclude repeated lines of the text
  // 1. Similar text and position
  if (Text_Pos.Contains(AcadText)) continue;
  Aspose.Pdf.Rectangle rectA = textSegment.Rectangle;

  // 2. Repetitive parts of the same text
  bool her = true;
  foreach (Aspose.Pdf.Rectangle rectB in Text_Rect)
  {
   Aspose.Pdf.Rectangle rectC = rectA.Intersect(rectB);
   if (rectC != null) { her = false; break; }
  }
  if (her) Text_Rect.Add(rectA); else continue;
  Text_Pos.Add(AcadText);
  string ParmValue = textSegment.Text;
  buil.Append("Text : " + ParmValue + "\n");
  buil.Append("Position :" + textSegment.Position + "\n");
  buil.Append("Rotation :" + textSegment.TextState.Rotation.ToString() + "\n"); //break;
 }  // foreach (TextFragment textSegment in textFragmentCollection)

 string ResFilNam = dataDir + "_" + iPage.ToString() + ".TXT"; iPage++;
 File.WriteAllText(ResFilNam, buil.ToString());
 buil.Clear();
}

Furthermore, as the code snippet above shows, the rotation of a TextFragment can be read from the TextFragment.TextState.Rotation property. However, while testing the scenario, I observed that the rotation of all the text was returned as zero, which is incorrect because some of the text has a rotation angle defined in the PDF.
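
For background, rotated text in a PDF is normally produced by a transformation matrix [a b c d e f] applied to the text, and the angle can be derived from its first two components. The following is only a minimal sketch of that arithmetic; it is not part of the Aspose.PDF API, the class and method names are hypothetical, and it assumes the combined matrix components are already available by some other means.

```csharp
using System;

// Hypothetical helper, not an Aspose type.
public static class TextRotation
{
    // a and b are the first two components of a PDF transformation
    // matrix [a b c d e f] that applies to the text (assumed to be known).
    public static double AngleDegrees(double a, double b)
    {
        // Angle of the transformed x-axis, counter-clockwise from horizontal.
        double degrees = Math.Atan2(b, a) * 180.0 / Math.PI;

        // Map negative angles to their positive equivalent.
        if (degrees < 0) degrees += 360.0;
        return degrees;
    }
}
```

For example, an unrotated matrix has a = 1 and b = 0, while text rotated a quarter turn counter-clockwise has a = 0 and b = 1.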

Hence, I have logged an issue as PDFNET-43113 in our issue tracking system. We will investigate it further and keep you informed about the status of its correction. Please be patient and spare us a little time. As far as the execution time taken by the API is concerned, we recommend upgrading to the latest version.
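
Independent of the API version, the duplicate check inside the fragment loop can be kept cheap as the number of fragments grows. The sketch below is one possible way to organise it, under stated assumptions: Box and FragmentFilter are hypothetical stand-ins rather than Aspose types, and the overlap test treats boxes that only touch at an edge as non-overlapping, which may differ from what Rectangle.Intersect reports.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a fragment's bounding box; not an Aspose type.
public struct Box
{
    public double Llx, Lly, Urx, Ury;

    public Box(double llx, double lly, double urx, double ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }

    // True when the two boxes share some interior area
    // (assumption: boxes that only touch at an edge do not count).
    public bool Overlaps(Box other)
    {
        return Llx < other.Urx && other.Llx < Urx
            && Lly < other.Ury && other.Lly < Ury;
    }
}

// Hypothetical helper that mirrors the two duplicate checks in the loop.
public sealed class FragmentFilter
{
    private readonly HashSet<string> seenKeys = new HashSet<string>();
    private readonly List<Box> acceptedBoxes = new List<Box>();

    // Returns true if the fragment is new and should be written out.
    public bool TryAccept(string positionAndText, Box box)
    {
        // 1. Same text at the same position: a hash lookup instead of a list scan.
        if (seenKeys.Contains(positionAndText)) return false;

        // 2. Overlaps a box that was already accepted.
        foreach (Box accepted in acceptedBoxes)
        {
            if (box.Overlaps(accepted)) return false;
        }

        acceptedBoxes.Add(box);
        seenKeys.Add(positionAndText);
        return true;
    }
}
```

In the loop, the key would be the same position-plus-text string that is built there, and the box would be filled from the fragment's rectangle coordinates. Creating one filter per page would keep boxes from one page from suppressing text on another; whether that is desired depends on the documents.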

Meanwhile, would you please share more details about how you are creating the PDF (Result_PDF.pdf)? I could see the issue you highlighted inside the document. We will test the scenario in our environment and address it accordingly.

We are sorry for the inconvenience.

aspose.notifier · March 12, 2020, 8:55pm

The issues you have found earlier (filed as PDFNET-43113) have been fixed in Aspose.PDF for .NET 20.3.