Extracting text with a rectangle: the result has lots of line breaks

Glority_Developer · March 24, 2021, 8:46am

Hi,
I want to copy the selected text from a PDF.
First I choose a rectangular area and use its coordinates to initialize TextAbsorber.TextSearchOptions.Rectangle, and then I select a page to accept this TextAbsorber.
Finally, I found that textAbsorber.Text has lots of line breaks at the end. Sometimes they also appear between lines.
Here is the sample code:

     var absorber = new TextAbsorber();
     absorber.TextSearchOptions.LimitToPageBounds = true;
     var ltPoint = CalculateSelectedAreaOnPage(SelectRectLeftTopPoint);
     var rbPoint = CalculateSelectedAreaOnPage(SelectRectRightBottomPoint);
     absorber.TextSearchOptions.Rectangle = (new Aspose.Pdf.Rectangle(ltPoint.X, ltPoint.Y, rbPoint.X, rbPoint.Y));
     CurPreviewPage.Accept(absorber);
     string extractedText = absorber.Text;

Please ignore the function CalculateSelectedAreaOnPage(); it is just a function that converts a point.
Please help me check the extracted text result. Thanks.

asad.ali · March 24, 2021, 5:33pm

@Glority_Developer

Could you please explain a bit more by sharing your sample source PDF and the expected output text? We will test the scenario in our environment and address it accordingly.

Glority_Developer · March 25, 2021, 10:42am

Thanks for your reply.
Here.pdf (95.8 KB)
The selected rectangle is shown in SelectedRect.png (55.5 KB),
and the extracted text result is in extractedTextResult.png (12.2 KB).
I selected all of the text in the result so that you can see the extra line breaks.
Sometimes a line break also appears between lines of text; please check it.
If you have any questions, please let me know.
Thanks.

asad.ali · March 25, 2021, 9:37pm

@Glority_Developer

We need the coordinate values (ltPoint and rbPoint) that you pass to the Rectangle in your line of code above. This way we will be able to test the scenario in our environment and share our feedback with you.

Glority_Developer · March 26, 2021, 1:53am

Hi,
the ltPoint.X = 71.250475;
the ltPoint.Y = 485.250638;
the rbPoint.X = 543.749519;
the rbPoint.Y = 425.250763;
thanks

asad.ali · March 26, 2021, 3:29pm

@Glority_Developer

Please try the code snippet below to extract the text without the extra spaces:

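using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the source PDF.
Document doc = new Document(dataDir + "extractTextTextPDF.pdf");
TextAbsorber ta = new TextAbsorber();
// Limit extraction to the selected area, using the coordinates shared above.
ta.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(71.250475, 485.250638, 543.749519, 425.250763);
// Raw mode extracts the text without trying to reproduce the page layout.
ta.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
doc.Pages.Accept(ta);
string text = ta.Text;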
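
If some extra line breaks still remain in the extracted string, one possible workaround is to clean it up after extraction. The following is only a minimal sketch: ExtractedTextCleaner and CollapseLineBreaks are hypothetical names, not part of Aspose.PDF, and the sketch assumes that blank lines in the selected area carry no meaning and can be dropped.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it is not an Aspose.PDF API.
public static class ExtractedTextCleaner
{
    // Collapses runs of blank lines into one line break and trims the ends.
    // Assumption: blank lines inside the selected area are not significant.
    public static string CollapseLineBreaks(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Normalize Windows (\r\n) and lone \r line endings to \n.
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // A line break followed by one or more empty or whitespace-only
        // lines becomes a single line break.
        string collapsed = Regex.Replace(normalized, @"\n(?:[ \t]*\n)+", "\n");

        // Remove leading and trailing whitespace, including line breaks.
        return collapsed.Trim();
    }
}
```

It could be applied to the result of the snippet above, for example `string cleaned = ExtractedTextCleaner.CollapseLineBreaks(ta.Text);`. Note that the result uses \n as the line separator.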