Paragraph Text Property

CsMaster1984 · February 14, 2022, 7:24pm

Is there a way to extract paragraph text in raw format? When using ParagraphAbsorber, the paragraph Text property returns a trimmed string, and the same is true of the fragment Text and segment Text properties.

I have tried TextDevice to extract the text, and its option to extract in Pure or Raw mode works well. However, I need to use ParagraphAbsorber because I need the exact location of the text. My code is below:

// Requires: using System; using System.Collections.Generic;
private static void ParseDocument(Aspose.Pdf.Document doc)
{
    Aspose.Pdf.Text.ParagraphAbsorber absorber = new Aspose.Pdf.Text.ParagraphAbsorber();

    absorber.Visit(doc);

    for (int pageIndex = 0; pageIndex < absorber.PageMarkups.Count; pageIndex++)
    {
        Aspose.Pdf.Text.PageMarkup markup = absorber.PageMarkups[pageIndex];

        for (int sectionIndex = 0; sectionIndex < markup.Sections.Count; sectionIndex++)
        {
            Aspose.Pdf.Text.MarkupSection section = markup.Sections[sectionIndex];

            for (int paragraphIndex = 0; paragraphIndex < section.Paragraphs.Count; paragraphIndex++)
            {
                Aspose.Pdf.Text.MarkupParagraph paragraph = section.Paragraphs[paragraphIndex];
                Console.WriteLine($"Paragraph {paragraphIndex} Text: {paragraph.Text}");

                for (int lineIndex = 0; lineIndex < paragraph.Lines.Count; lineIndex++)
                {
                    List<Aspose.Pdf.Text.TextFragment> line = paragraph.Lines[lineIndex];

                    for (int fragmentIndex = 0; fragmentIndex < line.Count; fragmentIndex++)
                    {
                        Aspose.Pdf.Text.TextFragment fragment = line[fragmentIndex];
                        Console.WriteLine($"Fragment {fragmentIndex} Text: {fragment.Text}");

                        for (int segmentIndex = 0; segmentIndex < fragment.Segments.Count; segmentIndex++)
                        {
                            // Segments is indexed from 1 in Aspose.PDF, hence the + 1
                            Aspose.Pdf.Text.TextSegment segment = fragment.Segments[segmentIndex + 1];
                            Console.WriteLine($"Segment {segmentIndex} Text: {segment.Text}");
                        }
                    }
                }
            }
        }
    }
}

asad.ali · February 14, 2022, 10:48pm

@CsMaster1984

Could you please share a few more details, such as how you are getting the trimmed value? Please also share a sample PDF document, the output you are getting, and a sample of the output you expect, so that we can assist you accordingly.

CsMaster1984 · February 19, 2022, 12:48pm

Crystal Reports - PrintTest - 20220213 203637315.pdf (169.7 KB)
ParagraphAbsorber Output.png (37.8 KB)
TextDevice Extraction.png (6.7 KB)

Hi asad.ali,
I have attached the original PDF document, along with the ParagraphAbsorber output and the output I got from TextDevice text extraction.

I have highlighted the first paragraph in both outputs. As you can see, the spaces are missing from the ParagraphAbsorber output, whereas the TextDevice extraction keeps the spaces and the full line of text.

Thanks in advance.

asad.ali · February 19, 2022, 8:11pm

@CsMaster1984

Thanks for sharing more details.

Please confirm whether you want to get the exact coordinates of the paragraphs on the PDF page. We will try to create a code example accordingly and share it with you. Please also tell us how you intend to use the paragraph locations once you have them.

CsMaster1984 · February 19, 2022, 11:30pm

I need to get each element together with its raw text, so yes, I need the exact location of each paragraph and the related raw text.

The main idea of my project is to extract the data from any PDF document, along with the exact location of each part inside the file, for audit purposes. I need this information to verify it against the data submitted to another ERP system.

asad.ali · February 20, 2022, 12:09pm

@CsMaster1984

Please try using TextFragmentAbsorber as shown below and see whether the extracted text location is correct and suitable for your needs:

// dataDir is the folder that contains the PDF
Aspose.Pdf.Document doc = new Aspose.Pdf.Document(dataDir + "input.pdf");

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
doc.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// Loop through the fragments; Rectangle holds each fragment's location on the page
foreach (TextFragment textFragment in textFragmentCollection)
{
    var rect = textFragment.Rectangle;
}

CsMaster1984 · February 21, 2022, 3:23pm

Thanks for your assistance.

I tried the code snippet you provided, but the extracted text is not what I expected.
In the PDF document I shared earlier, the first line contains the text “Test Report - Name” in the middle and “2 February 2022” on the right side.
The result has the correct rectangle, but the spaces between “Test Report - Name” and “2 February 2022” have been trimmed, as the fragment text below shows.
I expect to get:
" Test Report - Name 2 February 2022"

Trace Log:
Fragment Text: [Test Report - Name 2 February 2022]
Position: [367.55999025, 554.338526123793]
Rectangle: [367.55999025, 554.338526123793, 769.581070613929, 565.294113630087]

Thank you again for your time and consideration

CsMaster1984 · February 21, 2022, 3:25pm

It seems the forum reply also trimmed the spaces.

I expect to get:
“\t\t\t Test Report - Name \t\t\t\t 2 February 2022”
I used \t to represent the runs of spaces between the text.

asad.ali · February 21, 2022, 8:54pm

@CsMaster1984

Thanks for the feedback.

Our understanding is that TextFragmentAbsorber gives you the actual position of the text but trims the spaces between words, whereas TextAbsorber or TextDevice gives you the text in raw format but without position attributes. Please confirm that we have understood the issue correctly so that we can log it and share the ID with you.

CsMaster1984 · February 22, 2022, 9:34pm

Yes, you got it right; that is exactly what I meant.
As a workaround, I have added an extension method that extracts the text from the location produced by TextFragmentAbsorber, as in the snippet below, and that helped me fix my issue:

public static string ExtractText(this Aspose.Pdf.Document document, int PageIndex, Aspose.Pdf.Rectangle rect)
{
    Aspose.Pdf.Text.TextAbsorber absorber = new Aspose.Pdf.Text.TextAbsorber();
    absorber.TextSearchOptions.LimitToPageBounds = true;
    absorber.ExtractionOptions = new Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure);
    absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(rect.LLX, rect.LLY, rect.URX, rect.URY);

    document.Pages[PageIndex].Accept(absorber);

    return absorber.Text;
}
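
A minimal sketch of how the extension method above might be combined with TextFragmentAbsorber, so that each fragment's rectangle is paired with the untrimmed text from the same area. The class name FragmentAudit and the method name DumpFragments are hypothetical, and the sketch assumes ExtractText is declared in a static class that is in scope. It absorbs one page at a time so the page number passed to ExtractText is known.

```csharp
using System;
using Aspose.Pdf.Text;

// Hypothetical helper, not part of Aspose.PDF.
public static class FragmentAudit
{
    // Assumes the ExtractText extension method shown above is in scope.
    public static void DumpFragments(Aspose.Pdf.Document doc)
    {
        // Aspose.PDF page collections are indexed from 1.
        for (int pageNumber = 1; pageNumber <= doc.Pages.Count; pageNumber++)
        {
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            doc.Pages[pageNumber].Accept(absorber);

            foreach (TextFragment fragment in absorber.TextFragments)
            {
                // Location comes from the fragment; text is re-read from that area.
                Aspose.Pdf.Rectangle rect = fragment.Rectangle;
                string rawText = doc.ExtractText(pageNumber, rect);

                Console.WriteLine($"Page {pageNumber} Rectangle: [{rect.LLX}, {rect.LLY}, {rect.URX}, {rect.URY}]");
                Console.WriteLine($"Raw Text: [{rawText}]");
            }
        }
    }
}
```

Whether the re-extracted text keeps the spacing you need will depend on the document and on the formatting mode passed to TextExtractionOptions, so it is worth checking the output against the original PDF.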

asad.ali · February 23, 2022, 12:21am

@CsMaster1984

It is good to know that you were able to sort out the issue you were facing by adopting a workaround. Furthermore, we have also logged an investigation ticket as PDFNET-51411 in our issue tracking system for further analysis. We will look into its details and keep you posted with the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.