How to get a particular paragraph from a PDF in C#

AabidH · October 25, 2021, 11:37am

I want code to get a particular paragraph from a PDF and highlight it.
The code I used is:
        Document doc = new Document("sample.pdf");

        // Input paragraph to look for
        string markData = "Vestibulum neque massa, scelerisque sit amet ligula eu, " +
            "congue molestie mi." +
            " Praesent ut  varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum  condimentum.";

        // Create ParagraphAbsorber object and run it over the document
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(doc);

        foreach (PageMarkup markup in absorber.PageMarkups)
        {
            int i = 1;
            foreach (MarkupSection section in markup.Sections)
            {
                int j = 1;

                foreach (MarkupParagraph paragraph in section.Paragraphs)
                {
                    StringBuilder paragraphText = new StringBuilder();

                    foreach (List<TextFragment> line in paragraph.Lines)
                    {
                        foreach (TextFragment fragment in line)
                        {
                            paragraphText.Append(fragment.Text);
                            fragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow);
                        }
                        paragraphText.Append("\r\n");
                    }
                    paragraphText.Append("\r\n");

                    Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
                    Console.WriteLine(paragraphText.ToString());

                    j++;
                }
                i++;
            }
        }
        HtmlSaveOptions htmlOptions = new HtmlSaveOptions();

        // Specify to render PDF document layers separately in output HTML
        htmlOptions.ConvertMarkedContentToLayers = true;

        // Save the document
        doc.Save(@"ht.html", htmlOptions);
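
Note that the loop above highlights every fragment and never compares the extracted text with markData. One possible way to make that comparison is sketched below, using only the .NET base library. ParagraphMatcher, Normalize, and ContainsParagraph are hypothetical names chosen for illustration and are not part of Aspose.PDF; the sketch assumes that differences in spacing and line breaks between the PDF text and the input paragraph should be ignored.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): compares extracted paragraph
// text with the input paragraph while ignoring differences in whitespace.
public static class ParagraphMatcher
{
    // Collapse every run of whitespace (spaces, tabs, line breaks) to one space
    // and remove leading and trailing whitespace.
    public static string Normalize(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // True when the normalized extracted text contains the normalized input.
    public static bool ContainsParagraph(string extracted, string wanted)
    {
        return Normalize(extracted).IndexOf(Normalize(wanted), StringComparison.Ordinal) >= 0;
    }
}
```

In the loop above, paragraphText.ToString() would be passed as extracted and markData as wanted. Because the paragraph text is only complete after the inner loops finish, the background color would have to be set in a second pass over paragraph.Lines, and only when the call returns true.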

asad.ali · October 25, 2021, 10:05pm

@AabidH

Could you please share your sample PDF document for our reference? We will test the scenario in our environment and address it accordingly.

AabidH · October 26, 2021, 4:37am

I want to get a particular paragraph from a page.
file-sample_150kB.pdf (139.4 KB)

AabidH · October 26, 2021, 12:17pm

I want code to extract a particular paragraph from a PDF in .NET C#.

asad.ali · October 26, 2021, 8:03pm

@AabidH

Please try the code snippet below to extract a multiline paragraph from a PDF document:

Document doc = new Document(dataDir + "file-sample_150kB.pdf");
string searchText = @"Vestibulum\s+neque\s+massa,\s+scelerisque\s+sit\s+amet\s+ligula\s+eu,\s+congue\s+molestie\s+mi\.\s+Praesent\s+ut\s+" +
    @"varius\s+sem\.\s+Nullam\s+at\s+porttitor\s+arcu,\s+nec\s+lacinia\s+nisi\.\s+Ut\s+ac\s+dolor\s+vitae\s+odio\s+interdum\s+condimentum\.";
TextFragmentAbsorber absorber = new TextFragmentAbsorber(searchText, new TextSearchOptions(true));
doc.Pages.Accept(absorber);
string text = "";
foreach(TextFragment fragment in absorber.TextFragments)
{
    text += fragment.Text + " ";
}
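
If the paragraph comes from user input rather than a hand-written pattern, a pattern of the same shape can be generated from plain text. The following is a minimal sketch using only the .NET base library. ParagraphPattern.Build is a hypothetical helper name, not part of Aspose.PDF, and the sketch assumes the returned string is passed to TextFragmentAbsorber together with TextSearchOptions(true), as in the snippet above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): turns plain paragraph text
// into a regex that tolerates any run of whitespace, including line breaks,
// between words.
public static class ParagraphPattern
{
    public static string Build(string paragraph)
    {
        // A null separator array splits on any whitespace; empty entries are dropped.
        string[] words = paragraph.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Escape regex metacharacters such as '.' so they match literally,
        // then join the words with \s+.
        return string.Join(@"\s+", words.Select(Regex.Escape));
    }
}
```

For example, ParagraphPattern.Build(userInput) could replace the hand-written searchText, so the user only has to paste the paragraph as it appears in the PDF.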

AabidH · October 27, 2021, 4:47am

Yes, I tried it. It only extracts some of the text, not the whole paragraph. How can I extract the whole paragraph?

asad.ali · October 27, 2021, 9:29pm

@AabidH

We tested the same code snippet in our environment and it was able to extract the paragraph that you have shared in the code snippet in your first post. Can you please share the complete paragraph which you want to extract? We will again test the scenario in our environment and address it accordingly.

AabidH · October 28, 2021, 4:50am

Yes, that first code only extracted the starting part of each page.
What I mean is that, from any PDF, I want to extract a whole paragraph from the page.

From the sample PDF, the paragraph is below:
Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui.
Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam,
pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti
sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper
justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo
posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut
et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo
imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem
sed turpis imperdiet eleifend sit amet id sapien.

So I want to extract the above paragraph and highlight it.

AabidH · October 28, 2021, 5:04am

I also wanted to highlight it, which was not happening with that code.
Do you have better code than mine that can get the whole paragraph from a PDF?
How can we take an input paragraph from users and highlight it?
And how can we identify that the text continues on the next line?

asad.ali · October 28, 2021, 8:26pm

@AabidH

We have logged an investigation ticket as PDFNET-50453 in our issue tracking system to investigate the feasibility of extracting and highlighting whole paragraphs in a PDF. We have linked it with this forum thread so that you will receive a notification as soon as it is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

AabidH · November 3, 2021, 9:59am

How do we convert a PDF into a single HTML file, without a separate folder for CSS and other resources?

asad.ali · November 3, 2021, 8:22pm

@AabidH

Please use the following code snippet to convert a PDF to a single HTML file:

Document doc = new Document(dataDir + "test.pdf");
HtmlSaveOptions newOptions = new HtmlSaveOptions();
// Embed CSS, fonts, and images into the single output HTML file
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
// This is just an optimization for IE and can be omitted
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.RemoveEmptyAreasOnTopAndBottom = true;
string outHtmlFile = dataDir + @"output.html";
doc.Save(outHtmlFile, newOptions);

manipriya · April 18, 2022, 11:41am

Is PDFNET-50453 resolved?

Any update on this?

asad.ali · April 18, 2022, 2:37pm

@manipriya

Regretfully, the earlier logged ticket is not yet resolved. However, if you are facing a similar issue with a different file, please share it with us along with the sample code snippet that you are using. We will test the scenario in our environment and log a separate issue for the specific file in our issue tracking system.