PDF to text extraction without line break

rthapliyal · November 3, 2021, 5:10am

Hi,
We are using Aspose.PDF for .NET. I want to extract the text from a PDF and apply some search criteria to that text in my application. I am using TextAbsorber, but it adds a newline (\r\n) after each line.
I want to get the complete PDF text, or each paragraph, without any line breaks. Is this possible using Aspose.PDF?

Thanks,
Rajesh

mudassir.fayyaz · November 3, 2021, 5:03pm

@rthapliyal

Please see the following documentation article. If you still encounter the issue, please share the source file and the desired output.

Extract Paragraph from PDF C#

rthapliyal · November 8, 2021, 5:07am

Thanks @mudassir.fayyaz,

I am trying to extract text from the attached Sample.pdf file, and I am able to fetch the text paragraph by paragraph without line breaks.
Sample.pdf (200.1 KB)

But sometimes it removes the spaces between words, and sometimes it inserts double spaces. How can I fix this?
image.png (62.9 KB)

Code:
    using System.IO;
    using System.Text;
    using Aspose.Pdf;
    using Aspose.Pdf.Text;

    // Apply the Aspose.PDF license
    Aspose.Pdf.License asposeLicense = new Aspose.Pdf.License();
    string licensePath = @"C:\PDFTest\Aspose.PDF.NET.lic";
    asposeLicense.SetLicense(licensePath);

    string _dataDir = @"C:\PDFTest\";
    Document doc = new Document(_dataDir + "MSA_Sample.pdf");

    // Instantiate ParagraphAbsorber and let it analyze the whole document
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(doc);

    StringBuilder paragraphText = new StringBuilder();
    foreach (PageMarkup markup in absorber.PageMarkups)
    {
        foreach (MarkupSection section in markup.Sections)
        {
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                // Concatenate the fragments of one paragraph with no line breaks
                foreach (TextFragment fragment in paragraph.Fragments)
                {
                    paragraphText.Append(fragment.Text);
                }
                // Separate paragraphs with a blank line
                paragraphText.Append("\r\n\r\n");
            }
        }
    }

    // Save the text file
    string outputPath = _dataDir + "TXT_Paragraphs.txt";
    File.WriteAllText(outputPath, paragraphText.ToString());

mudassir.fayyaz · November 9, 2021, 11:59am

@rthapliyal

A ticket with ID PDFNET-50867 has been created in our issue tracking system to investigate this further on our end. This thread has been linked to the ticket so that you will be notified once the issue is fixed.
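
Until the ticket is resolved, one possible stopgap for the double-space half of the problem is to normalize whitespace after extraction. The following is only a minimal sketch using plain .NET string handling. NormalizeWhitespace is a hypothetical helper name, not part of the Aspose.PDF API. It also cannot restore spaces that are missing from the extracted text, so it does not replace a fix in the library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCleanup
{
    // Hypothetical helper (an assumption for illustration, not an Aspose.PDF API).
    // Turns line breaks into spaces and collapses runs of spaces or tabs
    // into a single space.
    public static string NormalizeWhitespace(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Replace every line break (\r\n, \r, or \n) with one space
        string singleLine = Regex.Replace(text, @"\r\n|\r|\n", " ");

        // Collapse any run of spaces or tabs into one space
        string collapsed = Regex.Replace(singleLine, @"[ \t]+", " ");

        // Remove leading and trailing whitespace
        return collapsed.Trim();
    }
}
```

In the code posted above, this could be applied to the concatenated text of each paragraph before the "\r\n\r\n" separator is appended, so that the blank lines between paragraphs are preserved.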