Get the end character or line break of each paragraph when parsing a PDF

Hello, I want to know whether I can get the end character or line break of each paragraph when parsing a PDF.

@supeiwei Do you use Aspose.PDF to parse your PDF documents?

Yes.

@supeiwei I will move your question into Aspose.PDF forum. My colleagues will help you shortly.

Hello, no one replied to me

@supeiwei

Aspose.PDF provides a feature to determine line breaks, which is explained in the topic(s) below:

Furthermore, you can use the TextFragmentAbsorber class with regular expressions to extract text and other characters from a PDF.

If you run into any issues, please share more details, such as a sample PDF and the expected output, so that we can test the scenario in our environment and address it accordingly.

I looked at the above method, but it doesn’t help me get the newline character when parsing the PDF. Can you give me a code example?

@supeiwei

Can you please provide the information requested above so that we can proceed accordingly?

中华人民共和国公司法2023.pdf (762.9 KB)

What I want to output is the text content of each paragraph separated by line breaks.

@supeiwei

Please confirm whether you want output like that shown in the screenshot:
image.png (2.4 KB)

No, my main purpose is to get line breaks so I can segment the text.

@supeiwei

Please check the code sample below, which we used to detect line breaks in the text content of your PDF. You can modify this approach to get the line breaks and meet your other requirements:

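using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to hold the path of the folder that contains the PDF.
Document pdfDocument = new Document(dataDir + "中华人民共和国公司法2023.pdf");

// TextAbsorber collects the text of every page it visits.
var tfa = new TextAbsorber();
pdfDocument.Pages.Accept(tfa);
var text = tfa.Text;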

// Regex to match line breaks (\r\n, \n, or \r)
MatchCollection matches = Regex.Matches(text, @"\r\n|\r|\n");

if (matches.Count > 0)
{
    Console.WriteLine($"Found {matches.Count} line break(s):");
    foreach (Match match in matches)
    {
        Console.WriteLine($"Line break at position {match.Index}");
    }
}
else
{
    Console.WriteLine("No line breaks found.");
}
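
Since the goal stated earlier in the thread is to segment the text rather than only locate the breaks, here is a minimal sketch of one way the extracted string might be split into paragraphs. `ParagraphSplitter` and `SplitIntoParagraphs` are hypothetical names, not part of Aspose.PDF. The sketch rests on two assumptions that may not hold for every document: line breaks in extracted PDF text usually mark visual lines rather than paragraph ends, so a paragraph is treated as closed only at a blank line or at a line ending in sentence-final punctuation; and wrapped lines are joined without a space, which suits Chinese text but not English.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not an Aspose.PDF API.
public static class ParagraphSplitter
{
    // Assumption: these characters end a sentence, and a line that ends
    // with one of them also ends its paragraph.
    private const string SentenceEnders = "。！？.!?";

    public static List<string> SplitIntoParagraphs(string text)
    {
        var paragraphs = new List<string>();
        var current = new StringBuilder();

        // Split on any line break style (\r\n, \r, or \n).
        foreach (string rawLine in Regex.Split(text, @"\r\n|\r|\n"))
        {
            string line = rawLine.Trim();

            if (line.Length == 0)
            {
                // A blank line closes the paragraph collected so far.
                Flush(paragraphs, current);
                continue;
            }

            // Assumption: wrapped lines are joined without a separator.
            current.Append(line);

            if (SentenceEnders.IndexOf(line[line.Length - 1]) >= 0)
            {
                Flush(paragraphs, current);
            }
        }

        // Keep any trailing text that had no closing punctuation.
        Flush(paragraphs, current);
        return paragraphs;
    }

    private static void Flush(List<string> paragraphs, StringBuilder current)
    {
        if (current.Length > 0)
        {
            paragraphs.Add(current.ToString());
            current.Clear();
        }
    }
}
```

With the snippet above, this could be called as `ParagraphSplitter.SplitIntoParagraphs(tfa.Text)`. The punctuation rule is only a heuristic: a visual line that happens to end with a full stop in the middle of a paragraph would be split too early, so the rule may need adjusting for your document.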