Hello, I want to know whether I can get the end character or line break of each paragraph when parsing a PDF.
yes
Hello, no one replied to me
Aspose.PDF provides a feature to determine line breaks, which is explained in the topics below:
- Determine Line Break In PDF File | Aspose.PDF for .NET API Reference
- Determine Line Break|Aspose.PDF for .NET
Furthermore, you can use the TextFragmentAbsorber class with regular expressions to extract text and other characters from a PDF.
If you run into any issues, please share more details, such as a sample PDF and the expected output, so that we can test the scenario in our environment and address it accordingly.
I looked at the above method, but it doesn’t help me get the newline character when parsing the PDF. Can you give me a code example?
中华人民共和国公司法2023.pdf (762.9 KB)
What I want to output is the text content of each paragraph separated by line breaks.
No, my main purpose is to get line breaks so I can segment the text.
Please check the code sample below, which we used to detect line breaks in the text content of your PDF. You can modify this approach to get the line breaks and meet your other requirements:
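using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is a placeholder: set it to the folder that contains the PDF
string dataDir = "./";
Document pdfDocument = new Document(dataDir + "中华人民共和国公司法2023.pdf");

// Extract the text of all pages
var tfa = new TextAbsorber();
pdfDocument.Pages.Accept(tfa);
var text = tfa.Text;

// Regex to match line breaks (\r\n, \n, or \r)
MatchCollection matches = Regex.Matches(text, @"\r\n|\r|\n");
if (matches.Count > 0)
{
    Console.WriteLine($"Found {matches.Count} line break(s):");
    foreach (Match match in matches)
    {
        // match.Index is the character offset of the line break within the extracted text
        Console.WriteLine($"Line break at position {match.Index}");
    }
}
else
{
    Console.WriteLine("No line breaks found.");
}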
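Since the goal stated earlier in the thread is to segment the text, here is a minimal sketch of one possible next step: splitting the extracted text on the same line-break pattern instead of only reporting positions. LineSegmenter and SplitOnLineBreaks are hypothetical names used for illustration and are not part of Aspose.PDF. The sketch also assumes that empty segments should be dropped. A line break in extracted text may mark the end of a visual line rather than the end of a paragraph, so the result may need further merging for your document.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose.PDF.
public static class LineSegmenter
{
    // Splits text on \r\n, \r, or \n and returns the non-empty, trimmed segments.
    public static List<string> SplitOnLineBreaks(string text)
    {
        var segments = new List<string>();

        // \r\n is listed first so a Windows line ending counts as one break, not two.
        foreach (string piece in Regex.Split(text, @"\r\n|\r|\n"))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                segments.Add(trimmed);
            }
        }

        return segments;
    }
}
```

With the sample above, this could be called as LineSegmenter.SplitOnLineBreaks(tfa.Text) to get one string per extracted line.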