Get the end character or line break of each paragraph when parsing a PDF

supeiwei · November 29, 2024, 9:19am

Hello, I want to know whether I can get the end character or line break of each paragraph when parsing a PDF.

alexey.noskov · November 29, 2024, 10:58am

@supeiwei Do you use Aspose.PDF to parse your PDF documents?

supeiwei · November 30, 2024, 1:39am

yes

alexey.noskov · November 30, 2024, 6:03am

@supeiwei I will move your question to the Aspose.PDF forum. My colleagues will help you shortly.

supeiwei · November 30, 2024, 8:32am

Hello, no one replied to me

asad.ali · November 30, 2024, 9:09pm

@supeiwei

Aspose.PDF provides a feature for determining line breaks, which is explained in the topic(s) below:

Furthermore, you can use the TextFragmentAbsorber class with regular expressions to extract text and other characters from a PDF.

Search and Get Text from Pages of PDF|Aspose.PDF for .NET

If you run into any issues, please share more details, such as a sample PDF and the expected output, so that we can test the scenario in our environment and address it accordingly.

supeiwei · December 3, 2024, 1:33am

I looked at the above method, but it doesn’t help me get the newline character when parsing the PDF. Can you give me a code example?

asad.ali · December 3, 2024, 12:59pm

@supeiwei

Could you please provide the information requested above (a sample PDF and the expected output) so that we can proceed?

supeiwei · December 4, 2024, 2:29am

中华人民共和国公司法2023.pdf (762.9 KB)

What I want to output is the text content of each paragraph separated by line breaks.

asad.ali · December 4, 2024, 12:41pm

@supeiwei

Please confirm whether you want output like that shown in the screenshot:
image.png (2.4 KB)

supeiwei · December 5, 2024, 1:25am

No, my main purpose is to get line breaks so I can segment the text.

asad.ali · December 5, 2024, 4:06pm

@supeiwei

Please check the code sample below, which we used to detect line breaks in the text content of your PDF. You can modify this approach to get the line breaks and meet your other requirements:

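using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to be a string holding the folder path of the PDF; it is not defined in this snippet.
Document pdfDocument = new Document(dataDir + "中华人民共和国公司法2023.pdf");

// Extract the text of all pages with TextAbsorber.
var tfa = new TextAbsorber();
pdfDocument.Pages.Accept(tfa);
var text = tfa.Text;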

// Regex to match line breaks (\r\n, \n, or \r)
MatchCollection matches = Regex.Matches(text, @"\r\n|\r|\n");

if (matches.Count > 0)
{
    Console.WriteLine($"Found {matches.Count} line break(s):");
    foreach (Match match in matches)
    {
        Console.WriteLine($"Line break at position {match.Index}");
    }
}
else
{
    Console.WriteLine("No line breaks found.");
}
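
Since the stated goal is to segment the text, one possible next step is to split the extracted string on the same line-break pattern instead of only reporting positions. The following is a minimal sketch using only the .NET standard library. `LineBreakSegmenter` and its `Split` method are hypothetical names introduced for illustration and are not part of Aspose.PDF; the sketch also assumes the text has already been extracted into a string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not an Aspose.PDF type): splits already-extracted
// text into segments at each line break.
public static class LineBreakSegmenter
{
    // \r\n is listed first so a Windows line ending counts as one break, not two.
    private static readonly Regex LineBreak = new Regex(@"\r\n|\r|\n");

    public static List<string> Split(string text)
    {
        var segments = new List<string>();
        foreach (string part in LineBreak.Split(text))
        {
            string trimmed = part.Trim();
            // Skip segments that are empty or whitespace only (for example, blank lines).
            if (trimmed.Length > 0)
            {
                segments.Add(trimmed);
            }
        }
        return segments;
    }
}
```

With the sample above, calling `LineBreakSegmenter.Split(text)` would return one string per extracted line. Whether each line corresponds to a full paragraph depends on how the PDF lays out its text: extracted line breaks often reflect visual lines rather than paragraph ends, so consecutive segments may still need to be merged for a given document.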