Extract PDF Text divided by paragraphs using Aspose.PDF for .NET - keep original formatting

natnapaporn.thin · August 17, 2011, 4:19am

Hi Aspose team,
I've used the extract-PDF-to-text function from Aspose.Pdf.Facades.dll (v6.0.0), but the paragraphs in the result have changed: the function splits each paragraph with CRLF. Is there any way to keep the original paragraphs?

Thank you in advance.

nausherwan.aslam · August 17, 2011, 5:31am

Hi,

Please share your sample code and template file here so that we can reproduce the issue. This will help us identify the problem quickly.

Thank You and Best Regards,

natnapaporn.thin · August 17, 2011, 5:59am

Hi,
Here is the sample code:
[C#.NET]
using System.IO;
using System.Text;
using Aspose.Pdf.Facades;

// Bind the source PDF and extract its text
PdfExtractor pdfExtractor = new PdfExtractor();
pdfExtractor.BindPdf("input.pdf");
pdfExtractor.ExtractText();
// Write the extracted text to a memory stream, then read it back as a string
MemoryStream tempMemoryStream = new MemoryStream();
pdfExtractor.GetText(tempMemoryStream);
string text = "";
using (StreamReader streamReader = new StreamReader(tempMemoryStream, Encoding.Unicode))
{
    streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
    text = streamReader.ReadToEnd();
}
File.WriteAllText("output_aspose.txt", text, Encoding.UTF8);

Thank you in advance.

codewarior · August 17, 2011, 1:27pm

Hello Nuch,

Thanks for sharing the resource files.

I have tested the scenario and am able to reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-29944. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

asad.ali · May 27, 2020, 10:54pm

@natnapaporn.thin

We would like to share with you that you can now extract entire paragraphs from PDF documents using Aspose.PDF for .NET.
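
As a rough illustration, paragraph-level extraction might look like the sketch below. It assumes a recent Aspose.PDF for .NET release that includes the ParagraphAbsorber class in the Aspose.Pdf.Text namespace; the post above does not name the API, so please verify the class and property names against the version you have installed. The file name "input.pdf" is a placeholder, and joining the lines of a paragraph with a space (rather than CRLF) is an assumption about the desired output, not something the library does for you.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ParagraphExtractionSketch
{
    static void Main()
    {
        // "input.pdf" is a placeholder path, as in the sample earlier in the thread.
        Document document = new Document("input.pdf");

        // Assumption: ParagraphAbsorber is available in the installed version.
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(document);

        foreach (PageMarkup markup in absorber.PageMarkups)
        {
            foreach (MarkupSection section in markup.Sections)
            {
                foreach (MarkupParagraph paragraph in section.Paragraphs)
                {
                    StringBuilder paragraphText = new StringBuilder();

                    // Each paragraph is a list of lines; each line is a list of text fragments.
                    foreach (List<TextFragment> line in paragraph.Lines)
                    {
                        foreach (TextFragment fragment in line)
                        {
                            paragraphText.Append(fragment.Text);
                        }
                        // Join lines with a space so the paragraph is not split by CRLF.
                        paragraphText.Append(' ');
                    }

                    // Print one paragraph per block, separated by a blank line.
                    Console.WriteLine(paragraphText.ToString().TrimEnd());
                    Console.WriteLine();
                }
            }
        }
    }
}
```

Because line breaks are only emitted between paragraphs, each paragraph from the PDF should come out as a single line of text, which is the behaviour asked for at the start of this thread.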