Convert PDF to Text with formatting - Urgent!

dewan.ishi · August 11, 2020, 2:39pm

Hi Team,

I am trying to convert a PDF to TXT, but when I convert it I lose the formatting. Is there a way to keep formatting such as bold and font size intact?

I am using the below code.

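using System;
using System.IO;
using Aspose.Pdf;

Document doc = new Document(@"C:\Users\abcd\Downloads\Ppart1\Batch 9.pdf");
Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
doc.Flatten(); // replace form fields with their values so they are extracted as page text
doc.Pages.Accept(textAbsorber);
// Split the extracted text into lines (the save call below still writes the full string)
string[] returnValue = textAbsorber.Text.Split(new string[] { System.Environment.NewLine }, StringSplitOptions.None);
File.WriteAllText(@"C:\Users\abcd\Downloads\TextFilesForPart1\Batch 9.txt", textAbsorber.Text);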

asad.ali · August 11, 2020, 9:36pm

@dewan.ishi

Could you please share your sample source PDF and the expected .txt file with us? Also, please let us know in which application or utility you want to view the output .txt file with all its formatting. We will test the scenario in our environment and address it accordingly.

dewan.ishi · August 13, 2020, 10:30am

Hi,

Regarding the above problem, we are trying to convert some forms into Excel, TXT, or any other format, so I wanted to retain bold and underline in the text file when I convert a PDF to TXT. Alternatively, is there a way to go through the PDF line by line, including scanned PDFs, and find out whether the text encountered is bold or underlined?
Is there any alternative approach in Aspose that I can use to convert both scanned and normal PDFs to a usable format with all the formatting intact?
Can I modify the code shared above and use a TextWriter to save the string array 'returnValue' to the .txt file, rather than saving the TextAbsorber text directly?

asad.ali · August 13, 2020, 8:46pm

@dewan.ishi

A plain .txt file cannot store formatting such as bold or font size. To retain formatting, you need to convert the PDF document into a file format that supports it, e.g. DOC/DOCX, Excel, HTML, etc. Please check the following article in the API documentation on converting PDF into the other file formats supported by Aspose.PDF.

Convert PDF to other file formats
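
As a rough illustration of the conversion mentioned above, the following is a minimal sketch that saves a PDF as DOCX so that bold, underline and font size can be carried over. The file names input.pdf and output.docx are placeholders, and the sketch assumes the Aspose.PDF for .NET package is referenced; the conversion article above covers the available save options in more detail.

```csharp
using Aspose.Pdf;

class PdfToDocxSketch
{
    static void Main()
    {
        // Placeholder paths (assumptions); replace them with your own files.
        Document doc = new Document("input.pdf");

        // DOCX can carry bold, underline and font size, which plain text cannot.
        doc.Save("output.docx", SaveFormat.DocX);
    }
}
```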

For scanned PDFs, you can convert the pages to images and then perform an OCR operation on the resulting images using Aspose.OCR.
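
The first half of that suggestion, rendering each page to an image, might look like the sketch below. It assumes a placeholder input file scanned.pdf, a 300 DPI resolution, and output names of the form page_N.png; none of these come from the original question. The OCR step itself is left out here because the Aspose.OCR calls depend on the version you have installed.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

class PdfPagesToImagesSketch
{
    static void Main()
    {
        // Placeholder input path (assumption).
        Document doc = new Document("scanned.pdf");

        // 300 DPI is an assumed value; higher resolutions usually help OCR accuracy.
        Resolution resolution = new Resolution(300);
        PngDevice pngDevice = new PngDevice(resolution);

        // Page indexes in Aspose.PDF start at 1.
        for (int pageNumber = 1; pageNumber <= doc.Pages.Count; pageNumber++)
        {
            using (FileStream imageStream = new FileStream($"page_{pageNumber}.png", FileMode.Create))
            {
                // Render one page into the PNG stream.
                pngDevice.Process(doc.Pages[pageNumber], imageStream);
            }
        }
    }
}
```

The resulting PNG files can then be passed to the OCR library of your choice.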

Once a string is returned from TextAbsorber, you can use whichever method is convenient for you to save it to a .txt file, including writing out a string array line by line.
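
For example, a small helper along these lines would split the extracted string into lines and write the array to disk. This is only a sketch using the standard System.IO classes; the ExtractedTextSaver class and its method names are hypothetical and not part of Aspose.PDF.

```csharp
using System;
using System.IO;

// Hypothetical helper (not an Aspose API) for saving extracted text line by line.
static class ExtractedTextSaver
{
    // Splits on Windows, Unix and old Mac line endings.
    // "\r\n" is listed first so a Windows line ending is not split twice.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }

    // Writes each line of the text to the given file.
    public static void SaveLines(string text, string path)
    {
        File.WriteAllLines(path, SplitLines(text));
    }
}
```

With the code from the first post, this could be called as ExtractedTextSaver.SaveLines(textAbsorber.Text, outputPath), where outputPath is the .txt file you want to create.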

If you have any further queries, please feel free to ask.