Aspose PDF to Text: special characters issue

Muzna_Tariq · August 25, 2017, 12:30pm

Hi,

I am converting a PDF file to text and have observed that some special characters exist in the output text file.

Code Snippet:

        // Requires: using System.IO; using Aspose.Pdf;
        string dataDir = @"E:\TestCases\";
        foreach (string file in Directory.EnumerateFiles(dataDir, "*.pdf"))
        {
            Document pdfDocument = new Document(file);
            // create TextAbsorber object to extract text
            Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
            // accept the absorber for all the pages
            pdfDocument.Pages.Accept(textAbsorber);
            // get the extracted text
            string extractedText = textAbsorber.Text;
            // build the output path: <dataDir>\PDFtoText\<name>.txt
            string fileName = Path.ChangeExtension(file, "txt");
            string outputPath = Path.Combine(dataDir, "PDFtoText", Path.GetFileName(fileName));
            // create a writer; the using block closes the stream even if writing fails
            using (TextWriter tw = new StreamWriter(outputPath))
            {
                // write the extracted text to the file
                tw.WriteLine(extractedText);
            }
        }

Kindly investigate the issue and help me fix it. I have tested it in Araxis Merge.

Please find the attached document, containing the input PDF file, the output text file, and a snapshot of the output file in Araxis Merge, at this link.

Thanks.

asad.ali · August 25, 2017, 1:59pm

@Muzna_Tariq

Thanks for contacting support.

I compared your shared TXT file with the respective PDF (test.pdf) in the Araxis Merge utility and was unable to find any special characters in the text file. For your reference, I have also attached a screenshot showing the comparison results and the version details of the utility.

Araxis_Comparison.png (95.1 KB)

This appears to be specific to your environment. Would you please try copying the text from the TXT file into MS Word and check whether the special characters still appear? Please share the results with us, along with details of your OS version, locale settings, etc., so that we can try to replicate the scenario in our environment and address it accordingly.

Muzna_Tariq · August 28, 2017, 7:20am

@asad.ali
Can you please share the converted text file? I want to test it in my environment as well.

Environment details:
Operating System: Windows 10
Processor: Intel core-i5-2300
RAM: 8.0GB
System Type: 64 bit Operating system

I copied the content of the TXT file into MS Word and no special characters appeared. However, when I open the converted TXT file in Notepad++ (version 7.5), the special characters do appear.

Please find the attached snapshot.
snapshot.png (40.7 KB)
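
To pin down which characters show up as "special" in one viewer but not another, it can help to list every non-ASCII character in the extracted text together with its Unicode code point. The following is only a minimal sketch: `TextDiagnostics` and `DescribeNonAscii` are hypothetical names that are not part of Aspose.PDF, and the string passed in is assumed to be the value read from `textAbsorber.Text`.

```csharp
using System;
using System.Collections.Generic;

public static class TextDiagnostics
{
    // Hypothetical helper (not an Aspose API): returns one entry per
    // non-ASCII UTF-16 code unit, giving its index and code point.
    public static List<string> DescribeNonAscii(string text)
    {
        var findings = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c > 127)
            {
                // Format the code point as U+XXXX (four hex digits).
                findings.Add($"index {i}: U+{(int)c:X4}");
            }
        }
        return findings;
    }
}
```

Calling `TextDiagnostics.DescribeNonAscii(extractedText)` inside the loop and printing the entries would show whether the text contains characters such as typographic quotes or non-breaking spaces, which different editors may render differently.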

asad.ali · August 28, 2017, 11:39am

@Muzna_Tariq

Thanks for contacting support.

I tested the scenario in the same environment you described and was able to see some special characters when I copied the text into Notepad++ v7.5. However, when I converted the encoding from UTF-8 to ANSI, the resulting text was fine. For your reference, I have attached screenshots as well.

Converted_Encoding.png (41.1 KB)
Convert_to_ANSI.png (45.6 KB)

It seems that the text in the PDF is encoded, and you need to change the encoding in order to view the text correctly. Furthermore, as this also appears to be specific to this document, would you please confirm whether you are experiencing the issue with all PDF documents or only this one? We will log an investigation ticket if necessary and address it accordingly.
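
As a general illustration of how an encoding mismatch can produce stray characters, the sketch below decodes the same UTF-8 bytes in two ways and also writes a file with an explicit UTF-8 byte order mark so that editors do not have to guess the encoding. It is not a confirmed diagnosis of this particular document: `EncodingDemo`, `DecodeBothWays`, and `WriteUtf8WithBom` are hypothetical names, and ISO-8859-1 is used here only as a stand-in for an "ANSI" code page.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Encodes text as UTF-8, then decodes those bytes both as UTF-8 and
    // as ISO-8859-1 to show how a wrong guess changes what is displayed.
    public static (string AsUtf8, string AsLatin1) DecodeBothWays(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        string asUtf8 = Encoding.UTF8.GetString(utf8Bytes);
        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
        return (asUtf8, asLatin1);
    }

    // Writes text as UTF-8 with a byte order mark (EF BB BF) at the start
    // of the file, which most editors use to detect the encoding.
    public static void WriteUtf8WithBom(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
    }
}
```

For example, `DecodeBothWays("\u00E9")` returns the original character in `AsUtf8` but two unrelated characters in `AsLatin1`. If the output file were written with `WriteUtf8WithBom` in place of the plain `StreamWriter`, it would be possible to check whether Notepad++ still shows the special characters once the encoding is no longer ambiguous.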