Difference in behaviour between Aspose.Pdf.Kit.PdfExtractor and Aspose.Pdf.Facades.PdfExtractor

RobRSA · October 1, 2014, 5:57am

Hi

When I use Aspose.Pdf.Facades.PdfExtractor to extract text from a PDF file and save it as a DOCX file, an empty line is inserted after every line of text. If I implement this using Aspose.Pdf.Kit.PdfExtractor, I do not have that problem. Please compare the attached files to see the effect. I am not able to use Aspose.Pdf.Kit for our live product. What can I do to fix this?

    SetPdfKitAsposeLicence();
    //SetPdfAsposeLicence();

    //Instantiate PdfExtractor object
    var extractor = new Aspose.Pdf.Kit.PdfExtractor { Password = "" };
    //var extractor = new Aspose.Pdf.Facades.PdfExtractor { Password = "" };

    using (var msIn = new MemoryStream(inFile))
    {
        using (var msOut = new MemoryStream())
        {
            //Bind the input PDF document to extractor
            extractor.BindPdf(msIn);
            extractor.ExtractTextMode = 1;

            //Extract text from the input PDF document
            extractor.ExtractText();

            //Write the extracted text to the output stream
            extractor.GetText(msOut);
            msOut.Position = 0;

            var sr = new StreamReader(msOut);
            string pdfText = sr.ReadToEnd();
            return pdfText;
        }
    }

Regards Rob

tilal.ahmad · October 1, 2014, 11:49pm

Hi Robert,

Thanks for your inquiry. I am afraid I was unable to replicate the issue while testing the scenario with Aspose.Pdf for .NET 9.6.0. Please download and try the latest version of Aspose.Pdf for .NET; it should help you accomplish the task.

Moreover, if you are not applying any extra processing to the extracted text, you may try converting the PDF directly to DOC(X) with Aspose.Pdf, instead of converting PDF to text and then to DOC(X).
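
If you need to stay on the text-extraction route in the meantime, one possible stopgap is to tidy the line breaks in the extracted string before it goes into the Word document. The sketch below assumes the extra empty lines come from doubled or mixed line terminators in the extracted text, which has not been confirmed for your file. ExtractedTextCleanup and CollapseBlankLines are hypothetical names, not part of any Aspose API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractedTextCleanup
{
    // Hypothetical helper (not an Aspose API): normalises line terminators
    // and then halves every pair of consecutive line breaks.
    public static string CollapseBlankLines(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        // Map "\r\n" and any lone "\r" to "\n" so all breaks look the same.
        string normalised = Regex.Replace(text, @"\r\n?", "\n");

        // Replace each non-overlapping pair of breaks with a single break.
        // Assumption: the unwanted empty line is one extra break per line.
        return normalised.Replace("\n\n", "\n");
    }
}
```

You would call it as `string cleaned = ExtractedTextCleanup.CollapseBlankLines(pdfText);` before passing the text on. Because pairs are halved rather than whole runs collapsed, a genuine paragraph gap that was doubled into four breaks is kept as two.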

Please feel free to contact us for any further assistance.

Best Regards,

RobRSA · October 2, 2014, 2:07am

Hi

Thanks for the advice.
We are using version 9.4.0.

It works okay if I save the PDF as a DOCX, but I notice that bullets are not displayed next to their respective text,
e.g.:

• <o:p></o:p>

Overview of content types

Here is my code:

    SetPdfAsposeLicence();

    var pdfStream = new MemoryStream(inFile);
    var pdfDocument = new Aspose.Pdf.Document(pdfStream);
    var documentStream = new MemoryStream();
    pdfDocument.Save(documentStream, Aspose.Pdf.SaveFormat.DocX);

    byte[] docBytes = new byte[documentStream.Length];
    documentStream.Position = 0;
    documentStream.Read(docBytes, 0, (int)documentStream.Length);
    return docBytes;

I have however been able to meet my requirements by using the code below:

    private static byte[] ConvertPdfToWord(byte[] inFile)
    {
        var documentContents = GetPdfText(inFile);

        using (var m = new MemoryStream())
        {
            var doc = new Aspose.Words.Document();
            var builder = new DocumentBuilder(doc);

            // Write a single space when no text was extracted
            builder.Writeln(string.IsNullOrEmpty(documentContents) ? " " : documentContents);

            var opt = new Aspose.Words.Saving.OoxmlSaveOptions(Aspose.Words.SaveFormat.Docx)
            {
                Compliance = OoxmlCompliance.Iso29500_2008_Transitional
            };

            doc.Save(m, opt);
            return m.ToArray();
        }
    }

    private static string GetPdfText(byte[] inFile)
    {
        SetPdfAsposeLicence();

        MemoryStream documentStream = new MemoryStream(inFile);
        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(documentStream);
        System.Text.StringBuilder builder = new System.Text.StringBuilder();

        TextExtractionOptions textExtOptions =
            new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
        TextDevice textDevice = new TextDevice();

        // Extract the text page by page and append it to the builder
        foreach (Page pdfPage in pdfDocument.Pages)
        {
            using (var textStream = new MemoryStream())
            {
                textDevice.ExtractionOptions = textExtOptions;
                textDevice.Process(pdfPage, textStream);
                textStream.Close();
                builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
            }
        }

        return builder.ToString();
    }

Thanks and Regards
Rob

tilal.ahmad · October 2, 2014, 11:40pm

Hi Robert,

Thanks for your feedback. Please share the source PDF in which you are facing the bullet formatting issue. We will look into it and guide you accordingly.

Moreover, I have tested with Aspose.Pdf for .NET 9.6.0 and Aspose.Words for .NET 14.8.0 and was unable to reproduce the issue. Please download and try the latest versions of the Aspose APIs; this should resolve the issue.
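
If you would like to experiment with the direct conversion in the meantime, a minimal sketch follows. It assumes your installed Aspose.Pdf version exposes DocSaveOptions with the Mode and RecognizeBullets properties; please check this against your version, as it has not been tested against your file or on 9.4.0. ConvertPdfToDocx and PdfToDocxSketch are hypothetical names, and licence setup is omitted.

```csharp
using System.IO;
using Aspose.Pdf;

public static class PdfToDocxSketch
{
    // Hypothetical helper: converts PDF bytes to DOCX bytes with Aspose.Pdf.
    // Assumes DocSaveOptions.Mode and RecognizeBullets exist in your version.
    public static byte[] ConvertPdfToDocx(byte[] inFile)
    {
        using (var pdfStream = new MemoryStream(inFile))
        using (var docxStream = new MemoryStream())
        {
            var pdfDocument = new Document(pdfStream);

            var saveOptions = new DocSaveOptions
            {
                // Produce DOCX rather than the default DOC output
                Format = DocSaveOptions.DocFormat.DocX,
                // Flow mode favours editable, reflowing text over exact layout
                Mode = DocSaveOptions.RecognitionMode.Flow,
                // Ask the converter to detect bullets in the source PDF
                RecognizeBullets = true
            };

            pdfDocument.Save(docxStream, saveOptions);
            return docxStream.ToArray();
        }
    }
}
```

Whether bullet recognition changes the output for a given file depends on how the bullets are stored in the PDF, so the source document is still needed to confirm the cause.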

Best Regards,