Difference in behaviour between Aspose.Pdf.Kit.PdfExtractor and Aspose.Pdf.Facades.PdfExtractor

Hi

When I use Aspose.Pdf.Facades.PdfExtractor to extract text from a PDF file and save it as a DOCX file, an empty line is inserted after every line of text. If I implement the same thing with Aspose.Pdf.Kit.PdfExtractor, I do not have that problem. Please compare the attached files to see the effect. I am not able to use Aspose.Pdf.Kit in our live product. What can I do to fix this?

SetPdfKitAsposeLicence();
//SetPdfAsposeLicence();

// Instantiate PdfExtractor object
var extractor = new Aspose.Pdf.Kit.PdfExtractor { Password = "" };
//var extractor = new Aspose.Pdf.Facades.PdfExtractor { Password = "" };
using (var msIn = new MemoryStream(inFile))
{
    using (var msOut = new MemoryStream())
    {
        // Bind the input PDF document to the extractor
        extractor.BindPdf(msIn);

        extractor.ExtractTextMode = 1;

        // Extract text from the input PDF document
        extractor.ExtractText();

        // Write the extracted text to the output stream
        extractor.GetText(msOut);

        // Rewind the stream and read the text back as a string
        msOut.Position = 0;
        var sr = new StreamReader(msOut);

        string pdfText = sr.ReadToEnd();

        return pdfText;
    }
}

Regards Rob

Hi Robert,

Thanks for your inquiry. I am afraid I am unable to replicate the issue while testing the scenario with Aspose.Pdf for .NET 9.6.0. Please download and try the latest version of Aspose.Pdf for .NET; it should help you accomplish the task.

Moreover, if you are not applying any extra processing to the extracted text, you may try converting the PDF directly to DOC(X) using Aspose.Pdf, instead of converting PDF to text and then to DOC(X).
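
If upgrading is not possible straight away, the empty lines could also be removed from the extracted string before it is written to the DOCX. The following is only a minimal sketch of that idea, not a confirmed fix for the extractor behaviour: `ExtractedTextCleanup` and `CollapseBlankLines` are hypothetical names (they are not part of any Aspose API), and it assumes the unwanted lines are empty or contain only spaces and tabs.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Pdf: post-processes text
// returned by a PdfExtractor so that blank lines are dropped.
public static class ExtractedTextCleanup
{
    public static string CollapseBlankLines(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        // Normalise Windows (\r\n) and old Mac (\r) line endings to \n
        // so that a single pattern covers all cases.
        string normalised = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // A line break followed by one or more empty or whitespace-only
        // lines is replaced by a single line break.
        return Regex.Replace(normalised, @"\n(?:[ \t]*\n)+", "\n");
    }
}
```

It would be called on the extractor result, for example `pdfText = ExtractedTextCleanup.CollapseBlankLines(pdfText);`. Note that this also removes blank lines that were intentional paragraph gaps in the PDF, so it only suits cases where that loss is acceptable.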

Please feel free to contact us for any further assistance.

Best Regards,

Hi


Thanks for the advice.
We are using version 9.4.0.

It works okay if I save the PDF directly as a DOCX, but I notice that bullets are not displayed next to their respective text, e.g.:

•

Overview of content types



Here is my code:

SetPdfAsposeLicence();

var pdfStream = new MemoryStream(inFile);
var pdfDocument = new Aspose.Pdf.Document(pdfStream);

var documentStream = new MemoryStream();
pdfDocument.Save(documentStream, Aspose.Pdf.SaveFormat.DocX);

byte[] docBytes = new byte[documentStream.Length];
documentStream.Position = 0;
documentStream.Read(docBytes, 0, (int)documentStream.Length);

return docBytes;


I have however been able to meet my requirements by using the code below:

private static byte[] ConvertPdfToWord(byte[] inFile)
{
    var documentContents = GetPdfText(inFile);

    using (var m = new MemoryStream())
    {
        var doc = new Aspose.Words.Document();
        var builder = new DocumentBuilder(doc);

        builder.Writeln(string.IsNullOrEmpty(documentContents) ? " " : documentContents);

        var opt = new Aspose.Words.Saving.OoxmlSaveOptions(Aspose.Words.SaveFormat.Docx)
        {
            Compliance = OoxmlCompliance.Iso29500_2008_Transitional
        };

        doc.Save(m, opt);

        return m.ToArray();
    }
}


private static string GetPdfText(byte[] inFile)
{
    SetPdfAsposeLicence();

    MemoryStream documentStream = new MemoryStream(inFile);
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(documentStream);
    System.Text.StringBuilder builder = new System.Text.StringBuilder();
    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);

    TextDevice textDevice = new TextDevice();

    foreach (Page pdfPage in pdfDocument.Pages)
    {
        using (var textStream = new MemoryStream())
        {
            textDevice.ExtractionOptions = textExtOptions;
            textDevice.Process(pdfPage, textStream);
            textStream.Close();
            builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
        }
    }

    return builder.ToString();
}


Thanks and Regards
Rob

Hi Robert,


Thanks for your feedback. Please share the source PDF in which you are seeing the bullet formatting issue. We will look into it and guide you accordingly.

Moreover, I have tested with Aspose.Pdf for .NET 9.6.0 and Aspose.Words for .NET 14.8.0 and was unable to reproduce the issue. Please download and try the latest versions of the Aspose APIs; this should resolve the issue.

Best Regards,