Hello.
The library used here is Aspose.PDF.Drawing [25.6.0]; we tested with [25.5.0] as well and see the same result.
We run Linux containers based on the Amazon Linux 2023 image. When converting a PDF document to DOCX there, we get output that differs from what the same code produces under Windows.
Random words in the generated DOCX get spaces inserted mid-word. For example, the text "Thomas Edison" in the PDF becomes "T homas E dison " in the DOCX generated under Linux.
Attached is an archive containing the original PDF, "Thomas Edison.pdf", and the generated DOCX files (the file names include the version tested, 25.5.0 or 25.6.0):
Edison.7z (1010.0 KB)
The bigger problem is that hyperlinked words are also split, which produces many duplicated hyperlinks in the DOCX. This causes issues on our end when we extract the content and/or further convert the DOCX to HTML using Aspose.Words.
The conversion code used is the following:
using System;
using System.Reflection;
using Aspose.Pdf;

// Read the loaded Aspose.PDF assembly version so it can be appended to the output file name.
Assembly asposePdfAssembly = Assembly.GetAssembly(typeof(Aspose.Pdf.Document));
Version asposeVersion = asposePdfAssembly.GetName().Version;

var pdfFilePath = "/app/Thomas Edison.pdf";
var asposePdfSavePath = "/app/Thomas Edison";

using (var pdfDocument = new Aspose.Pdf.Document(pdfFilePath))
{
    var saveOptions = new DocSaveOptions
    {
        Format = DocSaveOptions.DocFormat.DocX,
        Mode = DocSaveOptions.RecognitionMode.Flow,
        RecognizeBullets = true,
        AddReturnToLineEnd = false,
        RelativeHorizontalProximity = 2.5f
    };

    try
    {
        pdfDocument.Save($"{asposePdfSavePath}{asposeVersion}.docx", saveOptions);
        Console.WriteLine("<> OK");
    }
    catch (Exception)
    {
        // Rethrow with "throw;" so the original stack trace is preserved.
        throw;
    }
}
Could you confirm whether this reproduces on your end, or whether we need to change something in our save options? Thank you!