(Note: I believe this is the same issue as the one reported in "Aspose.Pdf loses text styles, such as bold and other font styles, during converting pdf to docx format". That report was for the Java version and this one is for the C#/.NET version, so I wanted to confirm that the issue is present in both and that both need to be fixed. I’m not sure how each version is handled.)
When I convert a PDF file to a DOCX file, the Bold and Italic font styles are lost. Here is my code:
Aspose.Pdf.License pdfLicense = new Aspose.Pdf.License();
Aspose.Words.License wordLicense = new Aspose.Words.License();

public Aspose.Pdf.Document PDFDocument { get; set; }
public Aspose.Words.Document WordDocument { get; set; }
public string DocumentTitle { get; set; }

public void DoWork()
{
    var dir = Directory.GetCurrentDirectory() + "/";
    var licenseFile = dir + "Aspose.Total.NET.lic";
    pdfLicense.SetLicense(licenseFile);
    wordLicense.SetLicense(licenseFile);

    PDFDocument = new Aspose.Pdf.Document(dir + "Equipping-Families.pdf");

    // Find bracketed text with a regular expression and mark it as italic.
    var txtAbsorberStyle = new TextFragmentAbsorber(@"\[([A-Z|a-z|/|~].*?)]");
    txtAbsorberStyle.TextSearchOptions = new TextSearchOptions(true);
    PDFDocument.Pages.Accept(txtAbsorberStyle);
    foreach (TextFragment fragment in txtAbsorberStyle.TextFragments)
    {
        fragment.TextState.FontStyle = FontStyles.Italic;
    }
    PDFDocument.Save(dir + "Equipping-Families-Italics.pdf");

    // Convert the PDF to DOCX in memory and load the result with Aspose.Words.
    using (var wordStream = new MemoryStream())
    {
        DocSaveOptions saveOptions = new DocSaveOptions();
        saveOptions.Format = DocSaveOptions.DocFormat.DocX;
        saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
        saveOptions.RelativeHorizontalProximity = 2.5f;
        saveOptions.RecognizeBullets = true;
        PDFDocument.Save(wordStream, saveOptions);
        WordDocument = new Aspose.Words.Document(wordStream);
    }

    // Save the converted document as both PDF and DOCX.
    var options = new Aspose.Words.Saving.PdfSaveOptions();
    options.SaveFormat = Aspose.Words.SaveFormat.Pdf;
    options.ExportDocumentStructure = true;
    WordDocument.Save(dir + "Equipping-Families-Doc-To-PDF-Convert" + ".pdf", options);
    WordDocument.Save(dir + "Equipping-Families-PDF-To-Doc-Convert" + ".docx");
}
There are two specific sections of text that I’m looking at:
- There is bold text at the top of each page which loses its Bold font style on conversion
- I added italics to several different text areas in the PDF which also lose their Italic Font Style on conversion
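For anyone who wants to check the loss without opening the DOCX by hand, here is a rough sketch (not part of my code above) that counts the runs in the converted document that still report bold or italic formatting. The `StyleCheck` class and `Report` method are names made up for illustration, and this assumes the styles would show up as `Run.Font.Bold` / `Run.Font.Italic` rather than only as a bold or italic font name.

```csharp
using System;
using System.Linq;
using Aspose.Words;

// Hypothetical helper for illustration; not part of the original repro code.
public static class StyleCheck
{
    // Loads the converted DOCX and counts runs that report bold or italic formatting.
    public static void Report(string docxPath)
    {
        var doc = new Document(docxPath);

        // Collect every Run node in the document, including those nested in tables or shapes.
        var runs = doc.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList();

        int bold = runs.Count(r => r.Font.Bold);
        int italic = runs.Count(r => r.Font.Italic);

        Console.WriteLine($"Runs: {runs.Count}, bold: {bold}, italic: {italic}");
    }
}
```

Calling `StyleCheck.Report(dir + "Equipping-Families-PDF-To-Doc-Convert.docx")` after the conversion should make it easy to see whether any bold or italic runs survived.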
Here are the files that are being used/created:
Original PDF: Equipping-Families.pdf (235.5 KB)
PDF post-Italics: Equipping-Families-Italics.pdf (264.8 KB)
DocX post-Conversion: Equipping-Families-PDF-To-Doc-Convert.docx (3.0 MB)
PDF post-Conversion from DocX (I don’t think this is needed for the issue, but I’m adding it for the sake of completeness since the code does create it): Equipping-Families-Doc-To-PDF-Convert.pdf (520.3 KB)