PDF to DOCX C# - Text losing Font Styles when converted to DocX

(Note: I believe this is the same issue as the one reported in "Aspose.Pdf loses text styles, such as bold and other font styles, during converting pdf to docx format", but that thread was for the Java version and this one is for the C#/.NET version. I wanted to confirm that the issue is present in both and that both need to be fixed, since I'm not sure how each version is handled.)

When I convert a PDF file to a DocX file, the bold and italic font styles are lost. Here is my code:

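    // Namespaces this snippet relies on (these go at the top of the file):
    using System.IO;        // Directory, MemoryStream
    using Aspose.Pdf;       // DocSaveOptions
    using Aspose.Pdf.Text;  // TextFragmentAbsorber, TextFragment, TextSearchOptions, FontStyles

    // Members of the containing class:
    Aspose.Pdf.License pdfLicense = new Aspose.Pdf.License();
    Aspose.Words.License wordLicense = new Aspose.Words.License();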
    public Aspose.Pdf.Document PDFDocument { get; set; }
    public Aspose.Words.Document WordDocument { get; set; }
    public string DocumentTitle { get; set; }

    public void DoWork()
    {
        var dir = Directory.GetCurrentDirectory() + "/";
        var licenseFile = dir + "Aspose.Total.NET.lic";
        pdfLicense.SetLicense(licenseFile);
        wordLicense.SetLicense(licenseFile);

        PDFDocument = new Aspose.Pdf.Document(dir + "Equipping-Families.pdf");

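        // Find bracketed text with a regular expression
        // (TextSearchOptions(true) enables regex matching).
        var txtAbsorberStyle = new TextFragmentAbsorber(@"\[([A-Z|a-z|/|~].*?)]");
        txtAbsorberStyle.TextSearchOptions = new TextSearchOptions(true);

        PDFDocument.Pages.Accept(txtAbsorberStyle);

        // Mark every match as italic.
        foreach (TextFragment fragment in txtAbsorberStyle.TextFragments)
        {
            fragment.TextState.FontStyle = FontStyles.Italic;
        }

        // Save the intermediate PDF so the italics can be checked before conversion.
        PDFDocument.Save(dir + "Equipping-Families-Italics.pdf");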

        using (var wordStream = new MemoryStream())
        {
            DocSaveOptions saveOptions = new DocSaveOptions();
            saveOptions.Format = DocSaveOptions.DocFormat.DocX;
            saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
            saveOptions.RelativeHorizontalProximity = 2.5f;
            saveOptions.RecognizeBullets = true;
            PDFDocument.Save(wordStream, saveOptions);
            WordDocument = new Aspose.Words.Document(wordStream);
        }

        var options = new Aspose.Words.Saving.PdfSaveOptions();
        options.SaveFormat = Aspose.Words.SaveFormat.Pdf;
        options.ExportDocumentStructure = true;

        WordDocument.Save(dir + "Equipping-Families-Doc-To-PDF-Convert.pdf", options);
        WordDocument.Save(dir + "Equipping-Families-PDF-To-Doc-Convert.docx");
    }

There are two specific sections of text that I’m looking at:

  • There is bold text at the top of each page, which loses its bold font style on conversion
  • I added italics to several different text areas in the PDF, which also lose their italic font style on conversion

Here are the files that are being used/created:
Original PDF: Equipping-Families.pdf (235.5 KB)
PDF post-Italics: Equipping-Families-Italics.pdf (264.8 KB)
DocX post-Conversion: Equipping-Families-PDF-To-Doc-Convert.docx (3.0 MB)
PDF post-Conversion from DocX (I don’t think this is needed for the issue, but I'm adding it for completeness since the code does create it): Equipping-Families-Doc-To-PDF-Convert.pdf (520.3 KB)
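
To confirm the loss without opening the file in Word, the converted DocX can be inspected programmatically. The following is only a rough sketch, assuming the DocX produced by the code above sits in the working directory; the class name `FontStyleCheck` is hypothetical. It counts how many text runs Aspose.Words reports as bold or italic. If both counts are zero, the styles were dropped during the PDF to DocX step rather than by a viewer.

```csharp
using System;
using Aspose.Words;

// Hypothetical helper, not part of the original project.
class FontStyleCheck
{
    static void Main()
    {
        // Assumed path: the DocX written by DoWork() above.
        var doc = new Document("Equipping-Families-PDF-To-Doc-Convert.docx");

        int total = 0, bold = 0, italic = 0;

        // Walk every Run node in the document, including nested ones.
        foreach (Run run in doc.GetChildNodes(NodeType.Run, true))
        {
            total++;
            if (run.Font.Bold) bold++;
            if (run.Font.Italic) italic++;
        }

        Console.WriteLine($"Runs: {total}, bold: {bold}, italic: {italic}");
    }
}
```

This only looks at direct run formatting as exposed by `Run.Font`, so it says nothing about which font face was substituted during conversion.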

@calphius

The Java API is derived from the .NET version, so if the issue is present in Aspose.PDF for .NET, it will most likely exist in the Java API too. We have logged another ticket, PDFNET-49917, in our issue tracking system to rectify this issue. We will look into the details and let you know once it is fixed. Please bear with us in the meantime.

We are sorry for the inconvenience.

Hi @asad.ali, when I tested with both the Aspose.PDF Java and C# libraries, they had the same issue, yet the Aspose converter demo website does not.

I opened this question; it may help resolve @calphius’s issue.

@draftwise

We have posted a reply in the original thread that you created. Please follow up there.