Bold issue for some text while converting from PDF to DOC

Hi,

Using aspose-cells and aspose version 22.12, and aspose-pdf 22.9, some text has a bold-style problem after converting the PDF to DOCX.
bold.pdf (171.8 KB)
I am using the code below.

import com.aspose.pdf.*

def dataDir = "/Users/anil.maharjan/Documents/full_template/"
Document doc = new Document(dataDir + "bold.pdf");

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
doc.getPages().accept(textFragmentAbsorber);
Font arial_unicode_ms = FontRepository.findFont("Arial");
for (TextFragment textFragment : textFragmentAbsorber.getTextFragments())
{
    TextFragmentState textState = textFragment.getTextState();
    if ("Arial-BoldMT".equals(textState.getFont().getFontName()))
    {
        textState.setFont(arial_unicode_ms);
        textState.setFontStyle(FontStyles.Bold);
    }
}

DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);

doc.save(dataDir + "1.docx", saveOption);

@anilmhjn Your question is related to Aspose.PDF. I will move your request to the Aspose.PDF forum. My colleagues will help you shortly.

@anilmhjn,

Can you please tell me exactly which text has the issue with bold style?

I just ran your code with Aspose.PDF and the text converted correctly. I then ran it without the font replacement and the result also looked fine. I am attaching the output files I got.

Output with font change to Arial: WithArial_output.docx (150.2 KB)

Output without font change: OriginalFont_output.docx (172.8 KB)

Hi carlosmc,
Thank you for your quick response.

The Word conversion of the PDF is missing some bold text. I have attached documents to make this clear.

original.pdf (171.8 KB)
–> This is the original PDF that needs to be converted to DOCX.

original_using_font_conversion.docx (157.1 KB)

–> Using the code below, I am getting output where some text is missing.

original_without_using_font_conversion.docx (173.0 KB)
–> Without the font conversion, some text appears extra bold.

I expect the fonts in the DOCX to look the same as in the attached PDF.
Is there anything we are doing wrong in our code?

    License license = new License();
    license.setLicense("/Users/anil.maharjan/Documents/aspose/license-new/Aspose.Pdf.lic")

    def dataDir = "/Users/anil.maharjan/Documents/full_template/"

    Document doc = new Document(dataDir + "bold.pdf");

    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    doc.getPages().accept(textFragmentAbsorber);
    Font arial_unicode_ms = FontRepository.findFont("Arial");
    for (TextFragment textFragment : textFragmentAbsorber.getTextFragments())
    {
        TextFragmentState textState = textFragment.getTextState();
        if ("Arial-BoldMT".equals(textState.getFont().getFontName()))
        {
            textState.setFont(arial_unicode_ms);
            textState.setFontStyle(FontStyles.Bold);
        }
    }

    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);

    doc.save(dataDir + "bold.docx", saveOption);

PS: I am using a macOS system.

Screen Shot 2023-01-27 at 3.41.24 PM.png (18.7 KB)
extra bold.png (28.8 KB)

–> Extra bold can be seen in the DOCX.

Thanks,

@anilmhjn,

Your PDF document has embedded fonts. Those fonts must be installed on the machine that renders the document. This is very important, because saving as DOCX does not perform font substitution.

I tried your PDF document with .NET 23.1 and Java 22.12. I have the font installed on my machine.

DotNET_output.docx (150.2 KB)
Java_output.docx (151.3 KB)

I suggest a simple exercise: use another font that is installed on the rendering machine and see the result.
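As a quick way to see which font families the rendering machine offers, here is a minimal sketch in plain Java. It only asks the JVM (via `java.awt.GraphicsEnvironment`) for its font families; that is an assumption used as a rough proxy, and Aspose.PDF may resolve fonts through its own lookup, so a match here is a hint rather than a guarantee. The class name `InstalledFontCheck` and its helper methods are hypothetical names for illustration only.

```java
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.List;

class InstalledFontCheck {

    // Font families the JVM reports for this machine (not Aspose's own lookup).
    static List<String> installedFamilies() {
        return Arrays.asList(
                GraphicsEnvironment.getLocalGraphicsEnvironment()
                        .getAvailableFontFamilyNames());
    }

    // True if the wanted family is in the list, ignoring case.
    static boolean containsFamily(List<String> families, String wanted) {
        for (String family : families) {
            if (family.equalsIgnoreCase(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Family to look for; "Arial" is the font used in the code earlier in this thread.
        String wanted = args.length > 0 ? args[0] : "Arial";
        boolean found = containsFamily(installedFamilies(), wanted);
        System.out.println(wanted + (found ? " is" : " is NOT") + " reported by the JVM");
    }
}
```

If the family is not reported, installing it (or picking a family that is reported) before running the conversion is the exercise described above.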

Hi,

We were able to get correct rendering in the DOCX file by running the above code. However, it causes performance problems with large PDF files: replacing fonts takes up to 1 minute for some of them. Could you please check whether this is a known issue?

Pharmacy - Age Cohort-01-25-2023-1674610299196.pdf (278.1 KB)

Please check the extra time required to replace fonts when converting to DOCX.

Regards

@srijal
How long does it take you to process this file (“Pharmacy - Age Cohort-01-25-2023-1674610299196.pdf”)?

Hi,

Processing this file and replacing all the fonts takes 18.5 s.

index_1-02-06-2023-1675673545712.pdf (350.8 KB)

JVM memory: 8 GB
System: macOS Ventura
Version: 22.9 (Aspose.PDF)

Thanks

@anilmhjn
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-53636

You can obtain Paid Support services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@srijal
In the ticket I wrote:
“When checking on my system (Windows 10, 32 GB), the total time was 3.5 seconds, of which 2.5 seconds were spent converting PDF to DOCX.
Please check the performance and the possibility of improving it.”
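
To compare numbers like these between machines, it helps to time the font replacement and the DOCX save separately. Below is a minimal sketch of a phase timer in plain Java. `PhaseTimer` and `busyWork` are hypothetical names, and the two timed lambdas are stand-in workloads; in real use they would wrap the font-replacement loop and the `doc.save(...)` call from the code earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class PhaseTimer {

    // Elapsed nanoseconds per phase, kept in the order phases were first timed.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    // Runs the work and adds its wall-clock duration to the named phase.
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            nanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Seconds recorded for a phase; 0.0 if the phase was never timed.
    double seconds(String phase) {
        return nanos.getOrDefault(phase, 0L) / 1e9;
    }

    // One line per phase, e.g. "replace fonts: 0.012 s".
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanos.entrySet()) {
            sb.append(String.format(Locale.ROOT, "%s: %.3f s%n",
                    e.getKey(), e.getValue() / 1e9));
        }
        return sb.toString();
    }

    // Stand-in workload so the sketch runs without any PDF library.
    static long busyWork(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i % 7;
        }
        return sum;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Replace these lambdas with the font loop and the save call.
        timer.time("replace fonts", () -> busyWork(2_000_000));
        timer.time("save as DOCX", () -> busyWork(5_000_000));
        System.out.print(timer.report());
    }
}
```

Reporting both phases together with the OS, JVM memory, and library version makes it easier to tell whether the extra time comes from the font replacement or from the conversion itself.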