Bullet style is missing when converting from PDF to Word

Hi guys, I’m having trouble converting bullets from a PDF to a Word file. Here is my code:

using (Stream pdf = /* get PDF file from blob container */)
using (var pdfDocument = new Aspose.Pdf.Document(pdf))
using (MemoryStream dstStream = new MemoryStream())
{
    AsposePDF.DocSaveOptions saveOption = new AsposePDF.DocSaveOptions();
    saveOption.Format = AsposePDF.DocSaveOptions.DocFormat.DocX;
    saveOption.Mode = AsposePDF.DocSaveOptions.RecognitionMode.Textbox;
    saveOption.RecognizeBullets = true;

    pdfDocument.Save(dstStream, saveOption);
    fileDoc = new Document(dstStream);
    fileDoc.UpdateListLabels();
}

With this code, the bullets appear as expected. However, the problem shows up when I then use:

DocumentBuilder builder = new DocumentBuilder(fileDoc);
builder.MoveToHeaderFooter(HeaderFooterType.HeaderPrimary);

The bullets are sometimes not rendered as expected. It doesn’t always happen, which makes it really difficult for us to produce a reliable result.

Aspose version:
Word: 22.11.0
PDF: 22.11.0
@loc.nguyen Could you please attach your input, intermediate and output documents here for testing? Unfortunately, it is not possible to analyze the issue without the real documents. We will check the issue and provide you with more information.

@alexey.noskov Here are the PDF file and the Word output file. Because it contains some private information, I had to redact parts of the PDF file. Sorry for the inconvenience.

test.pdf (162.2 KB)
COMP-593.docx (1.7 MB)

@loc.nguyen Thank you for the additional information. Unfortunately, I cannot reproduce the problem on my side. I used the following code for testing:

// Convert PDF to DOCX using Aspose.PDF
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"C:\Temp\in.pdf");
Aspose.Pdf.DocSaveOptions saveOption = new Aspose.Pdf.DocSaveOptions();
saveOption.Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX;
saveOption.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Textbox;
saveOption.RecognizeBullets = true;
pdfDocument.Save(@"C:\Temp\tmp.docx", saveOption);

// Postprocess DOCX document using Aspose.Words
Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\tmp.docx");
Aspose.Words.DocumentBuilder builder = new Aspose.Words.DocumentBuilder(doc);
builder.MoveToHeaderFooter(HeaderFooterType.HeaderPrimary);
builder.Write("This is my header");
doc.Save(@"C:\Temp\out.docx");

Is the attached DOCX document produced by Aspose.PDF, or is it the document after postprocessing with Aspose.Words? I do not see the generator name in the attached document.

The DOCX file is the document after postprocessing with Aspose.Words. As I mentioned, it doesn’t always happen; the issue only occurs in some cases, and I don’t know how to fix it.

@loc.nguyen Unfortunately, I cannot reproduce the problem on my side. Could you please attach the DOCX document produced by Aspose.Pdf on your side, before postprocessing it using Aspose.Words?

Sorry, my mistake. I found where the issue comes from.
Based on the thread referenced below, the font name was read as “UDVTBH+Symbol”, so I changed the font of the bullets to “Symbol”. However, the bullets did not render as expected with the font named “Symbol”. When I changed the font name from “Symbol” to “Ymbol”, it worked as expected. I’m not sure why this happens. Could you please help me investigate?
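For reference, a name like “UDVTBH+Symbol” follows the PDF convention for subset fonts: a tag of six uppercase letters, a plus sign, then the base font name. The renaming step I describe above can be sketched roughly as follows. The class and method names here are made up for illustration, and this is only an approximation of what I did, not the exact code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontNameHelper
{
    // Matches the PDF subset tag: exactly six uppercase letters followed by '+'.
    private static readonly Regex SubsetTag =
        new Regex(@"^[A-Z]{6}\+", RegexOptions.Compiled);

    // Returns the base font name, e.g. "UDVTBH+Symbol" -> "Symbol".
    // Names without a subset tag are returned unchanged.
    public static string StripSubsetTag(string fontName)
    {
        if (string.IsNullOrEmpty(fontName))
            return fontName;

        return SubsetTag.Replace(fontName, string.Empty);
    }
}
```

With Aspose.Words this would be applied per run or per list level, for example `run.Font.Name = FontNameHelper.StripSubsetTag(run.Font.Name);`.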

Aspose ref: Why the font-family is some string+ArialMT when extract PDF to HTML?

Image while setting font for the bullets:

Image result when the font is “Symbol”

Image when I edit the formatting to “ymbol”

The word output file
Bullets Issue.docx (1.7 MB)

@loc.nguyen This looks like a known peculiarity. The Windows “Symbol” font is a symbolic font (like “Webdings”, “Wingdings”, etc.) that uses the Unicode Private Use Area (PUA). The macOS or Linux “Symbol” font, on the other hand, is a proper Unicode font (for example, Greek characters are in the U+0370…U+03FF Greek and Coptic block). So these fonts are incompatible, and the Mac/Linux “Symbol” font cannot be used in place of the Windows “Symbol” font without additional actions. In your particular case the bullet is represented as U+2022, but in Windows “Symbol” it is PUA U+F0B7 (or U+00B7, which can also be used in MS Word for symbolic fonts). So you should change the U+2022 character to U+00B7:

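using System.Collections.Generic;
using System.Linq;
using Aspose.Words;

Document doc = new Document(@"C:\Temp\in.docx");

// Collect all runs formatted with the "Symbol" font.
List<Run> items = doc.GetChildNodes(NodeType.Run, true).Cast<Run>()
    .Where(r => r.Font.Name == "Symbol").ToList();

foreach (Run r in items)
{
    // Replace the Unicode bullet (U+2022) with U+00B7,
    // which MS Word renders as a bullet in the Windows "Symbol" font.
    if (r.Text == "\u2022")
        r.Text = "\u00b7";
}

doc.Save(@"C:\Temp\out.docx");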
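If the bullets were recognized as real lists rather than plain runs, the bullet character is stored in the list level definition, so a run-based loop would not see it. Below is a minimal sketch for that case, assuming the same U+2022 to U+00B7 mapping applies. The class and method names are hypothetical, and this variant was not tested against the attached document:

```csharp
using Aspose.Words;
using Aspose.Words.Lists;

public static class SymbolBulletFixer
{
    // Rewrites bullet list levels that use the "Symbol" font with U+2022
    // so that they use U+00B7 instead. Returns the number of levels changed.
    public static int FixListBullets(Document doc)
    {
        int changed = 0;

        foreach (Aspose.Words.Lists.List list in doc.Lists)
        {
            foreach (ListLevel level in list.ListLevels)
            {
                bool isSymbolBullet =
                    level.NumberStyle == NumberStyle.Bullet &&
                    level.Font.Name == "Symbol" &&
                    level.NumberFormat == "\u2022";

                if (isSymbolBullet)
                {
                    level.NumberFormat = "\u00b7";
                    changed++;
                }
            }
        }

        return changed;
    }
}
```

Call `SymbolBulletFixer.FixListBullets(doc)` before `doc.Save(...)`. A return value of 0 suggests the bullets are stored as runs, in which case only the run-based loop applies.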

Thank you for your support. It seems my issue has been fixed, but I still have one more question: will the code you suggested above work well on both macOS/Linux and Windows devices?

@loc.nguyen I am afraid there is no universal solution for macOS/Linux and Windows devices, due to the peculiarities of the “Symbol” fonts on different platforms that I mentioned in my previous answer.
To get identical results on different platforms, you can embed the fonts into the documents, but this will dramatically increase the document size.
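A minimal sketch of the font embedding option in Aspose.Words, using the `FontInfoCollection` properties. The wrapper class and method names are hypothetical. It assumes the fonts are available on the machine that saves the document and that their licenses permit embedding:

```csharp
using Aspose.Words;
using Aspose.Words.Fonts;

public static class FontEmbedding
{
    // Turns on TrueType font embedding for the given document.
    public static void EnableEmbedding(Document doc)
    {
        FontInfoCollection fontInfos = doc.FontInfos;

        // Embed the TrueType fonts used by the document when it is saved.
        fontInfos.EmbedTrueTypeFonts = true;

        // Also embed system fonts; this only has an effect when
        // EmbedTrueTypeFonts is true.
        fontInfos.EmbedSystemFonts = true;

        // Embed only a subset of each font instead of the whole font,
        // which should limit the growth in file size.
        fontInfos.SaveSubsetFonts = true;
    }
}
```

Call `FontEmbedding.EnableEmbedding(doc)` before `doc.Save(...)`. Whether the result renders identically in every viewer on every platform is not something this sketch guarantees, so it is worth checking on the target devices.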