Wrong content after conversion to docx

softboy · June 4, 2024, 5:09am

Version 23.8:
Issue 1: characters were duplicated:
image.png (20.2 KB)
Issue 2: the data from several cells was merged into one cell:
image.png (17.7 KB)

Issue 3: two cells were merged into one:
image.png (20.2 KB)

PDF file:
#5993504（QWP002-PCO-B1）Vendor Gateway 2024.5.14（GZHL2404016509LW）-TRF.pdf (76.6 KB)

sergei.shibanov · June 4, 2024, 3:28pm

@softboy
I converted the file with the following code and compared the output DOCX document with the original PDF.

var doc = new Document(dataDir + "5993504.pdf");
doc.Save(dataDir + "5993504-out.docx");

I don't see the discrepancies described here (see screenshot).

What version of the library are you using?

softboy · June 5, 2024, 1:56am

We use version 23.8.

The following is what I tried with your latest version, 24.5.1:

DocSaveOptions saveOptions = new DocSaveOptions
{
    // Specify the output format as DOCX
    Format = DocSaveOptions.DocFormat.DocX,
    // Use the EnhancedFlow recognition mode
    Mode = DocSaveOptions.RecognitionMode.EnhancedFlow
};

image.png (28.0 KB)

sergei.shibanov · June 5, 2024, 12:28pm

@softboy
Thanks for the explanation. The problem reproduces with this version.

sergei.shibanov · June 5, 2024, 3:49pm

In this case, I believe the conversion was correct. If the words were separated by at least one space, one would expect them to be split into two cells. Here, however, a single cell for a word that takes up the entire width is the expected result, and I do not consider it wrong.

sergei.shibanov · June 5, 2024, 5:03pm

@softboy
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57355

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.