How to remove text boxes and keep the text when converting PDF to DOC?

ducaisoft · August 8, 2020, 9:43am

Hi, Support:

How can I configure the PDF to DOC conversion so that it removes text boxes and keeps the text?
My other queries are:

If two pictures overlap in the PDF, they are handled as one picture after conversion to DOC. Why?
Why were some tables converted to pictures?
Rotated text in the PDF was converted to a picture in the DOC. Why?
The bottom header line in the PDF (generated by MS Office Word) was converted to a picture, not a header bottom line. Why?
Each paragraph in the PDF was converted to a text box. Why? Could the DLL convert them to true text paragraphs instead of text inside text boxes?
Can the DLL convert inline images in the PDF to InlineShape images in Word? Currently the converter converts all images to floating shapes.

How can these issues be solved or fixed?

Thanks for your help!

asad.ali · August 10, 2020, 5:19pm

@ducaisoft

In order to observe and address all these issues, we need a sample PDF document from your side. Would you kindly share it with us so that we can test the scenario in our environment and address it accordingly?

ducaisoft · August 10, 2020, 6:21pm

Test.pdf (397.3 KB)

asad.ali · August 11, 2020, 6:07pm

@ducaisoft

We were able to observe five of the mentioned issues in our environment and logged them under the ticket ID PDFNET-48635 in our issue tracking system. We are currently working on implementing a new PDF to Word conversion engine and will surely pay attention to the issues you mentioned. Please give us some time.

To avoid the text box issue, you can use RecognitionMode.Flow instead of TextBox, as follows:

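using Aspose.Pdf;

// dataDir is the folder that contains Test.pdf
using (Document pdfFile = new Document(dataDir + "Test.pdf"))
{
    pdfFile.Save(dataDir + "output.docx", new DocSaveOptions()
    {
        // Write the output as DOCX
        Format = DocSaveOptions.DocFormat.DocX,
        // Flow mode produces flowing paragraphs instead of text boxes
        Mode = DocSaveOptions.RecognitionMode.Flow,
        RecognizeBullets = true
    });
}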

We are sorry for the inconvenience.

ducaisoft · August 11, 2020, 9:48pm

Thanks for your reply.
By using
Mode = DocSaveOptions.RecognitionMode.Flow,
RecognizeBullets = true

There are no text boxes in the converted DOC. However, a return has been added at the end of each line within a paragraph. Can the addition of a return at the end of each line be turned off?

I also have an idea for PDF to DOC conversion: extract every element on each PDF page, such as header/footer text, text color, font, and font size; each table (its row and column counts and the content of each cell); each paragraph's content and font size; the parameters of each image (inline or floating, behind or in front of text, watermark, stamp, position, size); and any other objects and attachments. Then use the extracted items to generate a new DOC. With this approach, the redundant returns at the end of each line within a paragraph would be removed, the overlapped images could be separated, the tables could be handled as image…

Also, even when I set DocSaveOp.AddReturnToLineEnd = False, too many returns are still added at the end of each line. Is this a bug?

asad.ali · August 12, 2020, 6:47pm

@ducaisoft

We have updated the issue information accordingly and have noted your suggestions as well. We will inform you as soon as we have some updates regarding ticket resolution. Please give us some time.

We are sorry for the inconvenience.