While using the toolkit to convert a PDF file to DOCX, I found that the result looks fine when opened as a DOCX, but there is an issue with the generated file when I process it with the Okapi OpenXML filter. The generated .xlf file contains many inline tags that break sentences into fragments, which normally does not happen with a regular DOCX file.
For example, a sentence in the generated DOCX such as “Hello, how are you doing today” becomes something like
<source xml:lang="enus"><g id="1">Hello,</g><g id="2">how</g><g id="3">are</g><g id="4">you</g><g id="5">doing</g><g id="6">today</g></source>
in the filtered file instead of
<source xml:lang="enus">Hello, how are you doing today</source>
I think these tags are somehow added to the file during the PDF-to-DOCX conversion. Although they are not visible when the DOCX is opened directly, they do appear in the file filtered by Okapi.
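In case it helps to reproduce this, here is a minimal sketch for checking my guess. It assumes (this is not a confirmed diagnosis) that the extra tags come from the converter splitting each sentence into many separate text runs (`w:r` elements) inside `word/document.xml`. The function name `runs_per_paragraph` is only illustrative; it uses nothing beyond the Python standard library and counts the runs in every non-empty paragraph.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def runs_per_paragraph(docx):
    """Return (run_count, text) for each non-empty paragraph in a DOCX.

    `docx` is a path or a file-like object. A high run count for a short
    sentence suggests the text was fragmented, which is my assumed cause
    of the extra inline tags.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = para.findall(f".//{{{W_NS}}}r")
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            results.append((len(runs), text))
    return results
```

Running this on the converted DOCX and on a normally authored DOCX with the same sentence should show whether the converted file has noticeably more runs per paragraph.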
I am wondering whether you are able to help with this situation.