Free Support Forum - aspose.com

PDF to DOCX Excessive Tags Generated

Hello,

While using the toolkit to convert a PDF file to DOCX, I found that the result looks good in DOCX itself, but there is an issue with the generated DOCX file when I filter it with the Okapi OpenXML filter. The generated .xlf file contains a lot of inline tags that break sentences into fragments, which normally isn't the case for a regular DOCX file.
For example, a sentence in the generated DOCX like “Hello, how are you doing today” will become something like
<source xml:lang="enus"><g id="1">Hello,</g><g id="2">how</g><g id="3">are</g><g id="4">you</g><g id="5">doing</g><g id="6">today</g></source>
in the filtered file instead of
<source xml:lang="enus">Hello, how are you doing today</source>

I think these tags are somehow added to the file during the conversion from PDF to DOCX. They are not visible when viewing the DOCX file directly, but they do show up in the file filtered by Okapi.
I am wondering whether you are able to help with this situation.

Thank you!

@qs1,

Thanks for contacting support.

Can you please share the source files along with sample code so that we can investigate further and help you out?

test.pdf (2.1 MB)
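I used the above file as the source PDF file and tried to convert it to DOCX using the following code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import com.aspose.pdf.DocSaveOptions;

final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
final com.aspose.pdf.Document asposeDocument = 
    new com.aspose.pdf.Document(new ByteArrayInputStream(sourceDocumentBytes));
final DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
// Pass the options to save(); otherwise the DOCX format setting is never applied.
asposeDocument.save(outputStream, saveOptions);
// sourceDocument is our own wrapper object that is later handed to the Okapi filter.
sourceDocument.setSourceDocument(new ByteArrayInputStream(outputStream.toByteArray()));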

The conversion looked pretty good in DOCX format but with the issue described above.

It looks like I am not able to upload the generated DOCX file because this forum does not support that format.

@qs1,

Thank you for sharing source code with us.

I have worked with the source code and sample file you shared and am unable to observe the issue. Can you please share a comparison screenshot so that we can investigate this issue further? I have also shared my generated result for your reference.

t11est_13.zip (435.5 KB)

Hi @Adnan.Ahmad,

Thank you for your response! The code actually produces the same result as yours on my side. There is no big issue when I view it as a DOCX file. However, when I further filter it using the Okapi document filter https://okapiframework.org/wiki/index.php/Filters, I expect something like

Screen Shot 2020-06-01 at 2.56.06 PM.png (226.7 KB)

but I am getting something like

Screen Shot 2020-06-01 at 2.57.34 PM.png (248.0 KB)

You can see that there are a lot of <g> tags breaking the sentences into parts, which is not desirable for us. I know that this is not an issue when viewing the file just in DOCX format, but it would be great if you could help with this situation.

By the way, I am pretty sure that these tags originated from Aspose PDF because I converted the generated DOCX file into a zip file and uncompressed it to see the .xml source of it. In the .xml source, there are tags breaking sentences into small parts already, matching the <g> tags pattern in the filtered file, which explains why the filtered file from Okapi has the tags.
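
For anyone who wants to repeat that check without unzipping by hand, here is a minimal sketch using only the Java standard library. It reads word/document.xml from the DOCX and counts the <w:r> runs in each <w:p> paragraph; a paragraph with roughly one run per word shows the same fragmentation that Okapi reports as <g> tags. The class name DocxRunCounter and its method names are hypothetical and are not part of Aspose or Okapi.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not an Aspose or Okapi API.
final class DocxRunCounter {
    // Standard WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A DOCX file is a ZIP archive; the body text lives in word/document.xml.
    static byte[] readDocumentXml(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return zip.readAllBytes(); // reads only the current entry
                }
            }
            throw new IOException("word/document.xml not found");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the number of <w:r> runs found in each <w:p> paragraph, in document order.
    // Runs nested deeper inside a paragraph (for example in a text box) are counted too.
    static List<Integer> runsPerParagraph(byte[] documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml));
            NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
            List<Integer> counts = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element paragraph = (Element) paragraphs.item(i);
                counts.add(paragraph.getElementsByTagNameNS(W_NS, "r").getLength());
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document.xml", e);
        }
    }

    // Usage: java DocxRunCounter.java converted.docx
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(runsPerParagraph(readDocumentXml(in)));
        }
    }
}
```

Running this on both the converted file and a DOCX authored directly in Word should make the difference in run counts easy to compare.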

Thanks!

@qs1,

I have observed your issue and would like to inform you that I have created an investigation ticket with ID PDFNET-48342 in our issue tracking system to investigate and resolve this issue as soon as possible.

Hi @Adnan.Ahmad,

Thank you!

@qs1

You are welcome.