I am trying to convert an XPS file to DOCX so that the document can be edited in Microsoft Word.
Currently, converting from XPS to PDF with the Aspose.PDF library is working as expected. It produces a PDF that is searchable and text can be copied out from it.
However, trying similar code for XPS to WORD with the Aspose.PDF library is not working. It produces a DOCX file that contains pictures of the text that cannot be edited. I have also tried converting the XPS to PDF, then PDF to DOCX with the same results. I have also tried with Mode set to Flow and Textbox with the same results.
Can you please help me correct this issue, or help me understand when a DOCX output file can or cannot have editable text?
The code I am testing looks like:
Aspose.Pdf.XpsLoadOptions loadOptions = new Aspose.Pdf.XpsLoadOptions();
using (Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(xpsStream, loadOptions))
{
    Aspose.Pdf.DocSaveOptions docSaveOptions = new Aspose.Pdf.DocSaveOptions();
    docSaveOptions.Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX;
    docSaveOptions.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow;
    using (FileStream imageStream = new FileStream(outputPath, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        pdf.Save(imageStream, docSaveOptions);
    }
}
Thanks
samples.zip (9.9 MB)
Attached are the sample input XPS file and the resulting PDF and DOCX files. You can see the PDF has selectable text, while the DOCX renders the page as an image with no selectable text.
@garyhollfelder
Thank you for contacting support.
We have been able to reproduce the issue in our environment. A ticket with ID PDFNET-47176 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.
We are sorry for the inconvenience.
Hello,
Just following up to see if there are any fixes to this issue?
@garyhollfelder
We would like to inform you that the issue is still unresolved and we are still working on it. We will share an ETA as soon as possible.
@garyhollfelder
The text elements in the XPS file use OpenType font features, namely glyph specification by index. For technical reasons, such text is converted to vector graphics with invisible text laid over it. This happens during the XPS-to-PDF conversion, so when the PDF is then converted to DOCX, the invisible text is ignored and only the graphics remain.
A better approach is to convert directly from XPS to DOCX (avoiding the PDF stage) using the static Document.Convert() method:
Aspose.Pdf.XpsLoadOptions loadOptions = new Aspose.Pdf.XpsLoadOptions();
Aspose.Pdf.DocSaveOptions docSaveOptions = new Aspose.Pdf.DocSaveOptions();
docSaveOptions.Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX;
docSaveOptions.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Textbox;
using (Stream xps = File.OpenRead(testdata + "input_2019-10-23 15_17_17-2.xps"))
using (FileStream doc = new FileStream(testdata + "doc.docx", FileMode.Create, FileAccess.Write, FileShare.None))
    Document.Convert(xps, loadOptions, doc, docSaveOptions);
The ticket has been resolved in Aspose.PDF for .NET 20.7.
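For anyone who wants to confirm that a converted DOCX really contains editable text rather than page images, one option is to count the text runs in the document body. The following is only a minimal sketch that uses the .NET base class library, not Aspose.PDF. It assumes standard OOXML packaging (a word/document.xml part with w:t elements), and the class and method names DocxTextCheck and CountTextRuns are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Aspose.PDF.
public static class DocxTextCheck
{
    // WordprocessingML main namespace, which defines the w:t (text) element.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the number of non-blank w:t elements in word/document.xml.
    // A result of zero suggests the pages were written as images or
    // vector graphics only, with no editable text.
    public static int CountTextRuns(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a DOCX package?");

            using (Stream body = entry.Open())
            {
                XDocument xml = XDocument.Load(body);
                // Descendants also reaches text inside text boxes (Textbox mode).
                return xml.Descendants(W + "t")
                          .Count(t => !string.IsNullOrWhiteSpace(t.Value));
            }
        }
    }
}
```

It could be called on the output file, for example: using (var fs = File.OpenRead(outputPath)) Console.WriteLine(DocxTextCheck.CountTextRuns(fs)); Treat the result as a zero versus non-zero check only, since text boxes may be stored twice in the markup (once with a fallback copy) and so inflate the count.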