Unicode representation changes when converting DOCX to PDF

Rajesh_Mohanty · November 3, 2015, 8:35pm

Hi,

I am using Aspose 15.09 and the Aspose API to convert an Arabic document to PDF with the following code:

LoadOptions lk = new LoadOptions();

lk.setWarningCallback(new HandleDocumentWarnings());

lk.setEncoding("UTF-8");

Document doc = new Document("Hello.docx", lk);

PdfSaveOptions options = new PdfSaveOptions();

doc.save("asposePdf.pdf", options);

What I have observed is that the Unicode code points of the Arabic characters in the PDF differ from those in the source document. For example, these are the code points I get from the PDF:

ﻮ - Character unicode :65262

ﻫ - Character unicode :65259

The same characters in the source document (i.e. the DOCX file) have these code points:

و - Character unicode :1608

ه - Character unicode :1607

One thing to note is that when I convert the DOCX to PDF using Microsoft's "Save As", I get the same Unicode values as in the DOCX.

I expect the PDF generated by the Aspose API to behave the same way as the one Microsoft produces.
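For reference, the values reported above can be checked against the Unicode blocks they fall in. The following is a minimal sketch using only the Java standard library; the class and method names (`CodePointInfo`, `describe`, `isArabicPresentationForm`) are made up for illustration and are not part of Aspose or PDFBox. It suggests that 65262 and 65259 are positional presentation forms, while 1608 and 1607 are the base Arabic letters stored in the DOCX.

```java
import java.lang.Character.UnicodeBlock;

/** Hypothetical helper: reports which Unicode block a decimal code point belongs to. */
final class CodePointInfo {

    /** Formats a code point as "U+XXXX (BLOCK_NAME)". */
    static String describe(int codePoint) {
        return String.format("U+%04X (%s)", codePoint, UnicodeBlock.of(codePoint));
    }

    /** True if the code point lies in Arabic Presentation Forms-A or -B. */
    static boolean isArabicPresentationForm(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    public static void main(String[] args) {
        // Decimal values quoted in the post: two from the PDF, two from the DOCX.
        int[] values = {65262, 65259, 1608, 1607};
        for (int cp : values) {
            String note = isArabicPresentationForm(cp) ? " [presentation form]" : " [base letter]";
            System.out.println(cp + " = " + describe(cp) + note);
        }
    }
}
```

Running `main` prints one line per value, which makes it easy to see that the PDF text maps to glyph-shaped variants rather than to the letters themselves.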

tahir.manzoor · November 4, 2015, 8:33am

Hi Rajesh,

Thanks for your inquiry. We have tested the scenario and have not found any issue with output Pdf. Please check the attached output Pdf.

Could you please share the steps you are using to obtain the Unicode values of the characters? We will investigate the issue on our side and provide you with more information.

Rajesh_Mohanty · November 4, 2015, 9:13am

I tested with your attached PDF and I can still see the issue. The Unicode representation of the characters is different from that in the DOCX file.

ﻮ - Character unicode :65262

ﻫ - Character unicode :65259

I am using the Apache PDFBox API to extract text from the PDF. For debugging purposes I am also printing the Unicode value of each character. Here is the code I used with PDFBox:

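protected void processTextPosition(TextPosition text) {

try {

// Wrap System.out so that the Arabic characters are printed as UTF-8.

out1 = new PrintStream(System.out, true, "UTF-8");

} catch (UnsupportedEncodingException e) {

// UTF-8 is always available on the JVM, so this is not expected to happen.

throw new IllegalStateException(e);

}

float x = text.getX();

float y = text.getY();

// Print the extracted character followed by its decimal code point.

out1.print(text.getCharacter());

out1.println(" - Character unicode :" + text.getCharacter().codePointAt(0));

super.processTextPosition(text);

}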

This is a callback method that gets called whenever PDFBox sees text content in the document. As you can see, I am printing each character and its Unicode value.
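As a possible stopgap on the extraction side, the text returned by the callback could be passed through Unicode compatibility normalization (NFKC), which maps Arabic presentation forms back to their base letters. This is only a sketch using `java.text.Normalizer` from the Java standard library; the class and method names (`ArabicTextNormalizer`, `toBaseLetters`) are hypothetical, and it does not change what is stored in the PDF itself. NFKC also folds other compatibility characters (for example ligatures and full-width forms), so it may alter more than the Arabic letters.

```java
import java.text.Normalizer;

/** Hypothetical helper: folds presentation forms in extracted text to base letters. */
final class ArabicTextNormalizer {

    /** Applies NFKC so that e.g. U+FEEE (waw, final form) becomes U+0648 (waw). */
    static String toBaseLetters(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The two code points reported from the PDF: 65262 (U+FEEE) and 65259 (U+FEEB).
        String fromPdf = "\uFEEE\uFEEB";
        String normalized = toBaseLetters(fromPdf);
        // Print the decimal code point of each normalized character.
        for (int i = 0; i < normalized.length(); i++) {
            System.out.println("Character unicode :" + normalized.codePointAt(i));
        }
    }
}
```

Inside the PDFBox callback this would amount to calling `toBaseLetters(text.getCharacter())` before printing or comparing the value.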

tahir.manzoor · November 5, 2015, 7:58am

Hi Rajesh,

Thanks for sharing the details. We have tested the scenario and have managed to reproduce the same issue on our side. We have logged this problem in our issue tracking system as WORDSNET-12609 so that it can be corrected. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

codewarior · November 5, 2015, 7:58am

Hi Rajesh,

Thanks for using our APIs.

We have an API named Aspose.Pdf for Java, which can create PDF files as well as manipulate existing ones. It can also extract content from a PDF file. When I use the code snippet from "Extract Text from PDF using Text Device", the contents are extracted properly. For your reference, I have also attached the output generated on my end.

Could you please try our API and see whether it fulfills your requirement? Should you have any further queries, please feel free to contact us.

aspose.notifier · May 8, 2021, 3:28pm

The issues you have found earlier (filed as WORDSNET-12609) have been fixed in this Aspose.Words for .NET 21.5 update and this Aspose.Words for Java 21.5 update.