Automatic hyphenation in rendered PDF and PNG not accurate for certain languages

Dear Aspose Team,

After testing the dictionaries provided at the following link, we found that for some languages, especially German, some words are hyphenated quite inaccurately.

E.g. the word “Altersrente” is hyphenated as “Alter-srente”, which is invalid.

There are several other examples.

In general: is there any way to improve the quality of the hyphenation?

The structure of the dictionaries also looks quite non-intuitive, so any help from your side in improving the quality there would be much appreciated.

Please find attached the source document as well as the PDF conversion.

Thanks and kind regards

Wolfgang

Hi Wolfgang,

Thanks for your inquiry. Unfortunately, it is not possible for us to include or certify any hyphenation dictionaries for use with Aspose.Words, because creating dictionaries is a specialized area: they are created by linguistic experts, not by software developers. You should therefore find and use good open-source hyphenation dictionaries that satisfy your requirements. How did you produce the PDF (hyphenation.pdf) attached to your post? Did you use “hyph_de_CH.zip” or “hyph_de_DE.zip”? It would be great if you could attach the source code and the related dictionary here for further testing.

Best regards,

Hi Awais,

Sorry for the lack of information.

Please find attached the dictionary we used.

The condensed code for registering the dictionary and converting the document is:

Hyphenation.RegisterDictionary("de-DE", @"C:\dict\hyph_de_DE.dic");
Document doc = new Document(sourceDocPath);
doc.Save(convertedDocPath, SaveFormat.Pdf);

Hi Wolfgang,

Thanks for the additional information. Strangely, when I use your code and resources on my end, Aspose.Words 15.7.0 does not hyphenate the word “Altersrente” at all. Please see the attached PDF file. Most likely it is an issue with the dictionary itself. Please let me know if I can be of any further assistance.

Best regards,

Hi Awais,

Sorry for my delayed response.

You are right: with the provided code and resources, the issue does not appear.

I was now able to get back to the issue and do some more testing.

It seems that the problem only occurs if the German and English dictionaries are set together.

Please see the following code, which we use to reproduce the issue.

I have also attached the two dictionaries used, as well as the source document.

We are still talking about the wrongly split word “Alter-srente”, around lines 14-15 in the resulting PDF document.

// Imports needed by the snippet (Aspose.Words for Java and Apache Commons IO).
import com.aspose.words.*;
import org.apache.commons.io.FileUtils;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.file.Paths;

public void TestHyphenationAltersrente() throws Exception
{
    // Backslashes must be escaped in Java string literals.
    String sourceDocPath = "C:\\hyphenation\\autoHyphen.docx";
    String convertedDocPath = "C:\\hyphenation\\autoHyphen_java.pdf";
    AsposeLicenseInitializer.getInstance().initializeLicensesIfMust();
    // The issue only shows up when both dictionaries are registered.
    Hyphenation.registerDictionary("de-DE", "C:\\dict\\hyph_de_DE.dic");
    Hyphenation.registerDictionary("en-US", "C:\\dict\\hyph_en_US.dic");
    byte[] wordData = FileUtils.readFileToByteArray(Paths.get(sourceDocPath).toFile());
    SaveOptions saveOptions = SaveOptions.createSaveOptions(SaveFormat.PDF);
    saveOptions.setDmlRenderingMode(DmlRenderingMode.DRAWING_ML);
    ByteArrayInputStream docBinStream = new ByteArrayInputStream(wordData);
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    Document document = new Document(docBinStream);
    document.save(outStream, saveOptions);
    byte[] result = outStream.toByteArray();
    FileUtils.writeByteArrayToFile(Paths.get(convertedDocPath).toFile(), result);
}

Hi Wolfgang,

Thanks for the additional information. We tested the scenario and managed to reproduce the problem on our end. We have logged it in our issue tracking system as WORDSNET-12317. Our product team will look further into the details, and we will keep you updated on the status of the correction. We apologize for the inconvenience.

Best regards,

Hi Wolfgang,

Regarding WORDSNET-12317, our product team has completed its analysis and concluded that the undesired behaviour you are observing is not actually a bug in Aspose.Words, so we will most likely close this issue as 'Not a Bug'. The hyphenation quality depends mostly on the dictionaries, and we have no control over these. If the OpenOffice dictionary for German is not good, we cannot do much about it. Please search for a better hyphenation dictionary or maybe build your own. Please also check the following article:
https://wiki.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org

Best regards,

Hi Awais,

Thank you for the feedback.

We are not sure that it is related to the quality of one specific dictionary.

As mentioned, the bug does not come from one specific dictionary (at first we mistakenly suspected the German dictionary we used), but from the combination of multiple dictionaries.

So it is probably still something inside Aspose, in how the registered dictionaries are applied.

We also tested with OpenOffice, using the same dictionary for German, and the issue does not appear there.

I kindly ask the development team to have another look.

Or, at least, could you provide some insight into how Aspose handles the different hyphenation dictionaries for different languages, and on what basis the language for a specific document is selected?

Thanks and kind regards
Wolfgang

Hi Wolfgang,

Thanks for the details. I have passed your concern to our product team. They’ll investigate it further and we’ll keep you informed of any further developments.

Best regards,

Hi Wolfgang,

Please see the detailed analysis below, supplied by our product team:

The EffectiveLocaleId for the spans in the 4th paragraph (Das Arbeitsverhältnis…) is detected as EnglishUS.

In Reveal Formatting, I see the German language.

This is the reason why “the problem only occurs if the German and English dictionaries are set together.”

Aspose.Words tries to apply the English hyphenation dictionary to the words in the 4th paragraph, which gives the wrong hyphenation of the word “gesetzliche”.

MS Word shows this paragraph as US English. This is because it does not have w:lang specified but has the w:rStyle “DefaultParagraphFont”, which is most likely English. So I don’t think the Aspose.Words model is at fault here. MS Word also puts a red wavy underline under the paragraph to indicate that the spell check failed.
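To make the analysis above concrete, here is a minimal sketch of the lookup it describes. This is not Aspose.Words code: the class LocaleFallbackSketch, its methods, and the run/style/default fallback order are hypothetical and only model the behaviour reported in this thread, assuming that a run without its own language inherits one and that the dictionary is then chosen by that effective language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the behaviour described above, NOT Aspose.Words internals.
class LocaleFallbackSketch {

    // Returns the first language that is explicitly set: run, then style, then document default.
    static String effectiveLanguage(String runLang, String styleLang, String defaultLang) {
        if (runLang != null) return runLang;
        if (styleLang != null) return styleLang;
        return defaultLang;
    }

    // Returns the dictionary registered for the effective language,
    // or null when none is registered (the text is then not hyphenated).
    static String dictionaryFor(Map<String, String> registered,
                                String runLang, String styleLang, String defaultLang) {
        return registered.get(effectiveLanguage(runLang, styleLang, defaultLang));
    }

    public static void main(String[] args) {
        Map<String, String> registered = new HashMap<>();
        registered.put("de-DE", "hyph_de_DE.dic");

        // German text in a run without w:lang whose style resolves to en-US:
        // with only the German dictionary registered, nothing is applied.
        System.out.println(dictionaryFor(registered, null, "en-US", "en-US")); // prints null

        // Once an English dictionary is registered too, it is applied to the German text.
        registered.put("en-US", "hyph_en_US.dic");
        System.out.println(dictionaryFor(registered, null, "en-US", "en-US")); // prints hyph_en_US.dic
    }
}
```

Under these assumptions, the wrong breaks only show up once an en-US dictionary exists, which matches the observation that each dictionary on its own looked fine.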

The following code makes Aspose.Words use the German dictionary for the 4th paragraph, by registering the German dictionary file for en-US as well.

Document doc = new Document(MyDir + @"hyphenation.docx");

// Register the German dictionary for both de-DE and en-US, so that runs
// detected as English are also hyphenated with the German patterns.
Hyphenation.RegisterDictionary("de-DE", MyDir + "hyph_de_DE.dic");
Hyphenation.RegisterDictionary("en-US", MyDir + "hyph_de_DE.dic");

doc.Save(MyDir + @"15.7.0.pdf");

The 4th paragraph looks OK now, but the word “Altersrente” is still hyphenated wrongly in the last paragraph. This is a quality problem of the German hyphenation dictionary. We can resolve it by adding the following pattern to the German dictionary:

.al9ters9ren8te.

So, it might be that OpenOffice still recognizes this sentence as German and applies the correct dictionary, or it might use a different dictionary for English. The workaround is to apply the German language to the paragraph in Word. It might also help to add the pattern presented above to the German dictionary.
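For readers unfamiliar with the dictionary format, the pattern above can be read as a TeX-style (Liang) hyphenation pattern: the dots mark the word boundaries, the digits are priorities for the position between two letters, the highest digit wins where patterns overlap, and an odd result allows a break while an even one forbids it. The following is a minimal sketch under that assumption. PatternSketch is a hypothetical helper, it ignores minimum prefix and suffix lengths, and it is not how Aspose.Words or OpenOffice is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of TeX-style pattern matching, not library code.
class PatternSketch {

    // Splits a pattern such as ".al9ters9ren8te." into its letters (appended to lettersOut)
    // and the priority in front of each letter position (0 where no digit is given).
    static int[] levels(String pattern, StringBuilder lettersOut) {
        List<Integer> values = new ArrayList<>();
        int pending = 0;
        for (char c : pattern.toCharArray()) {
            if (Character.isDigit(c)) {
                pending = c - '0';
            } else {
                values.add(pending);
                pending = 0;
                lettersOut.append(c);
            }
        }
        values.add(pending); // priority after the last letter
        int[] result = new int[values.size()];
        for (int i = 0; i < result.length; i++) result[i] = values.get(i);
        return result;
    }

    // Inserts '-' wherever the highest matching priority between two letters is odd.
    static String hyphenate(String word, List<String> patterns) {
        String dotted = "." + word.toLowerCase(Locale.ROOT) + ".";
        int[] best = new int[dotted.length() + 1]; // best[k]: priority in front of dotted.charAt(k)
        for (String pattern : patterns) {
            StringBuilder letters = new StringBuilder();
            int[] lv = levels(pattern, letters);
            String key = letters.toString();
            for (int at = dotted.indexOf(key); at >= 0; at = dotted.indexOf(key, at + 1)) {
                for (int i = 0; i < lv.length; i++) {
                    best[at + i] = Math.max(best[at + i], lv[i]);
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            // word.charAt(i) sits at dotted index i + 1; never break before the first letter
            if (i > 0 && best[i + 1] % 2 == 1) out.append('-');
            out.append(word.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Al-ters-rente"
        System.out.println(hyphenate("Altersrente", Arrays.asList(".al9ters9ren8te.")));
    }
}
```

Note that a pattern only sets the positions where it carries a digit. Any other pattern in the dictionary that allows a break elsewhere in the word still applies, so it is worth re-checking the rendered output after adding a pattern.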

Best regards,

Hi Awais,

Thank you so much for the detailed feedback.
We will continue to look for a workaround on our side.
And I think the insight you provided into the functionality and circumstances helps us a great deal there.

Thanks and kind regards
Wolfgang

Hi Wolfgang,

Thanks for your understanding. In case you have further inquiries related to Aspose.Words, please let us know.

Best regards,