PDF to HTML for Arabic documents using the Java library

We are currently using Aspose.PDF to convert Arabic PDF documents to HTML, but we’re not getting the same results as your cloud version. The output is poorly rendered, and we actually get better results when using Word as a pivot format with EnhancedFlow. Are there any specific HTML options that we need to set when converting PDF to HTML so that we get results as good as your cloud version?

For your information, we’re using the latest version, 23.3, with a valid license.

@aelghaoui,

Do you mind sharing the document? Also the code snippet, please.

Do you have the fonts installed on the machine running the code?

@carlos.molina yes, sure.
simple arabic file.pdf (81.9 KB)
This is the example PDF that we’re using, and I’m using the following code:

    Document pdfDocument = new Document("simple arabic file.pdf");
    pdfDocument.save("output_out.html", SaveFormat.Html);

You can find the results here: simple arabic file.zip (138.3 KB)

It is obvious that the Java code does not keep words intact (we cannot select a single word from right to left), which is not the case with the cloud version.
As for the fonts question: yes, I have all the fonts installed.

@aelghaoui,

This is my code:

private void Logic()
{
    Document doc = new Document($"{PartialPath}_input.pdf");

    HtmlSaveOptions saveOptions = new HtmlSaveOptions();

    saveOptions.PartsEmbeddingMode = PartsEmbeddingModes.EmbedAllIntoHtml;
    saveOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    saveOptions.RasterImagesSavingMode = RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;

    

    // Save the output HTML document
    doc.Save($"{PartialPath}_output.html", saveOptions);

}

And this is the output, which is text and not an image, so you can select it.
ConvertPdfToHtml_2_output.zip (44.9 KB)

Thanks @carlos.molina, but you’re getting the same results as mine. If you inspect your HTML output, you will see that the span tags wrap arbitrary runs of characters rather than whole words or phrases, which makes it impossible to select a word from right to left. You can compare with the output from the cloud version: Convertir PDF En HTML En Ligne
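
For anyone comparing the two outputs, one rough way to put a number on this fragmentation is to measure how many characters each span holds on average. The sketch below is only an illustration under stated assumptions: `SpanFragmentation` and its methods are hypothetical names, not part of Aspose.PDF, and it uses a plain regular expression instead of a real HTML parser, so it only sees spans whose content has no nested tags and it counts HTML entities as literal characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF: estimates how finely an HTML
// document splits its text across <span> elements.
class SpanFragmentation {

    // Matches <span ...>text</span> where the content contains no nested tags.
    // This is an approximation; a real HTML parser would be more robust.
    private static final Pattern LEAF_SPAN =
            Pattern.compile("<span\\b[^>]*>([^<]*)</span>", Pattern.CASE_INSENSITIVE);

    // Returns the raw text of every leaf span, in document order.
    static List<String> leafSpanTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LEAF_SPAN.matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    // Average number of Unicode code points per non-blank leaf span.
    // Returns 0.0 when the document has no such spans.
    static double averageSpanLength(String html) {
        long total = 0;
        int count = 0;
        for (String text : leafSpanTexts(html)) {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore spans that only hold whitespace
            }
            total += trimmed.codePointCount(0, trimmed.length());
            count++;
        }
        return count == 0 ? 0.0 : (double) total / count;
    }

    public static void main(String[] args) {
        // Made-up sample markup: one letter per span versus one word per span.
        String fragmented = "<p><span>\u0645</span><span>\u0631</span><span>\u062D</span></p>";
        String wordLevel = "<p><span>\u0645\u0631\u062D\u0628\u0627</span></p>";
        System.out.println(averageSpanLength(fragmented));
        System.out.println(averageSpanLength(wordLevel));
    }
}
```

An average close to 1 suggests the text was emitted roughly character by character, while larger values suggest word-level or phrase-level spans. To use it on real files, read each HTML output into a string (for example with `java.nio.file.Files.readString`) and compare the two averages.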

@aelghaoui,

Sorry, being unfamiliar with the language, I did not realize that each character was a word, not a letter. You are right: sometimes selecting from right to left highlights several characters at once.

I do not think it is a problem with the options, but rather an issue in the conversion itself. I will create a ticket for the dev team.


@aelghaoui
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54479

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
