Conversion issue from pdf to html data

Hi Team,

I am facing a conversion issue with the attached PDF file: Micka ortila.pdf (2.0 MB)

My requirement was to convert it from a pdf file to an HTML string to preview it.

So I convert the PDF to Word, and then I convert the Word document to an HTML string using the Aspose.Words library.

Issue: The preview does not show the content properly, as shown in the attached PNG file PreviewIssue.png (39.2 KB)

It is not converted properly even when I only convert the PDF to Word and download the result.

I also tried the conversion of a document from the online converter “Convert Files Online - Word, PDF, HTML, JPG And Many More”. The same issue occurred.

We are using Aspose.Words version “13.3.0.0”.

Source Code of converting pdf to word file

public static byte[] ConvertPdfToDoc(byte[] input)
{
    SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
    string rtf = "";
    f.Serial = System.Configuration.ConfigurationManager.AppSettings["PDFFocus.License"];

    f.OpenPdf(input);
    rtf = f.ToWord();
    byte[] wordBytes = System.Text.Encoding.UTF8.GetBytes(rtf);
    MemoryStream docXStream = new MemoryStream(wordBytes);
    Document docX = new Document(docXStream);
    MemoryStream docXOutStream = new MemoryStream();
    OoxmlSaveOptions docSaveOpt = new OoxmlSaveOptions(SaveFormat.Docx);

    docSaveOpt.PrettyFormat = true;
    docX.JoinRunsWithSameFormatting();
    Document nDocX = ProcessLetterSpacing(docX);

    if (nDocX == null)
    {
        docX.Save(docXOutStream, docSaveOpt);
    }
    else
    {
        nDocX.Save(docXOutStream, docSaveOpt);
    }
    // ToArray() copies only the bytes actually written, without the unused buffer tail.
    return docXOutStream.ToArray();
}

Source Code of converting word bytes to html string

public string ConvertToHtml()
{
    string html = string.Empty;

    MemoryStream output = new MemoryStream();

    HtmlSaveOptions saveOpt = new HtmlSaveOptions(SaveFormat.Html);
    saveOpt.PrettyFormat = false;
    saveOpt.ImageSavingCallback = new HandleImageSaving(this);
    saveOpt.ExportImagesAsBase64 = true;

    this.ValidateHeaders();
    byte[] b;
    byte[] b2;
    try
    {
        this._document.Save(output, saveOpt);
        b = output.GetBuffer();

        b2 = new byte[output.Length];
        Buffer.BlockCopy(b, 0, b2, 0, (int)output.Length);
        html = Encoding.UTF8.GetString(b2);
    }
    catch (System.ArithmeticException exArithmetic)
    {
        this._document.Save(output, SaveFormat.Text);
        b = output.GetBuffer();

        b2 = new byte[output.Length];
        Buffer.BlockCopy(b, 0, b2, 0, (int)output.Length);

        html = Encoding.UTF8.GetString(b2);
        // Strings are immutable: Replace returns a new string, so assign the result.
        html = html.Replace("\r\n", "<br/>").Replace("\t", "").Replace("\"", "'");
    }

    return html;
}

Please let me know what I should do to resolve this issue.

@khyatitank As I can see in your code, you are using a third-party tool (SautinSoft.PdfFocus) to convert from PDF to RTF. Then you use Aspose.Words to convert RTF to DOCX and then DOCX to HTML. So I suspect the problem occurs at the first stage, the conversion from PDF to RTF.
With the latest version of Aspose.Words you can directly convert from PDF to HTML using code like the following:

Document doc = new Document(@"C:\Temp\Micka ortila.pdf");
doc.Save(@"C:\Temp\out.html");

NOTE: The feature is available only in .NET 4.6.1, .NET Standard 2.0 and .NET 6.0 versions of Aspose.Words.

However, this conversion also does not give an accurate result. I have logged the problem as WORDSNET-24023. We will keep you updated and let you know once it is resolved or we have more information for you.

Hello ,

Is it feasible to convert a PDF byte array to an HTML string using Aspose.Words version “13.3.0.0”?
If yes, could you share the code for it?

I tried with this version, but I am not able to convert it properly.

@khyatitank The feature is available starting from version 20.2 of Aspose.Words. Unfortunately, in earlier versions it is not possible to load PDF documents directly into an Aspose.Words.Document object.

Hi,

Can you share the code for version 20.2 with me?

I need conversion from a PDF byte array to an HTML string. File-to-file conversion is not required.

That way I can create a POC and test it.

@khyatitank You can use code like the following to convert PDF bytes to HTML string:

private string PdfToHtmlString(byte[] pdfBytes)
{
    using (MemoryStream pdfStream = new MemoryStream(pdfBytes))
    {
        Document doc = new Document(pdfStream);

        // Embed images as base64 into the HTML string.
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.ExportImagesAsBase64 = true;

        return doc.ToString(options);
    }
}

Is version 20.2 free or paid?

@khyatitank Aspose.Words is a commercial product. To use any version you need to have a valid license. Please see the Licensing section in our documentation for more information.
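
For a POC, the license is normally applied once at application startup, before any documents are processed. The following is only a minimal sketch: the file name "Aspose.Words.lic" and the class and method names are assumptions for illustration, and the Licensing section of the documentation remains the authoritative reference.

```csharp
using Aspose.Words;

public static class LicenseSetupSketch
{
    // Call once at startup, before creating any Document objects.
    public static void ApplyLicense()
    {
        License license = new License();

        // "Aspose.Words.lic" is a placeholder name (assumption);
        // use the path of the license file you received.
        license.SetLicense("Aspose.Words.lic");
    }
}
```

Without a license applied, Aspose.Words runs in evaluation mode, so the output of a test run may not be representative.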

Hello

You suggested that I use the latest version of Aspose.Words with the code below:

Document doc = new Document(@"C:\Temp\Micka ortila.pdf");
doc.Save(@"C:\Temp\out.html");

I am facing the same issue with the latest version also.
image.png (229.9 KB)

Could you please help me with this?

@khyatitank The issue you have reported is not resolved yet. With the latest version of Aspose.Words you can convert PDF to fixed HTML without loading the PDF into the Aspose.Words flow document model. You can achieve this using code like this:

Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer pdfRenderer = new Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer();
using (FileStream pdfStream = File.OpenRead(@"C:\Temp\in.pdf"))
{
    using (FileStream htmlStream = File.Create(@"C:\Temp\out.html"))
    {
        using (Stream outStream = pdfRenderer.SavePdfAsHtml(pdfStream))
        {
            outStream.CopyTo(htmlStream);
        }
    }
}

We plan to wrap this code into a more convenient API in one of the future versions.
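
Since the requirement in this thread is byte array to HTML string rather than file to file, the same renderer call can be adapted along these lines. This is only a sketch under stated assumptions: it assumes that SavePdfAsHtml accepts any Stream in the same way it accepts the FileStream above, that the returned stream is positioned at the start of the HTML, and that the HTML is UTF-8 encoded. The class and method names are made up for illustration.

```csharp
using System.IO;
using System.Text;
using Aspose.Words.Pdf2Word.FixedFormats;

public static class PdfPreviewSketch
{
    // Hypothetical helper: converts PDF bytes to a fixed-layout HTML string.
    public static string PdfBytesToFixedHtml(byte[] pdfBytes)
    {
        PdfFixedRenderer pdfRenderer = new PdfFixedRenderer();

        using (MemoryStream pdfStream = new MemoryStream(pdfBytes))
        // Same call as in the file-based code above, fed from memory instead of a file.
        using (Stream htmlStream = pdfRenderer.SavePdfAsHtml(pdfStream))
        // Assumption: the renderer emits UTF-8 encoded HTML.
        using (StreamReader reader = new StreamReader(htmlStream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Like the file-based code, this requires a recent version of Aspose.Words.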

May I know the tentative timeline for when it will be fixed?

@khyatitank Unfortunately, the issue is currently postponed and is not yet scheduled for development. So at the moment we cannot provide you any estimates.

Hello

We are using Aspose.Words version 13.3.0.0.

Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer pdfRenderer = new Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer();
using (FileStream pdfStream = File.OpenRead(@"C:\Temp\in.pdf"))
{
    using (FileStream htmlStream = File.Create(@"C:\Temp\out.html"))
    {
        using (Stream outStream = pdfRenderer.SavePdfAsHtml(pdfStream))
        {
            outStream.CopyTo(htmlStream);
        }
    }
}

This code does not work in version 13.3.0.0.

Do you have any other code that works with 13.3.0.0?

We do not have approval right now to upgrade from 13.3.0.0 to the latest version.

@khyatitank No, unfortunately version 13.3.0 of Aspose.Words does not support this feature.

Quoting alexey.noskov (Sep 21):

@khyatitank Unfortunately, the issue is currently postponed and is not yet scheduled for development. So at the moment, we cannot provide you with any estimates.

Is this issue resolved in the latest version?

@khyatitank Unfortunately, there is no news regarding the issue. The issue is still postponed.