Conversion issue from pdf to html data

khyatitank · June 24, 2022, 8:07am

Hi Team,

I am facing a conversion issue with the attached PDF file, Micka ortila.pdf (2.0 MB).

My requirement is to convert a PDF file to an HTML string so that it can be previewed.

To do this, I convert the PDF to Word, and then convert the Word document to an HTML string using the Aspose.Words library.

Issue: the content is not displayed properly, as shown in the attached PNG file PreviewIssue.png (39.2 KB).

The content is not converted properly even when I only convert the PDF to Word and download the result.

I also tried converting the document with the online converter “Convert Files Online - Word, PDF, HTML, JPG And Many More”. The same issue occurred.

We are using Aspose.Words version “13.3.0.0”.

Source code for converting the PDF to a Word file:

public static byte[] ConvertPdfToDoc(byte[] input)
{
    SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
    string rtf = "";
    f.Serial = System.Configuration.ConfigurationManager.AppSettings["PDFFocus.License"];

    f.OpenPdf(input);
    rtf = f.ToWord();
    byte[] wordBytes = System.Text.Encoding.UTF8.GetBytes(rtf);
    MemoryStream docXStream = new MemoryStream(wordBytes);
    Document docX = new Document(docXStream);
    MemoryStream docXOutStream = new MemoryStream();
    OoxmlSaveOptions docSaveOpt = new OoxmlSaveOptions(SaveFormat.Docx);

    docSaveOpt.PrettyFormat = true;
    docX.JoinRunsWithSameFormatting();
    Document nDocX = ProcessLetterSpacing(docX);

    if (nDocX == null)
    {
        docX.Save(docXOutStream, docSaveOpt);
    }
    else
    {
        nDocX.Save(docXOutStream, docSaveOpt);
    }
    byte[] b = docXOutStream.GetBuffer();
    byte[] b2 = new byte[docXOutStream.Length];
    Buffer.BlockCopy(b, 0, b2, 0, (int)docXOutStream.Length);
    return b2;
}

Source code for converting the Word bytes to an HTML string:

public string ConvertToHtml()
{
    string html = string.Empty;

    MemoryStream output = new MemoryStream();

    HtmlSaveOptions saveOpt = new HtmlSaveOptions(SaveFormat.Html);
    saveOpt.PrettyFormat = false;
    saveOpt.ImageSavingCallback = new HandleImageSaving(this);
    saveOpt.ExportImagesAsBase64 = true;

    this.ValidateHeaders();
    byte[] b;
    byte[] b2;
    try
    {
        this._document.Save(output, saveOpt);
        b = output.GetBuffer();

        b2 = new byte[output.Length];
        Buffer.BlockCopy(b, 0, b2, 0, (int)output.Length);
        html = Encoding.UTF8.GetString(b2);
    }
    catch (System.ArithmeticException exArithmetic)
    {
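        // Discard any partial HTML already written before falling back to plain text.
        output.SetLength(0);
        this._document.Save(output, SaveFormat.Text);
        b = output.GetBuffer();

        b2 = new byte[output.Length];
        Buffer.BlockCopy(b, 0, b2, 0, (int)output.Length);

        html = Encoding.UTF8.GetString(b2);
        // String.Replace returns a new string, so the result must be assigned back.
        html = html.Replace("\r\n", "<br/>").Replace("\t", "").Replace("\"", "'");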
    }

    return html;
}

Please let me know what I should do to resolve this issue.

alexey.noskov · June 24, 2022, 8:50am

@khyatitank As I can see from your code, you are using a third-party tool (SautinSoft.PdfFocus) to convert PDF to RTF. You then use Aspose.Words to convert RTF to DOCX, and DOCX to HTML. So I suspect the problem occurs at the first stage, the conversion from PDF to RTF.
With the latest version of Aspose.Words you can directly convert from PDF to HTML using code like the following:

Document doc = new Document(@"C:\Temp\Micka ortila.pdf");
doc.Save(@"C:\Temp\out.html");

NOTE: The feature is available only in .NET 4.6.1, .NET Standard 2.0 and .NET 6.0 versions of Aspose.Words.

However, this conversion does not give an accurate result either. I have logged the problem as WORDSNET-24023. We will keep you updated and let you know once it is resolved or we have more information for you.

khyatitank · July 11, 2022, 5:44am

Hello,

Is it feasible to convert a PDF byte array to an HTML string using Aspose.Words version “13.3.0.0”?
If so, could you share the code for it?

I tried with this version, but I was not able to convert it properly.

alexey.noskov · July 11, 2022, 6:02am

@khyatitank The feature is available starting from version 20.2 of Aspose.Words. Unfortunately, earlier versions of Aspose.Words cannot load PDF documents directly into an Aspose.Words.Document object.
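
For a project that may run against different Aspose.Words builds, the version threshold mentioned in this reply could be checked up front. The following is only a sketch: AsposeVersionCheck and SupportsDirectPdfLoading are hypothetical names, not part of Aspose.Words, and the 20.2 threshold is taken from the reply above.

```csharp
using System;

public static class AsposeVersionCheck
{
    // First Aspose.Words version reported in this thread to load PDF directly.
    private static readonly Version MinimumPdfLoadVersion = new Version(20, 2);

    // Returns true when the given assembly version is 20.2 or later.
    public static bool SupportsDirectPdfLoading(Version installed)
    {
        if (installed == null) throw new ArgumentNullException(nameof(installed));
        return installed >= MinimumPdfLoadVersion;
    }
}
```

One way to obtain the version to pass in is typeof(Aspose.Words.Document).Assembly.GetName().Version, assuming the assembly version matches the product version. With 13.3.0.0 the check returns false.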

khyatitank · July 11, 2022, 6:15am

Hi,

Can you share the code for version 20.2?

I need conversion from a PDF byte array to an HTML string; file-to-file conversion is not required.

That way I can create a POC and test it.

alexey.noskov · July 11, 2022, 6:26am

@khyatitank You can use code like the following to convert PDF bytes to HTML string:

private string PdfToHtmlString(byte[] pdfBytes)
{
    using (MemoryStream pdfStream = new MemoryStream(pdfBytes))
    {
        Document doc = new Document(pdfStream);

        // Embed images as base64 into the HTML string.
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.ExportImagesAsBase64 = true;

        return doc.ToString(options);
    }
}

khyatitank · July 11, 2022, 10:31am

Is version 20.2 free or paid?

alexey.noskov · July 11, 2022, 2:29pm

@khyatitank Aspose.Words is a commercial product. To use any version, you need a valid license. Please see the Licensing section in our documentation for more information.
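
As a minimal sketch of what applying a license looks like in code: the License class and its SetLicense method are part of Aspose.Words, while LicenseSetup, Apply, and the license path are placeholders chosen for illustration. The Licensing documentation remains the authoritative reference.

```csharp
using Aspose.Words;

public static class LicenseSetup
{
    // licensePath is a placeholder, for example "Aspose.Words.lic";
    // pass the location of your own license file.
    public static void Apply(string licensePath)
    {
        License license = new License();
        license.SetLicense(licensePath);
    }
}
```

This would typically be called once at application startup, before any documents are loaded or saved.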

khyatitank · September 21, 2022, 11:52am

Hello,

You suggested that I use the latest version of Aspose.Words with the code below:

Document doc = new Document(@"C:\Temp\Micka ortila.pdf");
doc.Save(@"C:\Temp\out.html");

I am facing the same issue with the latest version as well.
image.png (229.9 KB)

Could you please help me with this?

alexey.noskov · September 21, 2022, 1:30pm

@khyatitank The issue you have reported is not resolved yet. With the latest version of Aspose.Words, you can convert PDF to fixed HTML without loading the PDF into the Aspose.Words flow document model. You can achieve this using code like this:

Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer pdfRenderer = new Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer();
using (FileStream pdfStream = File.OpenRead(@"C:\Temp\in.pdf"))
{
    using (FileStream htmlStream = File.Create(@"C:\Temp\out.html"))
    {
        using (Stream outStream = pdfRenderer.SavePdfAsHtml(pdfStream))
        {
            outStream.CopyTo(htmlStream);
        }
    }
}

We plan to wrap this code into a more convenient API in a future version.
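
Since the requirement in this thread is a PDF byte array to an HTML string rather than file to file, the same renderer call can be wrapped with in-memory streams. This is a sketch under assumptions, not a confirmed pattern: PdfHtmlHelper and PdfBytesToFixedHtml are hypothetical names, and the code assumes the stream returned by SavePdfAsHtml contains UTF-8 encoded HTML.

```csharp
using System.IO;
using System.Text;
using Aspose.Words.Pdf2Word.FixedFormats;

public static class PdfHtmlHelper
{
    // Converts PDF bytes to a fixed-layout HTML string without touching the file system.
    public static string PdfBytesToFixedHtml(byte[] pdfBytes)
    {
        PdfFixedRenderer pdfRenderer = new PdfFixedRenderer();
        using (MemoryStream pdfStream = new MemoryStream(pdfBytes))
        using (Stream htmlStream = pdfRenderer.SavePdfAsHtml(pdfStream))
        {
            // Rewind defensively in case the returned stream is not at its start.
            if (htmlStream.CanSeek)
            {
                htmlStream.Position = 0;
            }

            // Assumption: the output is UTF-8 encoded HTML.
            using (StreamReader reader = new StreamReader(htmlStream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Like the snippet above, this requires a recent Aspose.Words version that ships PdfFixedRenderer.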

khyatitank · September 21, 2022, 1:44pm

May I know the tentative timeline for when it will be fixed?

alexey.noskov · September 21, 2022, 1:47pm

@khyatitank Unfortunately, the issue is currently postponed and is not yet scheduled for development. So at the moment we cannot provide you any estimates.

khyatitank · September 21, 2022, 2:04pm

Hello,

We are using Aspose.Words version 13.3.0.0.

Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer pdfRenderer = new Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer();
using (FileStream pdfStream = File.OpenRead(@"C:\Temp\in.pdf"))
{
    using (FileStream htmlStream = File.Create(@"C:\Temp\out.html"))
    {
        using (Stream outStream = pdfRenderer.SavePdfAsHtml(pdfStream))
        {
            outStream.CopyTo(htmlStream);
        }
    }
}

This code does not work in version 13.3.0.0.

Do you have any other code that works with 13.3.0.0?

We do not currently have approval to upgrade from 13.3.0.0 to the latest version.

alexey.noskov · September 21, 2022, 2:06pm

@khyatitank No, unfortunately version 13.3.0 of Aspose.Words does not support this feature.

khyatitank · November 21, 2022, 10:54am

Quoting alexey.noskov (Sep 21):

“@khyatitank Unfortunately, the issue is currently postponed and is not yet scheduled for development. So at the moment, we cannot provide you with any estimates.”

Is this issue resolved in the latest version?

alexey.noskov · November 21, 2022, 11:47am

@khyatitank Unfortunately, there is no news regarding the issue. It is still postponed.