How to convert PDF to HTML data with font styles

SouvikMoun · May 19, 2023, 4:30pm

Hi,
I want to convert a PDF file (located in SharePoint) to HTML data, including font styles and images, using Aspose.Words and C#. Please help with this.

Regards,
Souvik

eduardo.canal · May 19, 2023, 5:23pm

@SouvikMoun
To read a file from SharePoint in C#, you can use the SharePoint Client Object Model (CSOM) or the SharePoint REST API. The example below uses CSOM.
First, add references to the required SharePoint CSOM assemblies to your project. You can find these assemblies in the SharePoint Client Components SDK.
After that, load the file content into memory and pass it, wrapped in a MemoryStream, to the Aspose.Words Document constructor that reads from a Stream.
See the following example:

using System;
using System.IO;
using System.Linq;
using System.Security;
using Microsoft.SharePoint.Client;
using Aspose.Words;
using Aspose.Words.Loading;
using Aspose.Words.Saving;

class Program
{
    static void Main(string[] args)
    {
        string siteUrl = "https://your-sharepoint-site-url";
        string libraryName = "Documents";
        string fileName = "example.pdf";

        using (ClientContext context = new ClientContext(siteUrl))
        {
            // Provide SharePoint credentials (SharePointOnlineCredentials expects a SecureString password)
            SecureString password = new SecureString();
            foreach (char c in "password")
                password.AppendChar(c);
            context.Credentials = new SharePointOnlineCredentials("username", password);

            // Retrieve the document library
            List sharedDocumentsList = context.Web.Lists.GetByTitle(libraryName);
            context.Load(sharedDocumentsList);
            context.ExecuteQuery();

            // Retrieve all items in the library
            CamlQuery query = CamlQuery.CreateAllItemsQuery();
            ListItemCollection items = sharedDocumentsList.GetItems(query);
            context.Load(items);
            context.ExecuteQuery();

            // Find the file by its name
            ListItem fileItem = items.Cast<ListItem>()
                .FirstOrDefault(item => item.FileSystemObjectType == FileSystemObjectType.File
                && item["FileLeafRef"].ToString() == fileName);

            if (fileItem != null)
            {
                // Load the file data
                context.Load(fileItem.File);
                context.ExecuteQuery();

                // Open the file content (fully qualified to avoid a clash with System.IO.File)
                FileInformation fileInfo = Microsoft.SharePoint.Client.File.OpenBinaryDirect(context, fileItem.File.ServerRelativeUrl);

                // Copy the response stream into a seekable MemoryStream
                using (MemoryStream memoryStream = new MemoryStream())
                {
                    fileInfo.Stream.CopyTo(memoryStream);
                    memoryStream.Position = 0;

                    Document doc = new Document(memoryStream, new LoadOptions { LoadFormat = LoadFormat.Pdf });
                    HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions()
                    {
                        ExportOriginalUrlForLinkedImages = true,
                        ExportPageSetup = true,
                        ImagesFolder = "C:\\Temp\\output\\img",
                        ImagesFolderAlias = "img",
                        // This option is true by default; do not set it to false if you want to keep the track changes information
                        ExportRoundtripInformation = true,
                    };
                    // You can save the file directly to your local file system, or save it to a Stream and then use the SharePoint client to create a file in SharePoint.
                    // Replace the placeholder below with the path of the output file.
                    doc.Save("Target directory", htmlSaveOptions);
                }
            }
            else
            {
                Console.WriteLine("File not found");
            }
        }

        Console.ReadLine();
    }
}

In the above example, replace "https://your-sharepoint-site-url" with the URL of your SharePoint site. Set the libraryName variable to the name of the document library where the file is located, and fileName to the name of the file you want to read.

Provide the appropriate SharePoint credentials by replacing "username" and "password" with valid credentials that have access to the SharePoint site.
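
If the goal is an HTML string rather than files on disk, one possible variation is to save to a MemoryStream and embed the images, fonts and CSS in the markup itself. The following is only a sketch under assumptions: the class PdfToHtmlSketch and the method ConvertPdfToHtml are hypothetical names, the option values are a starting point to adjust, and the result has not been checked against any particular PDF.

```csharp
using System.IO;
using Aspose.Words;
using Aspose.Words.Loading;
using Aspose.Words.Saving;

static class PdfToHtmlSketch
{
    // Hypothetical helper: takes the PDF bytes and returns the HTML markup as a string.
    public static string ConvertPdfToHtml(byte[] pdfBytes)
    {
        using (MemoryStream input = new MemoryStream(pdfBytes))
        using (MemoryStream output = new MemoryStream())
        {
            Document doc = new Document(input, new LoadOptions { LoadFormat = LoadFormat.Pdf });

            HtmlSaveOptions options = new HtmlSaveOptions
            {
                // Embed images as Base64 data instead of writing them to an images folder
                ExportImagesAsBase64 = true,
                // Export the fonts used by the document and embed them as Base64 too
                ExportFontResources = true,
                ExportFontsAsBase64 = true,
                // Keep the CSS inside the HTML document rather than in a separate file
                CssStyleSheetType = CssStyleSheetType.Embedded
            };

            doc.Save(output, options);

            // Rewind before reading the saved markup back as text
            output.Position = 0;
            using (StreamReader reader = new StreamReader(output))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Embedding everything makes the string self-contained but larger. If size matters, keeping ImagesFolder and ImagesFolderAlias as in the example above may be the better trade-off.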

eduardo.canal · May 19, 2023, 6:03pm

@SouvikMoun Please note that the PDF and HTML formats are structured differently from the MS Word file format. Since Aspose.Words works exclusively with the MS Word document structure, it has to perform a conversion when loading or saving files in these formats. In your case, the Aspose.Pdf API would probably be a better fit.

SouvikMoun · May 19, 2023, 6:56pm

Hi,

Where is stream declared?

Document doc = new Document(stream, new LoadOptions{LoadFormat = LoadFormat.Pdf});

When I execute this line, I get a “Cannot access a closed Stream” error.

Here is my code snippet:

public async Task<string> GetHtmlDataByUrl(string pdfRelativeUrl)
{
    string htmlData = string.Empty;
    if (pdfRelativeUrl != null)
    {
        string contentType = string.Empty;
        MemoryStream ms = await this.commonAppService.GetFileStream(pdfRelativeUrl).ConfigureAwait(false);
        byte[] byteArray = ms.ToArray();
        MemoryStream memory = new MemoryStream(byteArray);
        Document doc = new Document(memory, new LoadOptions { LoadFormat = LoadFormat.Pdf });
        HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions()
        {
            ExportOriginalUrlForLinkedImages = true,
            ExportPageSetup = true,
            ImagesFolder = "C:\\Temp\\output\\img",
            ImagesFolderAlias = "img",
            // This option is true by default; do not set it to false if you want to keep the track changes information
            ExportRoundtripInformation = true,
            SaveFormat = SaveFormat.Html
        };
        MemoryStream memoryStream = new MemoryStream();
        doc.Save(memoryStream, htmlSaveOptions);
        memoryStream.Position = 0;
        StreamReader reader = new StreamReader(memoryStream);
        htmlData = reader.ReadToEnd();
        htmlData = htmlData.Replace("\"", "'");
    }
    return htmlData;
}

Can you please help with this?

Thank you.

eduardo.canal · May 19, 2023, 7:01pm

@SouvikMoun

Where is stream declared?

Sorry, I introduced a typo when editing the code to post it in the forum. The correct code is the following:

// Create a MemoryStream from the byte array
MemoryStream memoryStream = new MemoryStream(byteArray);
Document doc = new Document(memoryStream, new LoadOptions{LoadFormat = LoadFormat.Pdf});

I’m reviewing your code now.

eduardo.canal · May 19, 2023, 7:08pm

SouvikMoun:

MemoryStream ms = await this.commonAppService.GetFileStream(pdfRelativeUrl).ConfigureAwait(false);
byte[] byteArray = ms.ToArray();
MemoryStream memory = new MemoryStream(byteArray);
Document doc = new Document(memory, new LoadOptions { LoadFormat = LoadFormat.Pdf });

If you are already getting the data as a MemoryStream object, you don’t need to convert it to a byte array and then wrap it in a MemoryStream object again.

Additionally, can you please share the exact issue you get when you execute your code?

SouvikMoun · May 19, 2023, 7:11pm

Hi,
Please find the error message below:

{
  "error": {
    "message": "Cannot access a closed Stream.",
    "innerExceptionMessage": "",
    "exception": "TechnicalException",
    "userFriendlyMessage": "Please contact administrator.",
    "customErrorCode": 111,
    "stackTrace": "   at System.IO.MemoryStream.set_Position(Int64 value)\r\n   at Aspose.Words.Document.\u0003 (Stream \u0002, LoadOptions \u0003)\r\n   at Aspose.Words.Document..ctor(Stream stream, LoadOptions loadOptions)\r\n   at NextGenESW.CMS.API.NextGenESW.CMS.Business.PublishedContent.PublishedSharepointAppService.GetHtmlDataByUrl(String pdfRelativeUrl) in C:\\Users\\pwesw2\\Source\\Repos\\NextGenESW.ContentManagement\\NextGenESW.CMS.API\\NextGenESW.CMS.Business\\PublishedContent\\PublishedSharepointAppService.cs:line 577\r\n   at NextGenESW.CMS.API.NextGenESW.CMS.Business.PublishedContent.PublishedSharepointAppService.GetAllExtractedDocumentAsync(Int32 sourceId, String contentType, String contentNo, Nullable`1 version) in C:\\Users\\pwesw2\\Source\\Repos\\NextGenESW.ContentManagement\\NextGenESW.CMS.API\\NextGenESW.CMS.Business\\PublishedContent\\PublishedSharepointAppService.cs:line 132\r\n   at NextGenESW.CMS.API.Controllers.ContentExportController.PublishedSharepointContentPDFEKS(String contentId, String contentType) in C:\\Users\\pwesw2\\Source\\Repos\\NextGenESW.ContentManagement\\NextGenESW.CMS.API\\Controllers\\ContentExportController.cs:line 206"
  }
}

Document doc = new Document(memory, new LoadOptions { LoadFormat = LoadFormat.Pdf });

We are getting the error on the line above.

eduardo.canal · May 19, 2023, 7:24pm

@SouvikMoun The issue is probably caused by how you obtain the MemoryStream object
(await this.commonAppService.GetFileStream(pdfRelativeUrl).ConfigureAwait(false);)

In any case, when you work with streams it is recommended to close them (Dispose()) once you have finished using them. For that reason (and this depends entirely on your architecture), it is not good practice to have methods that return Streams, because the consumer may forget to dispose of them after use.
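
The stack trace posted earlier fails in MemoryStream.set_Position, which is what happens when a MemoryStream has been disposed before it reaches the Document constructor. The sketch below uses only System.IO to show that behavior. GetDisposedStream is a hypothetical stand-in, since the real GetFileStream implementation is not shown in this thread, so treat it as an illustration of one possible cause rather than a diagnosis.

```csharp
using System;
using System.IO;

public static class ClosedStreamDemo
{
    // Hypothetical stand-in for a helper that disposes the stream before returning it.
    public static MemoryStream GetDisposedStream()
    {
        MemoryStream ms = new MemoryStream(new byte[] { 1, 2, 3 });
        ms.Dispose();
        return ms;
    }

    public static void Main()
    {
        MemoryStream closed = GetDisposedStream();

        try
        {
            // Setting Position on a disposed MemoryStream throws ObjectDisposedException
            closed.Position = 0;
        }
        catch (ObjectDisposedException ex)
        {
            Console.WriteLine(ex.Message);
        }

        // ToArray() still works on a closed MemoryStream, so the bytes can be recovered
        byte[] bytes = closed.ToArray();
        using (MemoryStream copy = new MemoryStream(bytes))
        {
            copy.Position = 0;
            Console.WriteLine(copy.Length);
        }
    }
}
```

Copying the bytes into a new MemoryStream is only a workaround. If the helper really does dispose the stream, the cleaner fix is usually to have it return a byte[] or leave disposal to the caller.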

eduardo.canal · May 19, 2023, 7:53pm

@SouvikMoun Sorry, but I cannot replicate your issue with the data you provided. Can you please build a simple console application with the code I shared and test it, or attach a simple application that allows us to reproduce your problem?