Hi,
I want to convert a PDF file (located in SharePoint) to HTML, keeping the font styles and images, using Aspose.Words and C#. Please help with this topic.
Regards,
Souvik
@SouvikMoun
To read a file from SharePoint using C#, you can use the SharePoint Client Object Model (CSOM) or the SharePoint REST API. Here’s how to read a file using CSOM:
First, add references to the required SharePoint CSOM assemblies to your project. You can find these assemblies in the SharePoint Client Components SDK.
After that, you can load your file as a byte array and use the Aspose.Words Document constructor that reads the file from a Stream.
See the following example:
using System;
using System.IO;
using System.Linq;
using System.Security;
using Microsoft.SharePoint.Client;
using Aspose.Words;
using Aspose.Words.Loading;
using Aspose.Words.Saving;

class Program
{
    static void Main(string[] args)
    {
        string siteUrl = "https://your-sharepoint-site-url";
        string libraryName = "Documents";
        string fileName = "example.pdf";

        using (ClientContext context = new ClientContext(siteUrl))
        {
            // Provide SharePoint credentials (SharePointOnlineCredentials expects a SecureString password)
            SecureString password = new SecureString();
            foreach (char c in "password") password.AppendChar(c);
            context.Credentials = new SharePointOnlineCredentials("username", password);

            // Retrieve the document library
            List sharedDocumentsList = context.Web.Lists.GetByTitle(libraryName);
            context.Load(sharedDocumentsList);
            context.ExecuteQuery();

            // Retrieve all items in the library
            CamlQuery query = CamlQuery.CreateAllItemsQuery();
            ListItemCollection items = sharedDocumentsList.GetItems(query);
            context.Load(items);
            context.ExecuteQuery();

            // Find the file by its name
            ListItem fileItem = items.Cast<ListItem>()
                .FirstOrDefault(item => item.FileSystemObjectType == FileSystemObjectType.File
                    && item["FileLeafRef"].ToString() == fileName);

            if (fileItem != null)
            {
                // Load the file data
                context.Load(fileItem.File);
                context.ExecuteQuery();

                // Read the file data into a byte array.
                // The File class is fully qualified to avoid a clash with System.IO.File.
                byte[] fileContent;
                using (FileInformation fileInfo = Microsoft.SharePoint.Client.File.OpenBinaryDirect(context, fileItem.File.ServerRelativeUrl))
                using (MemoryStream buffer = new MemoryStream())
                {
                    // CopyTo reads until the end of the stream, so it does not rely on Stream.Length
                    fileInfo.Stream.CopyTo(buffer);
                    fileContent = buffer.ToArray();
                }

                // Create a MemoryStream from the byte array
                using (MemoryStream memoryStream = new MemoryStream(fileContent))
                {
                    Document doc = new Document(memoryStream, new LoadOptions { LoadFormat = LoadFormat.Pdf });
                    HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions()
                    {
                        ExportOriginalUrlForLinkedImages = true,
                        ExportPageSetup = true,
                        ImagesFolder = "C:\\Temp\\output\\img",
                        ImagesFolderAlias = "img",
                        // This option is true by default; if you want to save the track changes information, do not set it to false
                        ExportRoundtripInformation = true,
                    };

                    // You can save the file directly to your local file system, or save it to a Stream and then use the SharePoint client to create a file in your SharePoint.
                    // Replace the placeholder below with the path of the output file.
                    doc.Save("Target directory", htmlSaveOptions);
                }
            }
            else
            {
                Console.WriteLine("File not found");
            }
        }

        Console.ReadLine();
    }
}
In the above example, replace "https://your-sharepoint-site-url" with the URL of your SharePoint site. Set the libraryName variable to the name of the document library where the file is located, and fileName to the name of the file you want to read.
Provide the appropriate SharePoint credentials by replacing "username" and "password" with valid credentials that have access to the SharePoint site.
@SouvikMoun Please note that PDF and HTML formats have a different structure compared to MS Word file format. Since Aspose.Words exclusively works with MS Word file structure, it needs to perform a conversion when loading or saving files in these formats. In your case, working with the Aspose.Pdf API would be advantageous.
Hi,
Where is stream declared?
Document doc = new Document(stream, new LoadOptions{LoadFormat = LoadFormat.Pdf});
When I execute this line, I get a “Cannot access a closed Stream” error.
Here is my code snippet:
public async Task<string> GetHtmlDataByUrl(string pdfRelativeUrl)
{
    string htmlData = string.Empty;
    if (pdfRelativeUrl != null)
    {
        string contentType = string.Empty;
        MemoryStream ms = await this.commonAppService.GetFileStream(pdfRelativeUrl).ConfigureAwait(false);
        byte[] byteArray = ms.ToArray();
        MemoryStream memory = new MemoryStream(byteArray);
        Document doc = new Document(memory, new LoadOptions { LoadFormat = LoadFormat.Pdf });
        HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions()
        {
            ExportOriginalUrlForLinkedImages = true,
            ExportPageSetup = true,
            ImagesFolder = "C:\\Temp\\output\\img",
            ImagesFolderAlias = "img",
            // This option is true by default; if you want to save the track changes information, do not set it to false
            ExportRoundtripInformation = true,
            SaveFormat = SaveFormat.Html
        };
        MemoryStream memoryStream = new MemoryStream();
        doc.Save(memoryStream, htmlSaveOptions);
        memoryStream.Position = 0;
        StreamReader reader = new StreamReader(memoryStream);
        htmlData = reader.ReadToEnd();
        htmlData = htmlData.Replace("\"", "'");
    }
    return htmlData;
}
Can you please help with this?
Thank you.
where stream is declared?
Sorry, I introduced a typo when editing the code to post it in the forum. The correct syntax is the following:
// Create a MemoryStream from the byte array
MemoryStream memoryStream = new MemoryStream(fileContent);
Document doc = new Document(memoryStream, new LoadOptions{LoadFormat = LoadFormat.Pdf});
I’m reviewing your code now.
If you are already getting the data as a MemoryStream object, you don’t need to convert it to a byte array and then to a MemoryStream object again.
Additionally, can you please share the issue you are having when you execute your code?
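As a rough illustration of using the MemoryStream directly, here is a minimal sketch that relies only on System.IO. It assumes the stream you receive is still open. The `source` variable is a hypothetical stand-in for whatever GetFileStream returns, and the StreamReader stands in for the Document constructor, so neither is your actual code.

```csharp
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the MemoryStream returned by your service.
        MemoryStream source = new MemoryStream(Encoding.UTF8.GetBytes("sample content"));

        // Simulate a producer that left the position at the end after writing.
        source.Position = source.Length;

        // Rewind the stream instead of copying it to a byte array and wrapping it again.
        source.Position = 0;

        // Any consumer that accepts a Stream can now read from the start.
        // In the real code, this is where the stream would be passed to the Document constructor.
        using (StreamReader reader = new StreamReader(source))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```

Rewinding only works while the stream is open. If the producer has already disposed it, setting Position fails.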
Hi,
Please find the error message below:
{
"error": {
"message": "Cannot access a closed Stream.",
"innerExceptionMessage": "",
"exception": "TechnicalException",
"userFriendlyMessage": "Please contact administrator.",
"customErrorCode": 111,
"stackTrace": " at System.IO.MemoryStream.set_Position(Int64 value)\r\n at Aspose.Words.Document.\u0003 (Stream \u0002, LoadOptions \u0003)\r\n at Aspose.Words.Document..ctor(Stream stream, LoadOptions loadOptions)\r\n at NextGenESW.CMS.API.NextGenESW.CMS.Business.PublishedContent.PublishedSharepointAppService.GetHtmlDataByUrl(String pdfRelativeUrl) in C:\\Users\\pwesw2\\Source\\Repos\\NextGenESW.ContentManagement\\NextGenESW.CMS.API\\NextGenESW.CMS.Business\\PublishedContent\\PublishedSharepointAppService.cs:line 577\r\n at NextGenESW.CMS.API.NextGenESW.CMS.Business.PublishedContent.PublishedSharepointAppService.GetAllExtractedDocumentAsync(Int32 sourceId, String contentType, String contentNo, Nullable`1 version) in C:\\Users\\pwesw2\\Source\\Repos\\NextGenESW.ContentManagement\\NextGenESW.CMS.API\\NextGenESW.CMS.Business\\PublishedContent\\PublishedSharepointAppService.cs:line 132\r\n at NextGenESW.CMS.API.Controllers.ContentExportController.PublishedSharepointContentPDFEKS(String contentId, String contentType) in C:\\Users\\pwesw2\\Source\\Repos\\NextGenESW.ContentManagement\\NextGenESW.CMS.API\\Controllers\\ContentExportController.cs:line 206"
}
}
Document doc = new Document(memory, new LoadOptions { LoadFormat = LoadFormat.Pdf });
We are getting the error on the line above.
@SouvikMoun the issue is probably caused by how you obtain the MemoryStream object (await this.commonAppService.GetFileStream(pdfRelativeUrl).ConfigureAwait(false);).
In any case, when you work with streams it is recommended to close them (Dispose()) once you have finished using them. For that reason (and this depends entirely on your architecture), it is not good practice to have methods that return Streams, because the consumer may forget to dispose of them after use.
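To make the point about stream ownership concrete, here is a minimal sketch of one way a closed-stream error can arise, alongside an alternative that returns a byte array. GetClosedStream and GetFileBytes are hypothetical helpers written for this illustration. They are not part of your service or of Aspose.Words, and this is not a diagnosis of your actual GetFileStream implementation.

```csharp
using System;
using System.IO;

class Program
{
    // Hypothetical helper that mimics a method disposing its stream before returning it.
    static MemoryStream GetClosedStream()
    {
        MemoryStream ms = new MemoryStream(new byte[] { 1, 2, 3 });
        ms.Dispose();
        return ms;
    }

    // Hypothetical alternative: return the bytes, so the stream never leaves this method.
    static byte[] GetFileBytes()
    {
        using (MemoryStream ms = new MemoryStream(new byte[] { 1, 2, 3 }))
        {
            return ms.ToArray();
        }
    }

    static void Main()
    {
        MemoryStream closed = GetClosedStream();
        try
        {
            // Setting Position on a disposed MemoryStream throws ObjectDisposedException.
            closed.Position = 0;
        }
        catch (ObjectDisposedException ex)
        {
            Console.WriteLine(ex.Message);
        }

        // The caller creates and disposes its own stream over the returned bytes.
        using (MemoryStream memory = new MemoryStream(GetFileBytes()))
        {
            Console.WriteLine(memory.Length);
        }
    }
}
```

With the byte array approach, the method that opens the stream is also the one that disposes it, and the caller owns only the stream it creates itself.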
@SouvikMoun sorry, but I cannot reproduce your issue with the data you provided. Can you please build a simple console application with the code I shared and test it, or attach a simple application that allows us to reproduce your problem?