Convert PDF to RTF, HTML and TEXT

ScottAC · December 10, 2012, 12:52am

Hi,

I need to convert a PDF file into RTF, HTML and TEXT format. I guess I'll need to use Aspose.PDF to convert the PDF into DOC format and then Aspose.Words to convert the DOC to RTF, HTML or TEXT. (I know Aspose.PDF can convert to HTML and TEXT already, but it can't do RTF and I'm already familiar with Aspose.Words and have my own methods for TEXT output).

Can this be done ?

If so, do you have a simple code sample (C#) of reading a PDF file, converting to DOC (without saving to disk) and then using Aspose.Words to save as RTF format ?

Cheers,

Scott

codewarior · December 10, 2012, 1:07am

ScottAC:
I need to convert a PDF file into RTF, HTML and TEXT format. I guess I’ll need to use Aspose.PDF to convert the PDF into DOC format and then Aspose.Words to convert the DOC to RTF, HTML or TEXT. (I know Aspose.PDF can convert to HTML and TEXT already, but it can’t do RTF and I’m already familiar with Aspose.Words and have my own methods for TEXT output).
Can this be done ?
If so, do you have a simple code sample (C#) of reading a PDF file, converting to DOC (without saving to disk) and then using Aspose.Words to save as RTF format ?

Hi Scott,

Thanks for contacting support, and sorry for the late reply.

Yes, you are correct. Aspose.Pdf for .NET can convert a PDF document into DOC, HTML, XML, TEXT, TeX and XPS formats. To meet your requirement, you can use Aspose.Pdf for .NET to convert the PDF file to DOC format and then use Aspose.Words for .NET to convert the DOC file to RTF format. For HTML and TEXT output, however, you can use Aspose.Pdf for .NET directly to render the source PDF file to those formats. Please see the following topics for further information:

Convert PDF file into DOC format
Convert PDF file into HTML format
Extract Text from all the Pages using Text Device (save the output in simple text file)

You may try using Aspose.Pdf.Document.Save(StreamObject) to save the output in MemoryStream object and use the same stream to instantiate Aspose.Words.Document object.
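The in-memory hand-off described above can be sketched roughly as follows. This is only an illustration: the file names "input.pdf" and "output.rtf" are placeholders, the class name is hypothetical, and it assumes the stream is still open after Aspose.Pdf finishes saving, which may depend on the Aspose.Pdf version in use.

```csharp
using System.IO;

// Sketch only: assumes Aspose.Pdf and Aspose.Words for .NET are referenced.
// "input.pdf" and "output.rtf" are placeholder file names.
class PdfToRtfSketch
{
    static void Main()
    {
        Aspose.Pdf.Document pdf = new Aspose.Pdf.Document("input.pdf");

        using (MemoryStream docStream = new MemoryStream())
        {
            // Save the PDF as DOC into memory instead of to disk.
            pdf.Save(docStream, new Aspose.Pdf.DocSaveOptions());

            // Rewind so Aspose.Words reads from the start of the DOC data.
            // Assumption: the stream has not been closed by the Save call.
            docStream.Position = 0;

            Aspose.Words.Document doc = new Aspose.Words.Document(docStream);
            doc.Save("output.rtf", Aspose.Words.SaveFormat.Rtf);
        }
    }
}
```

Fully qualified type names are used because both libraries define a class named Document.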

tahir.manzoor · December 10, 2012, 12:26pm

Hi Scott,

I am a representative of the Aspose.Words team. Thanks for your inquiry. Yes, you can convert Docx/Doc to RTF, HTML or TEXT using Aspose.Words. Please read about the Aspose.Words LoadFormat and SaveFormat enumerations in the documentation. For example:

Document doc = new Document(MyDir + "in.doc");

// Save document to RTF
doc.Save(MyDir + @"AsposeOut.rtf", SaveFormat.Rtf);
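The same pattern should apply to the other two formats mentioned above. A minimal sketch, assuming a file named "in.doc" in the working directory (the class name and the output file names are placeholders):

```csharp
using Aspose.Words;

// Sketch only: assumes Aspose.Words for .NET is referenced.
class SaveOtherFormatsSketch
{
    static void Main()
    {
        Document doc = new Document("in.doc");

        // Save the same document as HTML and as plain text.
        doc.Save("AsposeOut.html", SaveFormat.Html);
        doc.Save("AsposeOut.txt", SaveFormat.Text);
    }
}
```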

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

ScottAC · December 13, 2012, 11:24pm

I've tried this and have 3 problems:

Sample program attached.

.Net Aspose.PDF version 7.5.0.0

.Net Aspose.Words 11.7.0.0

1. When the output Doc file from Aspose.PDF is opened in Aspose.Words and saved again as a Doc file with another name the format is corrupted. It seems to be putting all the pages on one page over the top.

2. I can't use a MemoryStream to pass the Doc format data between Aspose.PDF and Aspose.Words. I think Aspose.PDF.Save is closing the MemoryStream.

3. It only converted the first 3 pages of the document. Is this because I'm running an unlicensed version at the moment while I evaluate the product?

Cheers, Scott

codewarior · December 14, 2012, 1:18am

ScottAC:

1. When the output Doc file from Aspose.PDF is opened in Aspose.Words and saved again as a Doc file with another name the format is corrupted. It seems to be putting all the pages on one page over the top.

Hi Scott,

Thanks for sharing the details.

It seems to be an issue related to Aspose.Words, because the output .doc generated with Aspose.Pdf is correct (see attached milford.doc). My colleague from the Aspose.Words team will test this scenario further and share his findings.

ScottAC:
2. I can’t use a MemoryStream to pass the Doc format data between Aspose.PDF and Aspose.Words. I think Aspose.PDF.Save is closing the MemoryStream.

We have already logged this issue as PDFNEWNET-31684 in our issue tracking system. As soon as it is resolved, we will update you on the status of the correction.
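Until that issue is fixed, one possible workaround is to copy the bytes out of the MemoryStream rather than reading from it directly, since MemoryStream.ToArray() still works after the stream has been closed. The following is an untested sketch: the file names and class name are placeholders, and it assumes Aspose.Pdf has written the complete DOC data to the stream before closing it.

```csharp
using System.IO;

// Sketch only: assumes Aspose.Pdf and Aspose.Words for .NET are referenced.
// "input.pdf" and "output.rtf" are placeholder file names.
class ClosedStreamWorkaroundSketch
{
    static void Main()
    {
        Aspose.Pdf.Document pdf = new Aspose.Pdf.Document("input.pdf");

        MemoryStream pdfOutput = new MemoryStream();
        pdf.Save(pdfOutput, new Aspose.Pdf.DocSaveOptions());

        // ToArray() works even on a closed MemoryStream, so the DOC bytes
        // can be copied out regardless of the state Save left the stream in.
        byte[] docBytes = pdfOutput.ToArray();

        // Hand Aspose.Words a fresh, open stream over the same bytes.
        using (MemoryStream wordsInput = new MemoryStream(docBytes))
        {
            Aspose.Words.Document doc = new Aspose.Words.Document(wordsInput);
            doc.Save("output.rtf", Aspose.Words.SaveFormat.Rtf);
        }
    }
}
```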

ScottAC:
3. It only converted the first 3 pages of the document. Is this because I'm running an unlicensed version at the moment while I evaluate the product?

Yes, it is a limitation of the evaluation/trial version. However, before you purchase our products, you can also request a 30-day temporary license. For further information, please see "Get a temporary license".
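Once a license file is available (temporary or purchased), it is normally applied once at startup, before any documents are loaded. A minimal sketch, assuming the license file is named "Aspose.Total.lic" (a placeholder name, as is the class name) and sits next to the executable:

```csharp
// Sketch only: assumes Aspose.Pdf and Aspose.Words for .NET are referenced.
// "Aspose.Total.lic" is a placeholder for whatever license file you receive.
class ApplyLicenseSketch
{
    static void Main()
    {
        // Each product is licensed separately through its own License class.
        Aspose.Pdf.License pdfLicense = new Aspose.Pdf.License();
        pdfLicense.SetLicense("Aspose.Total.lic");

        Aspose.Words.License wordsLicense = new Aspose.Words.License();
        wordsLicense.SetLicense("Aspose.Total.lic");

        // Load and convert documents after this point.
    }
}
```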

We are sorry for the inconvenience.

tahir.manzoor · December 14, 2012, 2:18am

Hi Scott,

ScottAC:

When the output Doc file from Aspose.PDF is opened in Aspose.Words and saved again as a Doc file with another name the format is corrupted. It seems to be putting all the pages on one page over the top.

Thanks for your inquiry. I have tested the scenario and have not found any issue with the output document (Doc) when using the latest version of Aspose.Words for .NET. Please use the latest version of Aspose.Words for .NET; the output Doc files are in the attachment.

I have made a small modification to your code: please set DocSaveOptions.Mode to DocSaveOptions.RecognitionMode.Flow, as shown below. Please see the documentation below for reference.

https://docs.aspose.com/pdf/net/convert-pdf-to-word/

Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(MyDir + "milford.pdf");
Aspose.Pdf.DocSaveOptions saveOptions = new Aspose.Pdf.DocSaveOptions();
saveOptions.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow;
pdf.Save(MyDir + "milford.doc", saveOptions);

Document doc = new Document(MyDir + "milford.doc");
doc.Save(MyDir + "AsposeOut.doc");

ScottAC · December 16, 2012, 6:00pm

Thanks. I didn't need to update the Aspose.Words version I had, as the Aspose.PDF SaveOptions change seems to have fixed the problem.

I have used Aspose.Words a bit, and I already have a few routines that help me convert DOC files into HTML, TEXT, and RTF. One of those routines loops through the Sections in a Document and strips out the headers and footers. I've noticed that in the Doc file saved by Aspose.PDF, things that are obviously headers and footers in the PDF don't appear as headers and footers in the Doc file, so when I try to strip them nothing happens. Can I strip them in Aspose.PDF before saving the Doc file? It seems as though I might need to load the PDF into the Generator somehow, as it would give me access to Sections, Paragraphs, etc. How can I do that? In the sample code I gave you, the Aspose.Pdf.Document object I get when I first load the PDF doesn't have a Sections object.

FYI, here's the stripping code I was using in Aspose.Words. I want something similar in Aspose.PDF.

foreach (Section sect in doc)
{
    sect.ClearHeadersFooters();
}

I've attached a different PDF file that has a header in it. (Note this PDF file was actually created using Aspose.Words MailMerge from a Docx template that had a header defined.)

codewarior · December 17, 2012, 2:23am

ScottAC:

I have used Aspose.Words a bit, and I already have a few routines that help me convert DOC files into HTML, TEXT, and RTF. One of those routines loops through the Sections in a Document and strips out the headers and footers. I’ve noticed that in the Doc file saved by Aspose.PDF, things that are obviously headers and footers in the PDF don’t appear as headers and footers in the Doc file, so when I try to strip them nothing happens.

Hi Scott,

It is a known problem, and we have already logged it as PDFNEWNET-33612 in our issue tracking system.

ScottAC:

Can I strip them in Aspose.PDF before saving the Doc file?

Aspose.Pdf for .NET can add a Header/Footer to an existing PDF file, but I am afraid it does not currently support removing headers from an existing PDF file. I have logged this requirement in our issue tracking system under the New Features list as PDFNEWNET-34634. We will investigate this requirement in detail and keep you updated on its status.

ScottAC:

It seems as though I might need to load the PDF into the Generator somehow, as it would give me access to Sections, Paragraphs, etc. How can I do that?

The Aspose.Pdf.Generator namespace is for creating PDF files from scratch; it cannot manipulate existing PDF files. However, the PdfFileStamp class in the Aspose.Pdf.Facades namespace has AddFooter(…) and AddHeader(…) methods for adding a Header/Footer to existing PDF files.

We are sorry for the inconvenience.

ScottAC · December 17, 2012, 5:01pm

Thank you.

Any idea when PDFNEWNET-33612 is scheduled to be fixed ?

Cheers,

Scott

codewarior · December 18, 2012, 5:12am

ScottAC:

Any idea when PDFNEWNET-33612 is scheduled to be fixed ?

Hi Scott,

The development team has been busy resolving other priority issues, and I am afraid the aforementioned issue is not yet resolved. Nevertheless, I have asked the development team to share an ETA for its resolution. Please be patient and allow us a little time.

We are sorry for this delay and inconvenience.

codewarior · December 19, 2012, 1:13am

Hi Scott,

I have discussed this further with the development team. As per our current estimates, we plan to resolve PDFNEWNET-33612 in the next release, Aspose.Pdf for .NET 7.7.0 (expected in January 2013), but this is not a promise. We will try our best to resolve the issue by then.

Please be patient and allow us a little time.

aspose.notifier · February 7, 2013, 9:41am

The issues you have found earlier (filed as PDFNEWNET-33612) have been fixed in Aspose.Pdf for .NET 7.7.0.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

ScottAC · March 25, 2013, 2:30am

Hi,

This still doesn't seem to be working for me.

Using the original "InsurerQuoteSchedule.pdf" file I uploaded and the code below, the output Word document does not have a header. Also, I have attached a second file, "Sample.pdf", that has a header and a footer, and it has neither when converted to Word.

using System;
using System.Collections.Specialized;
using System.Data;
using System.IO;
using System.Data.SqlClient;
using System.Xml;
using System.Net;
using System.Net.Mail;
using System.Threading;
using Aspose.Words;
using ASPOSEPDF = global::Aspose.Pdf;
using global::Aspose.Pdf.Generator;

namespace AsposePdfWords
{
    public class StartRun
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        static int Main(string[] args)
        {
            String outputFile1 = "DocFromAsposePDF.doc";

            // PROBLEM 1 - PDF file with Headers and Footers doesn't have any headers and footers when converted to Doc file
            {
                ASPOSEPDF.Document PDFdoc = new ASPOSEPDF.Document("InsurerQuoteSchedule.pdf");
                ASPOSEPDF.DocSaveOptions saveOptions = new ASPOSEPDF.DocSaveOptions();
                saveOptions.Mode = ASPOSEPDF.DocSaveOptions.RecognitionMode.Flow;
                PDFdoc.Save(outputFile1, saveOptions);
            }

            Console.WriteLine("Press any key to continue...");
            Console.ReadKey(true);
            return 0;
        }
    }
}

codewarior · March 25, 2013, 10:02am

Hi Scott,

Thanks for contacting support.

I have tested the scenario again by converting InsurerQuoteSchedule.pdf and Sample.pdf to DOC format, and I can confirm that the Header/Footer do not appear properly. When converting Sample.pdf to DOC format, the contents of the Header/Footer appear as normal text paragraphs inside the DOC file. Furthermore, when converting InsurerQuoteSchedule.pdf, the contents (text, image) appear in the Header area but are not inside a Header object.

I have reopened issue PDFNEWNET-33612, attached this information to it, and asked the development team to look into this matter again. We are really sorry for the inconvenience.

codewarior · March 26, 2013, 2:45pm

Hi Scott,

Thanks for your patience.

We have looked further into the requirement of removing a Header/Footer from a PDF file, and to meet it the PdfFileStamp.StampId property was added. This property sets an identifier for a newly created stamp (including a header, footer, or page number). An added stamp can then be removed using PdfContentEditor: we add a header, footer, and page number, giving each an arbitrary ID, and remove them later by that ID.

This feature will become available in the upcoming release, Aspose.Pdf for .NET 7.8.0.

PdfFileStamp pfe = new PdfFileStamp("PdfWithSeveralPages.pdf", "34634.pdf");

// 100 is stampId for footer
pfe.StampId = 100;
pfe.AddFooter(new FormattedText("Footer"), 10);

// 200 is stampId for header
pfe.StampId = 200;
pfe.AddHeader(new FormattedText("Header"), 10);

// 300 is stampId for page number
pfe.StampId = 300;
pfe.AddPageNumber(new FormattedText("Page #", System.Drawing.Color.Red, System.Drawing.Color.Blue));
pfe.Close();

PdfContentEditor pce = new PdfContentEditor();
pce.BindPdf("34634.pdf");
StampInfo[] stamps = pce.GetStamps(1);
Console.WriteLine(stamps.Length);
Assert.AreEqual(3, stamps.Length);

// show found stamps IDs
foreach (StampInfo info in stamps)
{
    Console.WriteLine(info.StampId);
}

// remove header, footer, and page number
pce.DeleteStampById(100);
pce.DeleteStampById(200);
pce.DeleteStampById(300);
pce.Save("34634-1.pdf");

PdfContentEditor pce1 = new PdfContentEditor();
pce1.BindPdf("34634-1.pdf");

ScottAC · March 26, 2013, 4:34pm

Sorry, but I don't understand how that will solve my problem. Is this a response to my message from yesterday or to something further back in the topic?

I'm not creating the PDF files. I'm just trying to convert PDF files into other formats (RTF, HTML, etc.) and I need to remove the headers and footers.

The first step I do is to convert the PDF file into a DOC file. I just need the headers and footers in the PDF files to appear as headers and footers in the DOC file.

I had attached sample PDF files that had headers and footers.

aspose.notifier · March 27, 2013, 6:14am

The issues you have found earlier (filed as PDFNEWNET-34634) have been fixed in Aspose.Pdf for .NET 7.8.0.


codewarior · March 27, 2013, 1:15pm

ScottAC:

but I don’t understand how that will solve my problem. Is this a response to my message from yesterday or something further back in the topic ?

Hi Scott,

My earlier response relates to your request to remove the Header/Footer from a PDF file.

ScottAC:

I’m not creating the PDF files, I’m just trying to convert PDF files into other formats (RTF, HTML, etc…) and I need to remove the headers and footers.

In my earlier response (454116), I shared a code snippet that can be used to remove a Header/Footer from a PDF file. Once the Header/Footer is removed, you can convert the files into other formats.

ScottAC:

The first step i do is to convert the PDF file into a DOC file. I just need the headers and footers in the PDF files to appear as headers and footers in the DOC file.

As shared in 453737, the problem of Header/Footers not appearing properly as Header/Footers in the DOC file still persists. The team is looking into the details of this issue (PDFNEWNET-33612), and as soon as we have made significant progress towards its resolution, we will update you on the status of the correction. Please be patient and allow us a little time.

We are sorry for the inconvenience.

ScottAC · July 4, 2013, 10:47pm

Hi,

As requested, I have been patient and spared you a little time. Are there any further updates on this problem, PDFNEWNET-33612?

Scott

codewarior · July 5, 2013, 2:47am

Hi Scott,

Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the above-stated problem is not yet resolved. Nevertheless, I have asked the development team to share any possible ETA. As soon as I have an update regarding its resolution, I will let you know.

We are sorry for this delay and inconvenience.