What is the best approach to change page size/orientation when converting HTML to PDF

ksubramaniyam · March 16, 2016, 10:29pm

I am evaluating your component for converting HTML to PDF and require the ability to control page sizes after the HTML has been imported into Aspose. Although an example is not strictly necessary for my question, I will illustrate one scenario where this is needed. Refer to the 3rd page of the attached HTML file: if it is converted to PDF, the 3rd page’s table will be cut off under normal settings.

Let’s say the HTML has 3 pages, and page 3 contains a very large table. If page 3 is rendered using 8.5x11, half the table would overflow outside the page. We prefer to print pages in standard 8.5x11 format when possible. But for page 3, we need to set the page size to 17x11 (or something bigger than 8.5x11)

What is the best way to achieve this?

1) Is there any code/style that we can specify in the HTML/CSS to control the page size for certain pages (in this case the 3rd page)?

2) We tried using TableAbsorber as per the link below to set the 3rd page’s size and didn’t see the page dimension changing when the PDF was changed. There was also a bug with this that I have separately logged in another thread.

Working with Tables in PDF using C#|Aspose.PDF for .NET

3) We tried using PDFPageEditor as shown in the link below. The page was set to the appropriate size, but the content that was overflowing outside the initial page was still lost (i.e., it seems the page size information is required during the HTML to PDF conversion process and not after?).

Changing page sizes in PDF file|Aspose.PDF for .NET

4) I know we can set the htmloptions while converting HTML to PDF to specify the page size (before conversion) but this setting is applied globally to the whole html string (all pages) and we don’t see how to apply it to only page 3.

So the questions we have are:

1) What is your recommended way to achieve what we are looking for? Basically, once the HTML is imported, how do we change certain pages to a larger page size and recover any overflown content, or how do we specify tags/styles in the HTML to control page sizing when converting?

2) Is there documentation on how the HTML is converted to PDF? Basically, we would like to know which tags/styles/CSS are supported so we can research solutions for the above on our own. Does Aspose.PDF support the @page CSS rule for printing, etc.?

3) Do you offer any page-level support for zooming, scaling, etc. that can be controlled from the HTML to achieve this?

The attached example is a simple one where the last column gets cut off, but we do have cases where bigger portions of the table get cut off.

Thanks for your help.

Kuna

tilal.ahmad · March 17, 2016, 10:45pm

Hi Kuna,

Thanks for your inquiry. We are looking into it and will update you soon.

Best Regards,

ksubramaniyam · March 21, 2016, 1:29pm

any updates?

We have tried many alternatives and seem to have run into walls (bugs or APIs not working as we would expect). Any direction on how to resize certain pages and have the converted HTML content reflow would be appreciated!

tilal.ahmad · March 23, 2016, 2:23pm

Hi Kuna,

We are sorry for the inconvenience. I am afraid we cannot set the width/height of the resultant PDF document to less than the minimal width/height of an HTML page during HTML to PDF conversion. However, as a workaround, you can resize the contents of the resultant PDF document using the ResizeContents() method of the PdfFileEditor class. You may resize specific pages or the whole PDF document, as follows.

HtmlLoadOptions htmloptions = new HtmlLoadOptions();

// Load HTML file into a Document object
Document doc = new Document("aspose.source.html", htmloptions);

// Process paragraphs to ensure all content is adjusted properly
doc.processParagraphs();

// Create an array for page numbers to resize
int[] pageNumbers = new int[doc.getPages().size()];
for (int i = 0; i < doc.getPages().size(); i++) {
    pageNumbers[i] = i + 1;
}

// Create PdfFileEditor and set resize parameters
PdfFileEditor pfe = new PdfFileEditor();
PdfFileEditor.ContentsResizeParameters resizeParams = PdfFileEditor.ContentsResizeParameters.pageResize(17 * 72, 11 * 72);
pfe.resizeContents(doc, pageNumbers, resizeParams);

// Save the resized document as PDF
doc.save("aspose.source.pdf");
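
For reference, the 17 * 72 and 11 * 72 arguments above convert inches to PDF points (1 inch = 72 points). A minimal plain-Java sketch of that conversion follows; the class and method names are hypothetical and no Aspose API is involved.

```java
// Hypothetical helper, not part of Aspose.PDF: converts inches to PDF points.
final class PagePoints {
    static final double POINTS_PER_INCH = 72.0;

    // One PDF point is 1/72 of an inch.
    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        System.out.println(inchesToPoints(8.5)); // 612.0  (letter width)
        System.out.println(inchesToPoints(11));  // 792.0  (letter height)
        System.out.println(inchesToPoints(17));  // 1224.0 (17-inch side)
    }
}
```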

Best Regards,

ksubramaniyam · March 29, 2016, 1:32pm

Hi,

Please ensure you read this carefully to understand the issue we are facing. Your previous solution doesn’t seem to address our issue (see below).

Issue:

We are trying to convert HTML to PDF using Aspose.PDF. We have attached a sample HTML file (test2.html) which contains some tables, all set to 100% width. When converting this HTML to PDF, some of the tables get cut off. Our challenge is to build a workaround for this by detecting content overflow, increasing the page size, and reconverting.

Limitations

At the time of HTML to PDF conversion, we won’t know the structure of the HTML to preset any page size settings.

Our goal is to use the smallest paper size possible.

Our problem

We have not found a foolproof way to detect content overflow (see the code for what we have tried). When we calculate the overflow and apply a new page size, conversion works correctly in most cases but not in cases like the one illustrated here.

Tests

Find the code we are using in the "Code.txt" file.

Test of your solution

Test #1, pdf_1_8x11.pdf: Straight conversion with default settings to illustrate the overflow content.

Test #2, pdf_1_11x17.pdf: Using your solution with a larger page size. Overflow is still there.

Test #3, pdf_1_23x31.pdf: Using your solution with the largest page size. Overflow is still there.

Conclusion:

Your solution merely scales the PDF and does not help us with recovering overflow content.

Our tests

Now we try to set the page size before conversion:

Test #4, pdf_2_a4.pdf: Overflow gets better

Test #5, pdf_2_a3.pdf: Overflow gets better

Test #6, pdf_2_a2.pdf: Overflow gone but still not good with margins

Test #7, pdf_2_a1.pdf: Best solution

Observations:

1) Obviously, we cannot convert our HTML at A1 page size all the time. Nor is there a need to. You will see that the same HTML can be printed on 8.5x11 from the browser fine or converted to PDF through the browser to 8.5x11 size fine (file: pdf_2_chrome.pdf).

2) The code that we have come up with to detect whether there is overflow content doesn’t work here. Refer to the function “HasOverflowContent” and the resultant log file, “log.txt”. Is there a better API to detect content overflows or the current “correct” page size? We use the Page.CalculateContentBBox() API. We have tried playing with BleedBox, ArtBox, CropBox, TrimBox and Page.Rect but have not had success in detecting overflow.

3) You will notice another anomaly that’s making it harder to fix the issue. When converted using A4, the 2nd table’s last column gets cut off, but when we use A3 (larger size), the same result is seen. We would expect larger page sizes to show more content.

4) Besides, there is obviously an issue with tables that have a 100% width setting being rendered differently until the paper size is set to A2/A1. This inconsistent table width interpretation is probably another bug on your end.

Conclusion:

We believe #1, #3 and #4 are bugs. There is no need to require A1 paper size to fit HTML content that fits perfectly well on an 8.5x11 page (or even smaller).

Having pointed out the bugs, we need to find a workaround for them in the meantime. If we can detect that a page is, in fact, in an overflow situation, we can try larger and larger page sizes programmatically until the largest size is reached or no overflow is detected.

NOTE: the example provided works at A2 and higher. But we have other cases that work at 11x17 or other paper sizes (or in landscape mode). We need a dynamic solution that can detect and adjust accordingly. Our goal is to use the smallest paper size possible.

What we need:

For now we only need help with detecting page overflows 100% of the time. Please provide code that can detect whether the current page is in an overflow situation and, preferably, what the ideal size for the page would be.

Thanks

Kuna

ksubramaniyam · March 29, 2016, 1:42pm

quick update: there is an issue with the code we sent, in function “HasOverflowContent”

the line:

If (_page.CalculateContentBBox().Width + pdf.PageInfo.Margin.Right - pdf.PageInfo.Margin.Left) > _page.TrimBox.Width Then

should be:

If (_page.CalculateContentBBox().Width + pdf.PageInfo.Margin.Right + pdf.PageInfo.Margin.Left) > _page.TrimBox.Width Then

The issue with detecting overflow 100% still remains.
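
To make the difference between the two lines concrete, here is a minimal sketch of the comparison using plain numbers (Java, no Aspose calls). The class and method names are made up for illustration, and all widths are assumed to be in points. With equal left and right margins, the subtracting version cancels the margins out and misses the overflow.

```java
// Hypothetical illustration of the width check; not Aspose.PDF API.
final class OverflowCheck {
    // Corrected comparison: content width plus BOTH margins must fit the page width.
    static boolean hasHorizontalOverflow(double contentWidth, double marginLeft,
                                         double marginRight, double pageWidth) {
        return contentWidth + marginRight + marginLeft > pageWidth;
    }

    // Original (buggy) comparison: the left margin is subtracted instead of added.
    static boolean buggyHasHorizontalOverflow(double contentWidth, double marginLeft,
                                              double marginRight, double pageWidth) {
        return contentWidth + marginRight - marginLeft > pageWidth;
    }

    public static void main(String[] args) {
        double pageWidth = 612; // 8.5 inches at 72 points per inch
        // 600 + 20 + 20 = 640 > 612, so the corrected check reports overflow.
        System.out.println(hasHorizontalOverflow(600, 20, 20, pageWidth));      // true
        // 600 + 20 - 20 = 600 <= 612, so the buggy check misses it.
        System.out.println(buggyHasHorizontalOverflow(600, 20, 20, pageWidth)); // false
    }
}
```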

codewarior · March 31, 2016, 1:29am

Hi Kuna,

Thanks for sharing the details. We are working on testing the scenario in our environment and will get back to you soon.

ksubramaniyam · April 4, 2016, 9:12am

it’s been almost a week. any updates on this?

the question boils down to "After HTML to PDF conversion, what is the best way/API to detect if a page has overflow content (content that goes outside of the page margins)?"

You can try going through my sample but it’s not necessary as we are looking for your recommendation on the best API/way to achieve this.

tilal.ahmad · April 4, 2016, 11:36am

Hi Kuna,

We are sorry for the delayed response. I have checked your shared sample code and noticed that you are not using the suggested solution, so data is being trimmed in the resultant PDF document. For the default conversion, please do not set page margins/width; instead, resize the PDF document to the desired page settings afterwards. Please check the following amended code snippet and the attached sample output files, especially pdf_1_8x11.pdf. Hopefully it will help you accomplish the task.

However, if you still want to detect page content overflow, please confirm and we will investigate that option accordingly.

Dim _htmlLoadOptions As New Aspose.Pdf.HtmlLoadOptions()
Dim _sb As New StringBuilder()
Dim _testPath As String = "C:\Users\Home\Downloads\test (1)\Temp\" ' trailing backslash so file names append correctly

' Setup load options
' _htmlLoadOptions.PageInfo.Margin.Right = 20
' _htmlLoadOptions.PageInfo.Margin.Left = 20
' _htmlLoadOptions.PageInfo.Margin.Top = 20
' _htmlLoadOptions.PageInfo.Margin.Bottom = 20

' Test 1: Default conversion
Dim _pdf As New Aspose.Pdf.Document(_testPath & "test2.html", _htmlLoadOptions)
_pdf.Save(_testPath & "pdf_default.pdf", Aspose.Pdf.SaveFormat.Pdf)
_sb.AppendLine("Test 1: " & HasOverflowContent(_pdf))

' Test 1a: Resize to 8x11
_pdf = New Aspose.Pdf.Document(_testPath & "test2.html", _htmlLoadOptions)
ResizePagesInPlace(_pdf, 8, 11)
_pdf.Save(_testPath & "pdf_1_8x11.pdf", Aspose.Pdf.SaveFormat.Pdf)
_sb.AppendLine("Test 1a: " & HasOverflowContent(_pdf))

' Test 2: Resize to 11x17
_pdf = New Aspose.Pdf.Document(_testPath & "pdf_1_8x11.pdf")
ResizePagesInPlace(_pdf, 11, 17)
_pdf.Save(_testPath & "pdf_1_11x17.pdf", Aspose.Pdf.SaveFormat.Pdf)
_sb.AppendLine("Test 2: " & HasOverflowContent(_pdf))

' Additional logic...

We are sorry for the inconvenience.

Best Regards,

ksubramaniyam · April 5, 2016, 12:10pm

Thanks for your reply.

Now I do understand why your original solution wasn’t working for us. But this solution is not feasible for us for a number of reasons, as illustrated below. We believe this to be a bug anyway, as we do not see why setting margins would impact the conversion process.

This doesn’t work for us for the following reasons:

1) We need to be able to control margins.

2) Page resizing after the conversion uses scaling, and things get either stretched or skewed. This is not acceptable to us. (The initial page size ends up being 12.21x11.69, an odd size that doesn’t match any paper size, and it varies based on content. Again, this is not a good thing for PDFs that need to be printed.)

3) We need to convert 300+ page documents into PDF, and only one or two pages may contain large tables. With this solution, we will always be converting all the pages at the highest width required. Combined with #2, the end result is usually not acceptable from a presentation/readability point of view.

In any event, there is no need for us to convert all content at a higher page size when 98% of our pages would fit very well on 8.5x11 and would look better in this setting.

Our solution to overcome the shortcoming is as follows:

1) Start with 8.5x11 page size

2) Convert each Section of the HTML to PDF

3) Check if the section has overflow content.

4) Set the page size to higher

5) Repeat steps 3-4 for the section until there is no overflow

6) Stitch together/merge all the sectional PDFs into one PDF.
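
Steps 1 to 5 above can be sketched roughly as follows. This is only an outline in plain Java: the candidate size list is illustrative, the overflowsAt predicate is a hypothetical stand-in for "convert the section at this size and run HasOverflowContent", and the merge in step 6 is not shown.

```java
import java.util.function.BiPredicate;

// Hypothetical outline of the escalation loop; no Aspose.PDF calls are made here.
final class PageSizeEscalation {
    // Candidate page sizes in inches (width, height), smallest first. Illustrative only.
    static final double[][] CANDIDATES = { {8.5, 11}, {11, 17}, {23, 31} };

    // Returns the first candidate at which the section does not overflow,
    // or the largest candidate if every size still overflows.
    static double[] smallestFittingSize(BiPredicate<Double, Double> overflowsAt) {
        for (double[] size : CANDIDATES) {
            if (!overflowsAt.test(size[0], size[1])) {
                return size;
            }
        }
        return CANDIDATES[CANDIDATES.length - 1];
    }

    public static void main(String[] args) {
        // Stand-in check: pretend the section needs at least 10 inches of width.
        double[] chosen = smallestFittingSize((width, height) -> width < 10.0);
        System.out.println(chosen[0] + "x" + chosen[1]); // 11.0x17.0
    }
}
```

The loop is only as reliable as the overflow check passed into it, which is exactly the part we are asking about.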

This ensures that all the HTML is converted to 8.5x11 size and only the sections that overflow get converted at a higher page size. This allows us to convert the pages naturally to a proper paper size that can be printed. This seems a better compromise for us so we can maintain better 1-1 conversion quality (and avoids scaling/distortion issues).

It all works well for us as we had mentioned earlier but the “HasOverflowContent” doesn’t work in all cases. It does a decent job with about 80-90% of the cases we have tested but fails in cases like the one we outlined in the code above.

Getting past your solution above, is there a better way to detect content overflows? Can you suggest how we can fix our HasOverflowContent function?

Let’s stick to answering this question about our “HasOverflowContent”. If you can suggest a solution to make it work 100%, we can work out the rest of the logic to convert properly on our end.

Thanks

Kuna

tilal.ahmad · April 6, 2016, 11:46am

Hi Kuna,

Thanks for sharing the details, I am looking into it and will update you soon.

Best Regards,

codewarior · April 6, 2016, 1:40pm

Hi Kuna,

Thanks for sharing the details. We are looking into this matter and will get back to you soon.

tilal.ahmad · April 7, 2016, 12:18pm

Hi Kuna,

Thanks for your detailed feedback. After initial investigation we have logged an investigation ticket, PDFNEWNET-40545, in our issue tracking system. Our product team will analyse the scenario and we will suggest a solution accordingly.

We are sorry for the inconvenience caused.

Best Regards,

ksubramaniyam · April 11, 2016, 2:37pm

Can you let me know if you have any updates?

If not, is your investigation system different from bug tracking? Are you able to provide an estimate of when we can expect an answer?

Again, I have the feeling that you may not be getting what we are asking. There is no need to investigate what I posted (it’s only meant to illustrate a known bug and our workaround for it). We are looking for an API, for our workaround, that allows detection of overflow to make our HasOverflowContent function work.

Thanks

Kuna

tilal.ahmad · April 12, 2016, 10:27pm

Hi Kuna,

Thanks for your inquiry. I am afraid the reported issue is still pending investigation in the queue with other issues, as it was logged only recently. As soon as our product team completes the investigation, we will share any ETA/findings with you.

Furthermore, I understand your requirement regarding the HasOverflowContent method along with the page size issue in HTML to PDF conversion, and I have shared the details with the product team. We will keep you updated on the issue resolution progress.

We are sorry for the inconvenience.

Best Regards,