Converting PDF to HTML: split tables problem

Hello,


I’m working on converting PDF to HTML. I do this just like in this post, in the section “Save output HTML to a single stream with embedded resources”. I have a problem with tables: they are split, but they shouldn’t be. I tried setting the bottom and top margins of the pages to 0, but it had no effect. I have attached my PDF and HTML files. How can I remove the free space between them? Is it possible to merge many small pages into one big page?

Hi Marcin,


Thanks for using our APIs.

I have tested the scenario and I can reproduce the problem. I have logged it as PDFNEWNET-39333 in our issue tracking system. We will look further into the details and keep you updated on the status of the fix. Please be patient and give us a little time. We are sorry for this inconvenience.

Now, concerning your question “Is it possible to merge many small pages to one big page”: do you mean adding pages as N-up, or do you need to add small pages to an empty area on an existing main page? Please share the details so we can reply accordingly.

I suppose it may be a workaround: merging many pages into one would probably solve the split tables problem, but I’m not sure about it.

Hi Marcin,


Thanks for your feedback. I am afraid that merging multiple pages into a single page is not supported in Aspose.Pdf. We have already logged a similar requirement as PDFNEWNET-36455 in our issue tracking system. We have linked your thread to that issue ID as well and will update you as soon as it is resolved.

However, we recommend waiting for the fix of the original issue.

Best Regards,

Hi Marcin,


Thanks for your patience. In reference to PDFNEWNET-39333, our product team has investigated the issue and found that it is not a bug in Aspose.Pdf for .NET but is caused by the source PDF document.

In APS there are no tables as logical entities; “tables” are just text and lines placed at the right positions on the pages. Because the lines and text on different (though adjacent) pages do not form one big logical table object, they are not “split”: their constituent objects (text and lines) are simply rendered exactly as in the source PDF. To see a picture that illustrates the situation, please add a couple of lines (the ones that set PageBorderIfAny) to the snippet as follows; they make the converter draw lines showing the borders of the original pages.

// namespaces needed by the snippet
using System.IO;
using Aspose.Pdf;

Document doc = new Document("c:/pdftest/InformacjeOZastepstwach.pdf");

// tune conversion params
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.SplitIntoPages = false; // force write HTMLs of all pages into one output document
newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);

// added lines: draw a dotted blue border around each original page
SaveOptions.BorderPartStyle borderStyle = new SaveOptions.BorderPartStyle();
borderStyle.LineType = SaveOptions.HtmlBorderLineType.Dotted;
borderStyle.Color = System.Drawing.Color.Blue;
newOptions.PageBorderIfAny = new SaveOptions.BorderInfo(borderStyle);

// we can use some non-existing path as the result file name - all real saving will be done
// in our custom method SavingToStream() (it follows this one)
string outHtmlFile = @"c:\pdftest\Final_SomeUnexistingFile.html";
doc.Save(outHtmlFile, newOptions);

private static void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
    // here you can use any writable stream, file stream is taken just as example
    string fileName = "c:/pdftest/stream_out.html";

    // File.Create truncates an existing file; the using block closes the stream when done
    using (Stream outStream = File.Create(fileName))
    {
        // copy the generated HTML from the converter's stream to the output stream
        htmlSavingInfo.ContentStream.CopyTo(outStream);
    }
}

In this case the borders of the original pages are shown in the resulting HTML. Please look at the attached screenshot. We can see that the tables appear split because they are placed too far from the page edges in the original PDF.

So the reason for this situation is that the lines and text that depict the tables are placed too far from the page edges in the original PDF. To correct the situation and make the tables appear visually "glued" together in the result, it is necessary to correct the original PDF.
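To make the reasoning concrete: when pages are stacked one after another, the gap seen between two table fragments is the empty space below the last drawn line on one page plus the empty space above the first drawn line on the next page. The sketch below is only illustrative arithmetic. The PageContent type, the GapDemo class, and the sample numbers are hypothetical; they are not part of Aspose.Pdf and were not measured from the attached PDF.

```csharp
using System;

// Hypothetical model: vertical extent of the drawn content on one page,
// in points, measured downward from the top edge of the page.
public record PageContent(double PageHeight, double ContentTop, double ContentBottom);

public static class GapDemo
{
    // Empty space visible between the content of two consecutive pages
    // when the pages are stacked vertically with no spacing between them.
    public static double VisibleGap(PageContent upper, PageContent lower)
    {
        double spaceBelowUpper = upper.PageHeight - upper.ContentBottom;
        double spaceAboveLower = lower.ContentTop;
        return spaceBelowUpper + spaceAboveLower;
    }

    public static void Main()
    {
        // Assumed sample values: A4-height pages (842 pt) with content 40 pt below the top edge.
        var page1 = new PageContent(842, 40, 800);
        var page2 = new PageContent(842, 40, 500);

        // (842 - 800) + 40 = 82
        Console.WriteLine(VisibleGap(page1, page2));
    }
}
```

Under this model the gap disappears only when the content on each page reaches the page edges, which is why the fix has to be made in the source PDF rather than in the conversion options.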


We are sorry for the inconvenience caused.


Best Regards,