How to get heading tags as part of the PDF to HTML conversion

muraliar · March 31, 2021, 4:48am

We are currently using Aspose.PDF to convert PDF to HTML. We would like to know whether there is an option to generate heading, italic, and bold tags during the conversion. When I convert the same PDF to HTML using Microsoft Word, it generates the heading tags. Also, is there a way to remove header and footer content from the PDF before converting it to HTML?

asad.ali · March 31, 2021, 5:15pm

@muraliar

Could you please share the source PDF and the output HTML files for our reference, along with a sample code snippet? Also, please share the HTML generated by MS Word. We will test the scenario in our environment and address it accordingly.

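You can search for text within a specified rectangle (such as the header or footer area) and remove it using the TextFragmentAbsorber class. The snippet below restricts the search to the top 72 points of a page, i.e. the header area: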

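// load the source PDF ("input.pdf" is a placeholder path) and take its first page
Aspose.Pdf.Document doc = new Aspose.Pdf.Document("input.pdf");
Aspose.Pdf.Page page = doc.Pages[1];
// instantiate the TextFragmentAbsorber object
Aspose.Pdf.Text.TextFragmentAbsorber TextFragmentAbsorberAddress = new Aspose.Pdf.Text.TextFragmentAbsorber();
// limit the search to the page bounds
TextFragmentAbsorberAddress.TextSearchOptions.LimitToPageBounds = true;
// restrict the search to the top 72 points (1 inch) of the page, where a header usually sits
TextFragmentAbsorberAddress.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(0, page.PageInfo.Height - 72, page.PageInfo.Width, page.PageInfo.Height);
// run the search on the page; matches are collected in TextFragmentAbsorberAddress.TextFragments
page.Accept(TextFragmentAbsorberAddress);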
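
The search above only locates the text. A minimal sketch of how it might be extended to blank the header text on every page and then save the result as HTML is given below. The file names, the 72-point header height, and the class name are assumptions made for illustration, and the sketch has not been run against your files, so please treat it as a starting point rather than a finished solution.

```csharp
using Aspose.Pdf.Text;

class RemoveHeaderThenConvert
{
    static void Main()
    {
        // "input.pdf" and "output.html" are placeholder paths (assumptions).
        Aspose.Pdf.Document doc = new Aspose.Pdf.Document("input.pdf");

        // Assumed header height in points (72 points = 1 inch); adjust per document.
        const double headerHeight = 72;

        foreach (Aspose.Pdf.Page page in doc.Pages)
        {
            // Search only the strip at the top of the current page.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            absorber.TextSearchOptions.LimitToPageBounds = true;
            absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(
                0, page.PageInfo.Height - headerHeight,
                page.PageInfo.Width, page.PageInfo.Height);
            page.Accept(absorber);

            // Replace the text of every fragment found in the strip with an empty string.
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                fragment.Text = string.Empty;
            }
        }

        // Convert the cleaned document to HTML with default save options.
        doc.Save("output.html", new Aspose.Pdf.HtmlSaveOptions());
    }
}
```

For a footer, the same approach should work with a rectangle that covers the bottom strip of the page instead, for example from y = 0 to y = 72.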

muraliar · April 3, 2021, 5:42am

Thanks for the tip on removing specific text using the TextFragmentAbsorber class.

Regarding the other issue, I will provide the sample files and the code snippet in a couple of days.

asad.ali · April 5, 2021, 8:15am

@muraliar

Sure, please take your time to gather the material to share.