Multi-Column PDF to HTML

We are converting a multi-column PDF to HTML. The output does not align the text as it appears in the PDF file.
Is there a way to convert a multi-column PDF to HTML, or to extract text from a multi-column PDF in the proper format with styles?

@Shishir_Khadka

Thank you for contacting support.

Would you please share the source and generated files along with the code snippet that you are using for this conversion? Please ZIP and attach the requested data so that we can proceed to help you out. Alternatively, you may generate an image of each page in the PDF document, as explained in Convert PDF Pages, and use these images in an HTML file. Regarding text extraction, you can convert a PDF file to a TXT file as explained in Working with Text.
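As a rough illustration of the image-based alternative, the sketch below renders each page to a JPEG and writes a plain HTML file that references the images in page order. It is a minimal sketch rather than a definitive implementation: the folder names, the output file names, the 150 DPI resolution and the JPEG quality of 90 are assumptions chosen for the example, and it assumes the Aspose.PDF for .NET assembly is referenced.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

class PdfPagesToImageHtml
{
    static void Main()
    {
        // Hypothetical folders; adjust to your environment.
        string dataDir = "./";
        string outputDir = "./output/";
        Directory.CreateDirectory(outputDir);

        // Open the source document
        Document doc = new Document(dataDir + "sample_pdf.pdf");

        // 150 DPI and quality 90 are arbitrary illustrative values
        JpegDevice device = new JpegDevice(new Resolution(150), 90);

        StringBuilder html = new StringBuilder();
        html.AppendLine("<!DOCTYPE html>");
        html.AppendLine("<html><body>");

        // Page numbering in Aspose.PDF starts at 1
        for (int pageNumber = 1; pageNumber <= doc.Pages.Count; pageNumber++)
        {
            string imageName = "page_" + pageNumber + ".jpg";

            // Render the page into its own image file
            using (FileStream imageStream = new FileStream(outputDir + imageName, FileMode.Create))
            {
                device.Process(doc.Pages[pageNumber], imageStream);
            }

            // Reference the image from the HTML file, one image per page
            html.AppendLine("<img src=\"" + imageName + "\" alt=\"Page " + pageNumber + "\" />");
        }

        html.AppendLine("</body></html>");
        File.WriteAllText(outputDir + "pages.html", html.ToString());
    }
}
```

Please note that this approach keeps the page appearance only: the text inside the images cannot be selected or searched, and the columns remain as they are in the PDF.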

Hi, @Farhan.Raza,
I have attached the sample PDF, HTML and text files. I have also included the code snippet for the process. Please look at the readme file for more details.
I have also included the expected output files.
Let me know if you need anything else.

Thanks,
Shishir

Aspose Pdf.zip (94.4 KB)

@Shishir_Khadka

Thank you for sharing the requested data.

Regarding PDF to HTML conversion, the HTML file generated by Aspose.PDF API in the folder actual_output appears to be nearly identical to the source PDF file. Please clarify whether you want a single-column HTML file that looks different from the source PDF file.

Regarding PDF to TXT conversion, Aspose.PDF API can extract the text more accurately with the code snippet below. However, the extracted text includes the header and footer, which are not present when the same PDF file is exported to a TXT file by Adobe Acrobat. Therefore, a ticket with ID PDFNET-44621 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

        // Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
        // Open the source document (dataDir is the folder that holds the sample files)
        Document doc = new Document(dataDir + "sample_pdf.pdf");
        // Create a TextAbsorber object to extract the text
        TextAbsorber absorber = new TextAbsorber();
        // Set the Pure text formatting mode
        absorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        // Accept the absorber for all pages of the document
        doc.Pages.Accept(absorber);
        // Get the extracted text
        string extractedText = absorber.Text;
        // Write the extracted text to a file; the using block closes the writer even if writing fails
        using (TextWriter tw = new StreamWriter(dataDir + "extracted-text_18.4.txt"))
        {
            tw.WriteLine(extractedText);
        }

We are sorry for the inconvenience.

Thank you. I will give it a try.
For the HTML conversion: yes, I am expecting the HTML to be one column instead of two columns as in the original PDF.
When I use Adobe to convert it into HTML/Word, I get single-column output, which is what I would want from Aspose as well.

@Shishir_Khadka

Regarding PDF to HTML conversion with a single-column layout that differs from the source PDF, a ticket with ID PDFNET-44625 has been logged in our issue management system for further investigation. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

Thank you @Farhan.Raza.
This is a very important feature for us. Any estimate on when this would be resolved?

@Shishir_Khadka

The issue you reported, PDFNET-44625, has only recently been logged in our issue management system. It will be investigated in its turn, which can take several months owing to previously logged and critical issues. We appreciate your patience and understanding in this regard.

However, we also offer Paid Support, where issues are investigated with higher priority. Customers with a paid support subscription report their issues there so that they can be investigated urgently. If your reported issue is a blocker, you may consider subscribing to Paid Support. For further information, please visit Paid Support FAQ.