Extracting content from PDF region to HTML

spontes · March 29, 2022, 3:00pm

Hello,

I need to extract questions from PDF files. Each question may or may not have pictures and/or tables associated with it.

I have tried most of the PDFtoHTML examples provided as well as some others.

I thought of converting the PDF file to HTML so that I could then run through the code and split it by question, but I need to keep the original structure so I can reconstruct each question afterwards without losing much of the formatting.

One of my main issues is that each PDF page has a border, which I'm trying to remove so that it isn't saved as an image, since I'm not going to use it afterwards. There is also a barcode at the bottom of the page.
I would like to know if there is a way of removing these first, so that I can then convert to HTML and get only the content of the page.

I have also tried using TextAbsorber with a rectangle so that only the body content of my file was used, but then I lost the images, because the only way I could save the data was through Aspose.Words, saving in HTML format.
Then I added ImageAbsorber with ImagePlacements to put my images back, but because I lose the reference to each image's position at the beginning, I need to add it again.
It seems that I'm going around the problem, stripping everything out and reconstructing the PDF contents, so I wondered whether there might be a better way of doing this.
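
For reference, the region-based approach described above can be sketched roughly as follows. This is a minimal sketch, not the code used in this thread: the file name input.pdf is a placeholder, the rectangle values are illustrative, and the image class is assumed to be ImagePlacementAbsorber (the post calls it ImageAbsorber). It reads the text inside the rectangle and reports the position of each image on the page, so the position does not have to be guessed when the image is added back.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class RegionExtractionSketch
{
    static void Main()
    {
        // Placeholder input path (assumption, not from the thread).
        var pdfDocument = new Document("input.pdf");

        // Illustrative body region: lower-left (50, 139), upper-right (550, 790).
        var body = new Rectangle(50, 139, 550, 790);

        // Restrict text extraction to the body rectangle.
        var textAbsorber = new TextAbsorber();
        textAbsorber.TextSearchOptions.LimitToPageBounds = true;
        textAbsorber.TextSearchOptions.Rectangle = body;

        // Collect every image on the page together with its placement.
        var imageAbsorber = new ImagePlacementAbsorber();

        Page page = pdfDocument.Pages[1]; // Aspose page indexes start at 1
        page.Accept(textAbsorber);
        page.Accept(imageAbsorber);

        Console.WriteLine(textAbsorber.Text);

        foreach (ImagePlacement placement in imageAbsorber.ImagePlacements)
        {
            Rectangle r = placement.Rectangle;

            // Keep only images that lie completely inside the body region.
            bool inside = r.LLX >= body.LLX && r.LLY >= body.LLY
                       && r.URX <= body.URX && r.URY <= body.URY;
            if (!inside) continue;

            Console.WriteLine($"Image at ({r.LLX}, {r.LLY}) - ({r.URX}, {r.URY})");
        }
    }
}
```

The containment test is a plain coordinate comparison, so a page border or barcode that sits outside the body rectangle would simply be skipped. Whether this keeps enough layout information to rebuild a question is not something the sketch settles.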

Another thing that concerns me is that, when I convert to HTML using the examples provided, many of the images inside the PDF lose parts, such as captions and arrows pointing to a position in the image, and it doesn't seem to help whether I save as PNG or SVG. Sometimes, with SVG, the saved image has the size of the whole page instead of just the area of the image.
Could you explain why this happens too?

Sorry for the long post. I am using .NET.
Any help would be appreciated. Thanks

tahir.manzoor · March 29, 2022, 4:24pm

@spontes

To ensure a timely and accurate response, please attach the following resources here for testing:

Your input PDF.
Please attach the output HTML that shows the undesired behavior.
Please attach the expected output HTML that shows the desired behavior.
Please create a standalone console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you get these pieces of information ready, we will start investigation into your issue and provide you more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

spontes · May 9, 2022, 11:34am

Hello,

I was able to find a way of doing this and was going to test it with other samples, but after I installed the new Visual Studio Community 2022 last week and switched to a new temporary license, it stopped working.

Here is the error: System.InvalidOperationException: ‘Invalid font’

Here is the code, samples and desired output: files.7z (1.3 MB)

Could you tell me if this could be related to the update or with the different license?

Many thanks,
Silvia

tahir.manzoor · May 9, 2022, 4:22pm

@spontes

We have converted the shared PDF to HTML using the latest version of Aspose.PDF for .NET 22.4 and have not found the shared issue. So, please use Aspose.PDF for .NET 22.4.

If you still face the problem, please share the requested code example here for testing. Thanks for your cooperation.

spontes · May 10, 2022, 9:11am

Aspose.PDF was updated to 22.4.0 yesterday and the project was rebuilt, but it still doesn't work.

Here is the code I use to select an area of the PDF document, then to convert it to HTML.

using Aspose.Pdf;

var pdfDocument = new Document(@"C:\Users\spontes\Desktop\9979_GBY11_Summer2016.pdf");

// Delete the first page
pdfDocument.Pages.Delete(1);

// Region to keep: lower-left (50, 139), upper-right (550, 790)
var newBox = new Rectangle(50, 139, 550, 790);

// Apply the box to every page (Aspose page indexes start at 1)
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    pdfDocument.Pages[i].CropBox = newBox;
    pdfDocument.Pages[i].ArtBox = newBox;
}

// Save the cropped PDF, just as a guide
pdfDocument.Save(@"C:\Users\spontes\Desktop\crop_page_sample.pdf");

// Convert the cropped document to HTML
pdfDocument.Save(@"C:\Users\spontes\Desktop\crop_page_sample.html", SaveFormat.Html);

The “invalid font” error appears on the last step, when converting to HTML.

Thanks for the support

tahir.manzoor · May 10, 2022, 5:30pm

@spontes

We have logged this problem in our issue tracking system as PDFNET-51755. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

spontes · May 25, 2022, 9:26am

Just to let you know that I was able to fix this.
I changed the root folder of the documents and renamed everything, and then it started to work again… I'm not sure what happened.

I have another question. I now get one SVG per converted page, but is it possible to get it embedded for the selected region, instead of the PNG?
Or is it possible to get the extracted content in a more accessible form, i.e. one where the images and/or the text can be scaled? Sometimes an image in the middle of the page gets converted to an SVG with the size of the whole page/region when we only need that middle part…
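
For readers following along, the way images are written during HTML conversion can be adjusted through HtmlSaveOptions rather than the plain SaveFormat.Html overload. The sketch below is only an illustration under assumptions: the file names are placeholders, and nothing in this thread confirms that either setting fixes the page-sized SVG problem. It shows where the image saving mode and the embedding mode are set.

```csharp
using Aspose.Pdf;

class HtmlImageOptionsSketch
{
    static void Main()
    {
        // Placeholder input path (assumption, not from the thread).
        var pdfDocument = new Document("crop_page_sample.pdf");

        var options = new HtmlSaveOptions
        {
            // Keep CSS, fonts and images inside the single HTML file.
            PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml,

            // Controls how raster images are written. The other values of this
            // enum are AsPngImagesEmbeddedIntoSvg and
            // AsExternalPngFilesReferencedViaSvg.
            RasterImagesSavingMode =
                HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground
        };

        // Placeholder output path.
        pdfDocument.Save("crop_page_sample.html", options);
    }
}
```

Trying each RasterImagesSavingMode value on one sample page is probably the quickest way to see which output is closest to what is needed.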

Thanks

tahir.manzoor · May 25, 2022, 4:26pm

@spontes

It is nice to hear from you that your problem has been resolved. We have closed the issue PDFNET-51755.

Could you please share some more detail of your requirement along with input and expected output files?

spontes · May 30, 2022, 8:18am

example.zip (356.1 KB)

Please find attached an input file and some output files that I have manually created.
I would like to get that cropped region in individual files if possible.
I thought of getting the original file tagged or bookmarked for the region I need (each question) then crop it and save it in a format that I could render somewhere else.
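
As a rough illustration of the "one file per cropped region" idea, the sketch below copies a page into a new document, sets its CropBox to one region, and saves it. Everything specific in it is an assumption: the file names, the page numbers, and the rectangle values are hypothetical, and the regions would have to come from wherever the question boundaries are recorded (tags, bookmarks, or manual measurement).

```csharp
using Aspose.Pdf;

class SplitRegionsSketch
{
    static void Main()
    {
        // Placeholder input path (assumption, not from the thread).
        var source = new Document("input.pdf");

        // Hypothetical regions, one per question: page number and box.
        int[] pageNumbers = { 1, 1 };
        Rectangle[] boxes =
        {
            new Rectangle(50, 470, 550, 790), // upper part of page 1
            new Rectangle(50, 139, 550, 470)  // lower part of page 1
        };

        for (int i = 0; i < boxes.Length; i++)
        {
            using (var single = new Document())
            {
                // Copy the source page into the new one-page document.
                single.Pages.Add(source.Pages[pageNumbers[i]]);

                // Show only the region belonging to this question.
                single.Pages[1].CropBox = boxes[i];

                single.Save($"question_{i + 1}.pdf");
            }
        }
    }
}
```

Note that a crop box only changes what is displayed; the content outside it is still stored in the file. Each one-page document could then be converted to HTML or SVG with the same Save call used earlier in this thread.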

Any help would be very much appreciated. Thanks

tahir.manzoor · May 30, 2022, 4:36pm

@spontes

We have logged a ticket in our issue tracking system as PDFNET-51869 for your case. We will inform you via this forum thread once there is an update available on it.

We apologize for your inconvenience.