Hello,
I need to extract questions from PDF files. Each question may or may not have pictures and/or tables associated with it.
I have tried most of the PDFtoHTML examples provided as well as some others.
My idea was to convert the PDF file to HTML and then split the resulting markup by question. However, I need to keep the original structure so that I can reconstruct each question afterwards without losing much of the formatting.
One of my main issues is that each PDF page has a border, which I am trying to remove so that it isn’t saved as an image, since I’m not going to use it afterwards. There is also a barcode at the bottom of each page.
I would like to know if there is a way of removing these first, so that I can then convert to HTML and get only the content of the page.
I have also tried using TextAbsorber with a rectangle so that only the body content of the page is extracted, but then I lose the images, because the only way I found to save the result was through Aspose.Words in HTML format.
I then added ImageAbsorber with ImagePlacements to bring the images back, but because the original position of each image is lost at that point, I have to place them again myself.
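The rectangle-based idea described above can be sketched independently of any PDF library. The following is only a minimal illustration in plain Python, not Aspose code: the `Item` type, the margin values, and the sample coordinates are all hypothetical. It assumes every text fragment and image comes with a bounding box in page coordinates, so that the border and barcode can be dropped while the remaining items keep their positions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (llx, lly, urx, ury) in PDF points; origin at the bottom-left of the page.
Rect = Tuple[float, float, float, float]


@dataclass
class Item:
    kind: str     # "text" or "image" (hypothetical labels)
    rect: Rect    # bounding box of the item on the page
    payload: str  # placeholder for the text or the image data


def body_rect(page: Rect, left: float, bottom: float,
              right: float, top: float) -> Rect:
    """Shrink the page rectangle by the given margins."""
    llx, lly, urx, ury = page
    return (llx + left, lly + bottom, urx - right, ury - top)


def inside(inner: Rect, outer: Rect) -> bool:
    """True if `inner` lies completely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def body_items(items: List[Item], body: Rect) -> List[Item]:
    """Keep items inside the body area, sorted top-to-bottom, left-to-right."""
    kept = [it for it in items if inside(it.rect, body)]
    return sorted(kept, key=lambda it: (-it.rect[3], it.rect[0]))


if __name__ == "__main__":
    page = (0.0, 0.0, 595.0, 842.0)  # A4 in points (assumed page size)
    # Assumed margins: a larger one at the bottom to exclude the barcode.
    body = body_rect(page, left=40, bottom=80, right=40, top=40)
    items = [
        Item("image", (100, 400, 300, 600), "figure"),
        Item("text", (50, 700, 500, 720), "Question 1"),
        Item("image", (0, 0, 595, 842), "page border"),
        Item("image", (200, 20, 400, 60), "barcode"),
    ]
    for it in body_items(items, body):
        print(it.kind, it.rect, it.payload)
```

Because each kept item still carries its rectangle, text and images can be interleaved in reading order without re-deriving the image positions later.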
It feels like I’m going around in circles, stripping everything out and then reconstructing the PDF contents. Is there a better way of doing this?
Another thing that concerns me is that, when I convert to HTML using the examples provided, a lot of the images inside the PDF lose parts, such as subtitles and arrows pointing to a position in the image. It doesn’t seem to help whether I choose to save them as PNG or SVG. Sometimes, with SVG, the saved image has the size of the whole page instead of only the area of the image.
Could you explain why this happens too?
Sorry for the long post. I am using .NET.
Any help would be appreciated. Thanks