I need to extract questions from PDF files; each question may or may not have pictures and/or tables associated with it.
I have tried most of the PDFtoHTML examples provided as well as some others.
I thought of converting the PDF file to HTML so that I could then run through the markup and split it by question, but I need to keep the original structure so I can reconstruct each question afterwards without losing much of the formatting.
One of my main issues is that each PDF page has a border that I’m trying to remove so it isn’t saved as an image, since I’m not going to use it afterwards. There is also a barcode at the bottom of the page.
I would like to know if there is a way of removing these first, so that I can then convert to HTML and get only the content of the page.
I have also tried using TextAbsorber with a rectangle so that only the body content of the file was used, but then I lost the images, because the only way I could save the data was with Aspose.Words, saving in HTML format.
Then I added ImagePlacementAbsorber and its ImagePlacements to put my images back, but because I lose the reference to each image's position at the beginning, I need to add it again.
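For reference, here is a minimal sketch of the region-based extraction described above, not a definitive implementation. It assumes the Aspose.PDF for .NET package; the file name `input.pdf` is hypothetical, and the rectangle values are simply the ones I use for the body area. Each `ImagePlacement` carries its own `Rectangle`, so the position on the page can be kept alongside the image data.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class RegionExtractionSketch
{
    static void Main()
    {
        // Hypothetical input file; replace with a real path.
        using (var doc = new Document("input.pdf"))
        {
            // Body area of the page in PDF points (llx, lly, urx, ury).
            var body = new Aspose.Pdf.Rectangle(50, 139, 550, 790);
            Page page = doc.Pages[1]; // page indices are 1-based

            // Extract only the text that falls inside the body rectangle.
            var textAbsorber = new TextAbsorber();
            textAbsorber.TextSearchOptions.LimitToPageBounds = true;
            textAbsorber.TextSearchOptions.Rectangle = body;
            page.Accept(textAbsorber);
            Console.WriteLine(textAbsorber.Text);

            // Collect image placements and keep those fully inside the body.
            var imageAbsorber = new ImagePlacementAbsorber();
            page.Accept(imageAbsorber);
            foreach (ImagePlacement placement in imageAbsorber.ImagePlacements)
            {
                Aspose.Pdf.Rectangle r = placement.Rectangle;
                bool inside = r.LLX >= body.LLX && r.LLY >= body.LLY
                           && r.URX <= body.URX && r.URY <= body.URY;
                if (!inside) continue;

                // The rectangle records where the image sits on the page.
                Console.WriteLine($"Image at ({r.LLX}, {r.LLY}) - ({r.URX}, {r.URY})");
            }
        }
    }
}
```

This only covers raster images exposed as image placements; vector drawings such as arrows would presumably not be returned by this absorber.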
It seems that I’m going around in circles, stripping everything out and then reconstructing the PDF contents, and I thought there might be a better way of doing this.
Another thing that concerns me is that, when I convert to HTML using the examples provided, a lot of the images inside the PDF lose parts, such as subtitles and arrows pointing to a position in the image, and it doesn’t seem to help whether I choose to save as PNG or SVG. Sometimes, with SVG, an image is saved with the size of the whole page instead of only the area of the image.
Could you explain why this happens too?
Sorry for the long post. I am using .NET.
Any help would be appreciated. Thanks
I was able to find a way of doing this and was going to test it with other samples, but after I installed the new Visual Studio Community 2022 last week and replaced the license with a new temporary one, it stopped working.
Here is the error: System.InvalidOperationException: ‘Invalid font’
Here is the code, samples and desired output: files.7z (1.3 MB)
Could you tell me if this could be related to the update or to the different license?
Aspose.PDF was updated to 22.4.0 yesterday and the project was rebuilt, but it still doesn’t work.
Here is the code I use to select an area of the PDF document, then to convert it to HTML.
var pdfDocument = new Aspose.Pdf.Document(@"C:\Users\spontes\Desktop\9979_GBY11_Summer2016.pdf");
// Delete the first page
// Create the new box rectangle (llx, lly, urx, ury)
var newBox = new Aspose.Pdf.Rectangle(50, 139, 550, 790);
// Run through the pages (page indices are 1-based)
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    // Both boxes must be set inside the loop so that every page is cropped
    pdfDocument.Pages[i].CropBox = newBox;
    pdfDocument.Pages[i].ArtBox = newBox;
}
pdfDocument.Save(@"C:\Users\spontes\Desktop\crop_page_sample.pdf"); // just a guide
// Next, convert to HTML
// Save the file into MS document format
The “invalid font” error appears on the last step, when converting to HTML.
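The conversion step itself is not shown above. As a rough sketch of what such a step can look like (not my exact code), assuming Aspose.PDF for .NET and hypothetical file names, `HtmlSaveOptions` controls how pages and raster images are written:

```csharp
using Aspose.Pdf;

class CropToHtmlSketch
{
    static void Main()
    {
        // Hypothetical path to the already cropped PDF.
        using (var doc = new Document("crop_page_sample.pdf"))
        {
            var options = new HtmlSaveOptions
            {
                // Write one HTML file per PDF page.
                SplitIntoPages = true,
                // Keep CSS, fonts and images inside the HTML file itself.
                PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml,
                // Embed raster images as PNG data inside the generated SVG.
                RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsPngImagesEmbeddedIntoSvg
            };

            // Hypothetical output name.
            doc.Save("crop_page_sample.html", options);
        }
    }
}
```

The option values here are only one possible combination; which raster image mode gives the best result for a given file is something I would have to test.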
Just to let you know that I was able to fix this.
I changed the root of the documents and renamed everything, then it started to work again…not sure what happened.
Another question I have is this: I now get one SVG per converted page, but I was wondering whether it is possible to get it embedded in the selected region, instead of the PNG?
Or is it possible to get the extracted content in a more accessible way, i.e., that the images and/or the text can be scaled? Because sometimes we get an image in the middle of the page that gets converted to an SVG with the size of the page/region when we only need that middle part…
Please find attached an input file and some output files that I have manually created.
I would like to get that cropped region in individual files if possible.
I thought of getting the original file tagged or bookmarked for the region I need (each question) then crop it and save it in a format that I could render somewhere else.
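To illustrate the idea of one file per cropped region, here is a minimal sketch under stated assumptions: Aspose.PDF for .NET, a hypothetical `input.pdf`, and the same rectangle on every page (a real version would use one rectangle per question). Each page is copied into its own document and cropped there, so the output stays a scalable PDF rather than a raster image.

```csharp
using Aspose.Pdf;

class SplitCroppedPagesSketch
{
    static void Main()
    {
        // Hypothetical input file; replace with a real path.
        using (var source = new Document("input.pdf"))
        {
            // Region to keep, in PDF points (llx, lly, urx, ury).
            var region = new Aspose.Pdf.Rectangle(50, 139, 550, 790);

            for (int i = 1; i <= source.Pages.Count; i++) // 1-based pages
            {
                using (var single = new Document())
                {
                    // Copy the page into a new one-page document.
                    single.Pages.Add(source.Pages[i]);
                    // Restrict the visible area to the region.
                    single.Pages[1].CropBox = region;
                    // Hypothetical output naming scheme.
                    single.Save($"region_page_{i}.pdf");
                }
            }
        }
    }
}
```

Note that setting the crop box only limits what is displayed; as far as I understand, the content outside the region is still stored in the file.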