How to extract a complete test question containing pictures

feiwang · May 28, 2022, 4:07am

I have a test paper document of this type, which contains test questions. I want to extract the questions and save them to a database, but I can't work out which pictures belong to which question. How can I extract a complete test question, including both the question stem and its pictures? Thank you!

PDFToDOC_out - 副本.docx (247.0 KB)

alexey.noskov · May 28, 2022, 4:46am

@feiwang As far as I can see, your document was generated by Aspose.PDF through a PDF to DOCX conversion. Unfortunately, such documents are quite difficult to analyze because all the content is floating (represented by frames), so you cannot rely on the flow node order. Also, all shapes in your document are placed in a single paragraph at the end of each page, so there is no way to associate them with the content.

I think it would be easier to extract the required content directly from the source PDF document using Aspose.PDF.
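Once text fragments and images have been read from the PDF together with their page number and vertical position (Aspose.PDF exposes text and image absorbers for this, though the exact calls depend on the platform you use), the association itself can be done by position. The following is only a rough sketch of that grouping step in plain Python, not Aspose code: the `Block` and `Question` types, the `top` coordinate (distance from the top of the page), and the assumption that every question starts with a number followed by a dot are all hypothetical and would need adapting to the real document.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Block:
    """Hypothetical container for one item extracted from the PDF."""
    page: int      # 1-based page number
    top: float     # assumed: distance from the top of the page (smaller = higher)
    kind: str      # "text" or "image"
    content: str   # the text itself, or a file name / key for the image

@dataclass
class Question:
    number: int
    stem_parts: list = field(default_factory=list)
    images: list = field(default_factory=list)

# Assumption: a question starts with "1.", "2、" and so on.
# The lookahead keeps decimals such as "2.5 cm" from being read as question 2.
QUESTION_START = re.compile(r"^\s*(\d+)\s*[.、．](?!\d)")

def group_questions(blocks):
    """Attach each text line and image to the nearest question start above it."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    questions = []
    current = None
    for block in ordered:
        if block.kind == "text":
            match = QUESTION_START.match(block.content)
            if match:
                current = Question(number=int(match.group(1)))
                questions.append(current)
            if current is not None:
                current.stem_parts.append(block.content)
        elif block.kind == "image" and current is not None:
            current.images.append(block.content)
    return questions

if __name__ == "__main__":
    sample = [
        Block(1, 100, "text", "1. What is shown in the figure?"),
        Block(1, 150, "image", "img_a.png"),
        Block(1, 250, "text", "2. Compute the area."),
    ]
    for q in group_questions(sample):
        print(q.number, q.stem_parts, q.images)
```

Reading order is reconstructed here purely from page and vertical position, so multi-column layouts or pictures placed beside the text rather than below it would need an additional horizontal check.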

feiwang · May 28, 2022, 4:50am

OK, I see. I will try parsing the PDF directly. Thank you very much for your reply.