Hi Team,
We are extracting images from documents using Aspose.Words for Java. We have run into a new scenario with one document: the images are extracted, but the generated PDF files contain only empty pages.
My code:
static String matches = "Fig.*(?:[ \\r\\n\\t].*)+|Scheme.*|Plate.*|Abbildung.*|Fig.*(?:[ \\r\\n\\t]*)+";
private static org.apache.logging.log4j.Logger logger = LogManager.getLogger(FixedGraphic.class);
static int count = 1;
static Resultjson rs;
public static void fixedImage(Document interimdoc) throws Exception {
    // pdfFolder and AIE.docName are defined elsewhere in our code (not shown).
    String pdf;
    NodeCollection shapes = interimdoc.getChildNodes(NodeType.SHAPE, true);
    LayoutCollector collector = new LayoutCollector(interimdoc);
    int imageIndex = 1;
    for (Shape shape : (Iterable<Shape>) shapes)
    {
        // Text of the node just before the table that contains the shape.
        // Stays "NoMatch" if that lookup fails (for example, the shape is not inside a table).
        String text = "NoMatch";
        try {
            text = shape.getParentParagraph().getAncestor(NodeType.TABLE).getPreviousSibling().toString(SaveFormat.TEXT);
        }
        catch (Exception e) {
            logger.info(e.getMessage());
        }
        try {
            if (shape.hasImage() && !text.contains(AIE.docName))
            {
                String imgName = "FX" + imageIndex;
                pdf = pdfFolder + imgName + ".pdf";
                // Shallow clone: a copy of the document node without its child nodes.
                Document itermDoc = (Document) interimdoc.deepClone(false);
                // Import the shape's section without its children, then let
                // ensureMinimum() add an empty body and paragraph.
                itermDoc.appendChild(itermDoc.importNode(
                        shape.getAncestor(NodeType.SECTION),
                        false,
                        ImportFormatMode.USE_DESTINATION_STYLES));
                itermDoc.ensureMinimum();
                // Put a copy of the shape into the first paragraph of the new document.
                Node importedShape = itermDoc.importNode(shape, true, ImportFormatMode.USE_DESTINATION_STYLES);
                itermDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);
                // Save as PDF.
                itermDoc.save(pdf);
                imageIndex++;
            }
        }
        catch (Exception e) {
            logger.info(e.getMessage());
        }
    }
}
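In case it is relevant, the caption pattern in `matches` can be checked on its own, outside Aspose. The sketch below is a minimal standalone check, assuming the pattern is applied with full-string matching (as `String.matches` does). The class name `CaptionPatternDemo`, the helper `isCaption`, and the sample captions are made up for illustration and are not part of our project.

```java
import java.util.regex.Pattern;

// Hypothetical standalone demo; not part of the project code above.
class CaptionPatternDemo {
    // Same pattern string as the `matches` field in the post.
    static final String MATCHES =
        "Fig.*(?:[ \\r\\n\\t].*)+|Scheme.*|Plate.*|Abbildung.*|Fig.*(?:[ \\r\\n\\t]*)+";

    private static final Pattern CAPTION = Pattern.compile(MATCHES);

    // Assumption: the whole string must match, as with String.matches(...).
    static boolean isCaption(String text) {
        return CAPTION.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Illustrative sample captions (invented for this check).
        String[] samples = {
            "Fig. 1 Overview of the method",
            "Fig.1",
            "Scheme 2",
            "Abbildung 3",
            "Table 1",
            "Fig. 1 first line\nsecond line",
            "Scheme 1\nsecond line"
        };
        for (String s : samples) {
            // Show line breaks as a visible \n so each result stays on one line.
            System.out.println(isCaption(s) + " <- " + s.replace("\n", "\\n"));
        }
    }
}
```

Because `.` does not match line breaks by default, only the `Fig` alternatives can run over several lines (through the `[ \r\n\t]` class); `Scheme`, `Plate` and `Abbildung` captions have to fit on a single line to match.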
Input doc: davids_et_al_2022_05.docx (3.3 MB)
My output: FX1.pdf (38.7 KB)
Could you please look into this and suggest a fix? Thanks.