Hi there,
We were able to retrieve the JavaScript from a PDF using the code snippet below, and it used to work perfectly. However, recently we encountered an issue: when processing multiple files together, their PDF contents would become intermixed. Because of this, we stopped using this method for extracting JavaScript.
private void handleFacade(SanitizationRequest request, CleaningResponse response, CleanOptions ops) {
    // Fixed name: every call copies to and strips from this same temp file.
    String PreFacadeTmp = "PreFacadeTmp.pdf";
    long startTime, endTime;
    startTime = System.nanoTime();
    try {
        Files.copy(Paths.get(request.GetOutputPath()), Paths.get(PreFacadeTmp), StandardCopyOption.REPLACE_EXISTING);
    } catch (Exception e) {
        logger.warning("Failed to copy to temp file");
    }
    // Strip JavaScript from the temp copy and write the result back to the output path.
    PdfJavaScriptStripper stripper = new PdfJavaScriptStripper();
    stripper.strip(PreFacadeTmp, request.GetOutputPath());
    try {
        Files.deleteIfExists(Paths.get(PreFacadeTmp));
    } catch (Exception e) {
        logger.warning("Failed to delete Facade temp file.");
    }
    endTime = System.nanoTime();
    response.setPartResult(SanitizablePart.JavaScript, PartResult.Extracted);
    response.SetExecutionTime(SanitizablePart.JavaScript, (endTime - startTime));
}
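One possible explanation for the intermixing, assuming several files are sanitized concurrently in the same working directory, is that every call copies to the same `PreFacadeTmp.pdf`. Below is a minimal sketch of giving each call its own temp file. `TempFileStripDemo`, `Stripper`, and `stripInPlace` are hypothetical names, and `Stripper` only stands in for `PdfJavaScriptStripper.strip(input, output)`; I have not confirmed that this is the actual cause.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileStripDemo {

    /** Hypothetical stand-in for PdfJavaScriptStripper.strip(input, output). */
    public interface Stripper {
        void strip(String inputPath, String outputPath) throws IOException;
    }

    /**
     * Copies the output file to a uniquely named temp file, runs the stripper
     * from that copy back onto the output path, and always deletes the copy.
     */
    public static void stripInPlace(Path outputPath, Stripper stripper) throws IOException {
        // createTempFile picks a fresh name on every call, so parallel calls do not share a file.
        Path tmp = Files.createTempFile("PreFacadeTmp-", ".pdf");
        try {
            Files.copy(outputPath, tmp, StandardCopyOption.REPLACE_EXISTING);
            stripper.strip(tmp.toString(), outputPath.toString());
        } finally {
            // Runs even when the copy or the strip step throws.
            Files.deleteIfExists(tmp);
        }
    }
}
```

In `handleFacade` this would be called roughly as `stripInPlace(Paths.get(request.GetOutputPath()), (in, out) -> new PdfJavaScriptStripper().strip(in, out))`.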
Since the above code snippet was causing issues, I tried a different approach to collect the document actions and then check whether they were JavaScript actions.
private void processJsActions(List<JavascriptAction> jsActionsList, CleaningResponse response, Boolean shouldExtract) {
    long startTime, endTime;
    for (JavascriptAction jsAction : jsActionsList) {
        startTime = System.nanoTime();
        if (jsAction == null) {
            logger.info("JavaScript action is null; skipping");
            continue;
        }
        if (shouldExtract) {
            try {
                // Setting the script to null is intended to remove the malicious payload.
                jsAction.setScript(null);
            } catch (Exception e) {
                logger.warning("Failed to set script: " + e.getMessage());
            }
            response.setPartResult(SanitizablePart.JavaScript, PartResult.Extracted);
        } else {
            response.setPartResult(SanitizablePart.JavaScript, PartResult.Analyzed);
        }
        endTime = System.nanoTime();
        response.SetExecutionTime(SanitizablePart.JavaScript, (endTime - startTime));
    }
}
private void collectPageAndAnnotationActions(Document doc, List<JavascriptAction> jsActionsList) {
    for (Page page : doc.getPages()) {
        // Check Page actions (OnOpen, OnClose)
        if (page.getActions() != null) {
            if (page.getActions().getOnOpen() instanceof JavascriptAction) {
                jsActionsList.add((JavascriptAction) page.getActions().getOnOpen());
            }
            if (page.getActions().getOnClose() instanceof JavascriptAction) {
                jsActionsList.add((JavascriptAction) page.getActions().getOnClose());
            }
        }
        // Check the actions attached to each annotation on the page
        for (Annotation annotation : page.getAnnotations()) {
            for (PdfAction action : annotation.getPdfActions()) {
                if (action instanceof JavascriptAction) {
                    jsActionsList.add((JavascriptAction) action);
                }
            }
        }
    }
}
private void collectNamedDocumentScripts(Document doc, List<JavascriptAction> jsActionsList) {
    // Iterate through the keys (script names) in the JavaScript collection.
    // Named scripts are plain strings, not JavascriptAction objects, so they are
    // only logged here; nothing is added to jsActionsList.
    for (String scriptName : doc.getJavaScript().getKeys()) {
        String scriptContent = doc.getJavaScript().get_Item(scriptName);
        if (scriptContent != null && !scriptContent.isEmpty()) {
            logger.info("Found named document JavaScript: " + scriptName);
        }
    }
}
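Because named scripts are strings rather than `JavascriptAction` objects, the method above cannot add them to `jsActionsList`. A PDF whose only JavaScript lives in the named-script dictionary would therefore leave the list empty, which may be one reason the new approach reports nothing. Below is a sketch of recording the names separately. It uses a plain `Map<String, String>` as a stand-in for the collection returned by `doc.getJavaScript()`; `NamedScriptScan`, `findNamedScripts`, and `removeNamedScripts` are hypothetical names, and how removal maps onto the real collection is an assumption that needs checking against the Aspose.PDF API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NamedScriptScan {

    /** Returns the names of all non-empty scripts in the stand-in named-script map. */
    public static List<String> findNamedScripts(Map<String, String> namedScripts) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : namedScripts.entrySet()) {
            String content = entry.getValue();
            if (content != null && !content.isEmpty()) {
                found.add(entry.getKey());
            }
        }
        return found;
    }

    /** Removes the given names from the map and returns how many entries were removed. */
    public static int removeNamedScripts(Map<String, String> namedScripts, List<String> names) {
        int removed = 0;
        for (String name : names) {
            // Map.remove returns the previous value, which is non-null for every name found above.
            if (namedScripts.remove(name) != null) {
                removed++;
            }
        }
        return removed;
    }
}
```

With this, `handleJavaScript` could treat the document as containing JavaScript when either the action list or the list of script names is non-empty.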
private void collectDocumentActions(Document doc, List<JavascriptAction> jsActionsList) {
    DocumentActionCollection docActions = doc.getActions();
    List<PdfAction> actions = Stream.of(docActions.getBeforeSaving(),
                    docActions.getAfterSaving(),
                    docActions.getBeforePrinting(),
                    docActions.getAfterPrinting(),
                    docActions.getBeforeClosing()) // Added BeforeClosing
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
    IAppointment openAction = doc.getOpenAction();
    if (openAction instanceof PdfAction) {
        actions.add((PdfAction) openAction);
    }
    for (PdfAction action : actions) {
        if (action instanceof JavascriptAction) {
            jsActionsList.add((JavascriptAction) action);
        }
    }
}
public void handleJavaScript(Document doc, CleaningResponse response, Boolean shouldExtract) {
    List<JavascriptAction> allJsActions = new ArrayList<>();
    // Collect document-level JavaScript actions (Actions collection and OpenAction)
    collectDocumentActions(doc, allJsActions);
    // Collect document-level named JavaScripts (JavaScript dictionary)
    collectNamedDocumentScripts(doc, allJsActions);
    // Collect page-level actions and annotation actions
    collectPageAndAnnotationActions(doc, allJsActions);
    // Process the collected JavaScript actions
    if (allJsActions.isEmpty()) {
        logger.info("No embedded JavaScript found");
        return;
    }
    processJsActions(allJsActions, response, shouldExtract);
}
The problem with the new approach is that it does not detect JavaScript in a PDF even when it is present. For example, I tested several documents that contain JavaScript; the old code snippet was able to detect it, but the newer approach could not.
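To cross-check which documents actually contain JavaScript independently of either approach, a rough library-free scan of the raw file for the PDF names `/JS` and `/JavaScript` can help. This is only a triage sketch, not a parser: it assumes the names appear unescaped and outside compressed object streams, so a negative result proves nothing. `RawJsMarkerScan` and `containsJsMarker` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class RawJsMarkerScan {

    /** Returns true if the raw bytes contain the PDF name /JS or /JavaScript. */
    public static boolean containsJsMarker(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary content survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return hasName(raw, "/JavaScript") || hasName(raw, "/JS");
    }

    /** Finds the name as a complete token, so a longer name such as /JSON does not match /JS. */
    private static boolean hasName(String raw, String name) {
        int from = 0;
        while (true) {
            int idx = raw.indexOf(name, from);
            if (idx < 0) {
                return false;
            }
            int end = idx + name.length();
            if (end == raw.length() || isNameTerminator(raw.charAt(end))) {
                return true;
            }
            from = idx + 1;
        }
    }

    /** A PDF name ends at whitespace (including NUL) or at a delimiter character. */
    private static boolean isNameTerminator(char c) {
        return c == 0 || Character.isWhitespace(c) || "()<>[]{}/%".indexOf(c) >= 0;
    }
}
```

For example, `containsJsMarker(Files.readAllBytes(Paths.get("EmbededJavaScript.pdf")))` could be compared against what `handleJavaScript` reports for the same file.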
Can you guide us on the best way to reliably retrieve JavaScript from PDF documents? Please let us know a method that works consistently.
EmbededJavaScript.pdf (851 Bytes)
JavaScriptAdded_output.pdf (30.3 KB)