Unable to extract JavaScript actions from PDF

Hi There,

We were able to retrieve the JavaScript from a PDF using the code snippet below, and it used to work perfectly. However, recently we encountered an issue: when processing multiple files together, their PDF contents would become intermixed. Because of this, we stopped using this method for extracting JavaScript.

 private void handleFacade(SanitizationRequest request, CleaningResponse response, CleanOptions ops) {
        String PreFacadeTmp = "PreFacadeTmp.pdf";
        Long startTime,endTime;
        startTime = System.nanoTime();
        try {
            Files.copy(Paths.get(request.GetOutputPath()), Paths.get(PreFacadeTmp), StandardCopyOption.REPLACE_EXISTING);
        } catch (Exception e) {
            logger.warning("Failed to copy to temp file");
        }

        PdfJavaScriptStripper stripper = new PdfJavaScriptStripper();
        stripper.strip(PreFacadeTmp, request.GetOutputPath());

        try {
            Files.deleteIfExists(Paths.get(PreFacadeTmp));
        } catch (Exception e) {
            logger.warning("Facade temp file not found for deletion.");
        }
        endTime = System.nanoTime();
        response.setPartResult(SanitizablePart.JavaScript, PartResult.Extracted);
        response.SetExecutionTime(SanitizablePart.JavaScript, (endTime - startTime));
    }

Since the above code snippet was causing issues, I tried a different approach to collect the document actions and then check whether they were JavaScript actions.

private void processJsActions(List<JavascriptAction> jsActionsList, CleaningResponse response, Boolean shouldExtract) {
        Long startTime,endTime;
        for (JavascriptAction jsAction : jsActionsList) {
            startTime = System.nanoTime();
            if (jsAction == null) {
                logger.info("Javascript-action is empty");
                continue;
            }

            if (shouldExtract) {
                try {
                    // Setting the script to null effectively removes the malicious payload.
                    jsAction.setScript(null);
                } catch (Exception e) {
                    logger.warning("Failed to set script: " + e.getMessage());
                }
                response.setPartResult(SanitizablePart.JavaScript, PartResult.Extracted);
            } else {
                response.setPartResult(SanitizablePart.JavaScript, PartResult.Analyzed);
            }
            endTime = System.nanoTime();
            response.SetExecutionTime(SanitizablePart.JavaScript, (endTime - startTime));
        }
    }

    private void collectPageAndAnnotationActions(Document doc, List<JavascriptAction> jsActionsList) {
        for (Page page : doc.getPages()) {
            // Check Page actions (OnOpen, OnClose)
            if (page.getActions() != null) {
                if (page.getActions().getOnOpen() instanceof JavascriptAction) {
                    jsActionsList.add((JavascriptAction) page.getActions().getOnOpen());
                }
                if (page.getActions().getOnClose() instanceof JavascriptAction) {
                    jsActionsList.add((JavascriptAction) page.getActions().getOnClose());
                }
            }
            for (Annotation annotation : page.getAnnotations())  {
                for (PdfAction action : annotation.getPdfActions()) {
                    if (action instanceof JavascriptAction) {
                        jsActionsList.add((JavascriptAction) action);
                    }
                }
            }
        }
    }

    private void collectNamedDocumentScripts(Document doc, List<JavascriptAction> jsActionsList) {
        // Iterate through the keys (script names) in the JavaScript collection
        for (String scriptName : doc.getJavaScript().getKeys()) {
            String scriptContent = doc.getJavaScript().get_Item(scriptName);

            if (scriptContent != null && !scriptContent.isEmpty()) {
                logger.info("Found named document JavaScript: " + scriptName);
            }
        }
    }

    private void collectDocumentActions(Document doc, List<JavascriptAction> jsActionsList) {
        DocumentActionCollection docActions = doc.getActions();

        List<PdfAction> actions = Stream.of(docActions.getBeforeSaving(),
                        docActions.getAfterSaving(),
                        docActions.getBeforePrinting(),
                        docActions.getAfterPrinting(),
                        docActions.getBeforeClosing()) // Added BeforeClosing
                .filter(Objects::nonNull)
                .collect(Collectors.toList());

        IAppointment openAction = doc.getOpenAction();
        if (openAction instanceof PdfAction) {
            actions.add((PdfAction) openAction);
        }

        for (PdfAction action : actions) {
            if (action instanceof JavascriptAction) {
                jsActionsList.add((JavascriptAction) action);
            }
        }
    }

    public void handleJavaScript(Document doc, CleaningResponse response, Boolean shouldExtract) {
        List<JavascriptAction> allJsActions = new ArrayList<>();

        //Collect Document-Level JavaScript Actions (Actions collection and OpenAction)
        collectDocumentActions(doc, allJsActions);

        //Collect Document-Level Named JavaScripts (JavaScript dictionary)
        collectNamedDocumentScripts(doc, allJsActions);

        //Collect Page-Level Actions and Annotation Actions
        collectPageAndAnnotationActions(doc, allJsActions);

        // Process the collected JavaScript actions
        if (allJsActions.isEmpty()) {
            logger.info("No embedded JavaScript found");
            return;
        }

        processJsActions(allJsActions, response, shouldExtract);
    }

The problem with the new approach is that it does not detect JavaScript in a PDF even when it is present. For example, I tested several documents that contain JavaScript; the old code snippet was able to detect it, but the newer approach could not.

Can you guide us on the best way to reliably retrieve JavaScript from PDF documents? Please let us know a method that works consistently.
EmbededJavaScript.pdf (851 Bytes)

JavaScriptAdded_output.pdf (30.3 KB)

@vivekk930703

Summary

You can reliably extract (and optionally remove) JavaScript from a PDF with Aspose.PDF by iterating over all places where JavaScript can be stored:

  1. Document‑level JavaScript dictionary (doc.JavaScript).
  2. Document‑level actions (doc.Actions and doc.OpenAction).
  3. Page‑level actions (page.Actions).
  4. Annotation actions (annotation.PdfActions).

Collect each JavascriptAction, read its script, and, if you need to strip it, set the script to null (or replace it with an empty string). Use a new Document instance for each file; do not reuse the same instance or static helpers, otherwise data from previous files can leak into the next one.

Why the original approaches failed

  • In the first snippet, every call to handleFacade copies its input to the same fixed path (PreFacadeTmp.pdf) before running PdfJavaScriptStripper. When several files are processed at once, one call can overwrite or delete the temp file while another call is still reading it, so the stripper ends up working on the wrong document. That shared file name is the most likely cause of the intermixing you observed. A failed copy is also only logged, so processing continues with whatever is at that path.
  • The second approach missed some JavaScript because named scripts never reach the list. Entries in the JavaScript name dictionary (doc.JavaScript) are plain strings, not JavascriptAction objects, and collectNamedDocumentScripts only logs them without adding anything to jsActionsList. A document whose only script is a named one therefore takes the early "no JavaScript found" return in handleJavaScript.
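
The temp-file collision described above can be avoided by giving each call its own file. Below is a minimal Java sketch that uses only the standard library. TempCopy, PdfStep and withPrivateCopy are hypothetical names introduced for illustration; they are not part of Aspose.PDF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class TempCopy {
    private TempCopy() {}

    /** Hypothetical hook for whatever processes the temp copy. */
    @FunctionalInterface
    public interface PdfStep {
        void run(Path tempInput, Path output) throws IOException;
    }

    /**
     * Copies 'output' to a uniquely named temp file, runs 'step' on it,
     * and always deletes the temp file afterwards.
     */
    public static void withPrivateCopy(Path output, PdfStep step) throws IOException {
        // createTempFile picks a fresh name on every call, so two
        // concurrent calls never share a path.
        Path temp = Files.createTempFile("PreFacade-", ".pdf");
        try {
            Files.copy(output, temp, StandardCopyOption.REPLACE_EXISTING);
            step.run(temp, output);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

In handleFacade this could wrap the existing call, for example passing a step of `(in, out) -> stripper.strip(in.toString(), out.toString())`. Unlike the original snippet, a failed copy stops the call instead of letting the stripper run on a stale file.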

What to do next – reliable extraction code (Aspose.PDF for .NET)

using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Annotations; // JavascriptAction, PdfAction, Annotation, DocumentActionCollection

public class PdfJsExtractor
{
    public static List<JavascriptAction> CollectAllJavascript(Document pdf)
    {
        var jsList = new List<JavascriptAction>();

        // 1. Document-level named JavaScripts (JavaScript dictionary)
        if (pdf.JavaScript != null)
        {
            foreach (string name in pdf.JavaScript.Keys)
            {
                // The dictionary stores the script as a plain string, not a JavascriptAction.
                // If you need to treat it uniformly, wrap it in a dummy JavascriptAction.
                var dummy = new JavascriptAction(pdf.JavaScript[name]);
                jsList.Add(dummy);
            }
        }

        // 2. Document-level actions collection
        DocumentActionCollection docActs = pdf.Actions;
        AddIfJs(docActs?.BeforeSaving, jsList);
        AddIfJs(docActs?.AfterSaving, jsList);
        AddIfJs(docActs?.BeforePrinting, jsList);
        AddIfJs(docActs?.AfterPrinting, jsList);
        AddIfJs(docActs?.BeforeClosing, jsList);

        // OpenAction (can be a JavascriptAction)
        if (pdf.OpenAction is JavascriptAction openJs)
            jsList.Add(openJs);

        // 3. Page-level actions and annotation actions
        foreach (Page page in pdf.Pages)
        {
            // Page actions (OnOpen, OnClose)
            if (page.Actions?.OnOpen is JavascriptAction pageOpen)
                jsList.Add(pageOpen);
            if (page.Actions?.OnClose is JavascriptAction pageClose)
                jsList.Add(pageClose);

            // Annotation actions
            foreach (Annotation annot in page.Annotations)
            {
                foreach (PdfAction act in annot.PdfActions)
                {
                    if (act is JavascriptAction jsAct)
                        jsList.Add(jsAct);
                }
            }
        }

        return jsList;
    }

    private static void AddIfJs(PdfAction act, List<JavascriptAction> list)
    {
        if (act is JavascriptAction js)
            list.Add(js);
    }

    // Example usage
    public static void ExtractOrStrip(string inputPath, string outputPath, bool strip)
    {
        // Load a fresh Document for each file, with no shared static state.
        var pdf = new Document(inputPath);
        var allJs = CollectAllJavascript(pdf);

        foreach (var js in allJs)
        {
            // The script text (for named scripts we wrapped a dummy action)
            string script = js.Script;
            Console.WriteLine($"Found JavaScript: {script}\n");

            // Note: for a named script this only clears the detached wrapper,
            // not the entry stored in pdf.JavaScript.
            if (strip)
                js.Script = string.Empty; // or null
        }

        if (strip)
            pdf.Save(outputPath); // action scripts collected above are now empty
    }
}

Key points in the code

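  • CollectAllJavascript gathers scripts from the four locations listed in the summary. It does not explicitly visit form-field additional actions, outline (bookmark) actions, or actions chained through /Next, which the PDF format also allows to carry JavaScript.
  • Named JavaScripts are wrapped in a temporary JavascriptAction so the rest of the pipeline can treat them uniformly. The wrapper is detached from the document, so clearing its Script does not remove the named entry; that has to be done on doc.JavaScript itself.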
  • The strip flag demonstrates how to remove the script (js.Script = string.Empty).
  • A new Document instance is created for each file, eliminating the inter‑mixing problem.

Additional recommendations

  1. Upgrade to the latest Aspose.PDF version and re-test, in case the behaviour is caused by a defect that has since been fixed.
  2. Process files sequentially or use separate temporary file names if you must work in parallel; never reuse the same temp file (PreFacadeTmp.pdf).
  3. Log the source (document‑level, page‑level, annotation) for each script you find – this helps debugging and compliance reporting.
  4. If you only need the raw JavaScript strings (no need to modify the PDF), you can stop after the Console.WriteLine part and skip the pdf.Save call.
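
Recommendation 3 can be sketched in Java as a small inventory that records, for each script, where it was found and how to remove it at its origin. This is only an illustration under assumptions: ScriptInventory, Source and Entry are hypothetical names, and the remover callbacks stand in for whatever SDK call deletes the script at that location.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: one uniform record per script, whatever its origin. */
public final class ScriptInventory {

    public enum Source { NAMED_SCRIPT, DOCUMENT_ACTION, PAGE_ACTION, ANNOTATION_ACTION }

    public static final class Entry {
        public final Source source;
        public final String location; // e.g. a script name or "page 3"
        public final String script;
        private final Runnable remover; // deletes this script where it lives

        Entry(Source source, String location, String script, Runnable remover) {
            this.source = source;
            this.location = location;
            this.script = script;
            this.remover = remover;
        }

        @Override
        public String toString() {
            return source + " at " + location + " (" + script.length() + " chars)";
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void add(Source source, String location, String script, Runnable remover) {
        entries.add(new Entry(source, location, script, remover));
    }

    public List<Entry> entries() {
        return Collections.unmodifiableList(entries);
    }

    /** Runs every remover once and returns how many scripts were stripped. */
    public int stripAll() {
        for (Entry e : entries) {
            e.remover.run();
        }
        return entries.size();
    }
}
```

A JavascriptAction found on a page could be registered with a remover of `() -> jsAction.setScript(null)`, the same call processJsActions already uses, while a named script would get a remover that deletes its entry from doc.getJavaScript(). Logging each Entry then shows which location produced the script.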

What to do next

  • Replace the old PdfJavaScriptStripper workflow with the ExtractOrStrip method above.
  • Verify on a few PDFs that contain:
      • Named JavaScripts (e.g., /JavaScript << /JS (alert('hi')) >>).
      • Document actions (/OpenAction).
      • Page actions (/AA << /O << /S /JavaScript /JS (… ) >> >>).
  • If any script is still missing, work out which PDF object holds it so the gap can be pinned to a specific location. For small, uncompressed files, searching the raw file for /JS or /JavaScript often shows this.
  • Report any remaining gaps to Aspose support, providing the PDF sample and the Aspose.PDF version you are using.
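
As an SDK-independent cross-check for the verification steps above, the raw bytes of a PDF can be searched for the /JS and /JavaScript name tokens. The Java sketch below uses only the standard library; RawJsScan and its methods are hypothetical names. It is a heuristic, not a parser: it will miss scripts inside compressed object streams or names written with # escapes, and it can report a match for a token that merely appears in page content.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public final class RawJsScan {
    private RawJsScan() {}

    /** True if 'c' ends a PDF name: PDF whitespace or a delimiter character. */
    public static boolean isNameTerminator(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' '
                || "()<>[]{}/%".indexOf(c) >= 0;
    }

    /** Counts whole occurrences of a name token such as "/JS" (so "/JSON" is not counted). */
    public static int countName(String pdfText, String name) {
        int count = 0;
        int from = 0;
        while ((from = pdfText.indexOf(name, from)) >= 0) {
            int end = from + name.length();
            if (end == pdfText.length() || isNameTerminator(pdfText.charAt(end))) {
                count++;
            }
            from = end;
        }
        return count;
    }

    /** Heuristic check: does the raw file mention /JS or /JavaScript anywhere? */
    public static boolean mentionsJavaScript(Path pdf) throws IOException {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is preserved.
        String text = new String(Files.readAllBytes(pdf), StandardCharsets.ISO_8859_1);
        return countName(text, "/JS") > 0 || countName(text, "/JavaScript") > 0;
    }
}
```

If this scan reports a match for a file where the SDK-based collection finds nothing, the file is a good sample to attach to a support report.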

Warning: I could not fully verify this explanation against the retrieved sources. Please double-check the behaviour in your environment and confirm the exact SDK/version.

SDK version unclear — refer to the official release notes before applying this guidance.

What is wrong with this approach?

I am already doing everything recommended above, but it still does not detect the JavaScript. You can check this by running it against the attachment.

I have used Aspose.PDF 23.8 and 24.12, and the problem exists in both versions. Am I missing something?