Process protected/encrypted pdf file

zhilin39 · May 21, 2021, 6:19am

Hi,

A few things to clarify:

1. Does Aspose.PDF (.NET) support processing of protected/encrypted PDF files?
2. I have a protected PDF and I removed JavaScript from it using Aspose PDF C# (V21.5). It seems that after processing, the PDF has lost some information?

With regard to Q2, I have 2 protected/encrypted PDF samples from which I remove JavaScript before sending them to a scanning engine. According to the scanning engine’s result, one sample (3.pdf) was detected as protected/encrypted while the other one (1.pdf) was not.

It seems like some information was lost during the removal of JavaScript, but I’m not sure what information was removed. So I would like to know whether removing JavaScript from a PDF affects its protection/encryption.
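For a quick, library-independent check of whether a file still declares encryption after processing, one option is to look for the `/Encrypt` entry that the PDF format places in the trailer (or in the cross-reference stream dictionary). The Python sketch below is only a heuristic on the raw bytes. It is not how Aspose.PDF or the scanning engine detects encryption, and the function names are made up for illustration.

```python
import re

# Heuristic: an encrypted PDF points to its encryption dictionary through an
# /Encrypt key. The key is followed either by an indirect reference
# ("12 0 R") or by an inline dictionary ("<<").
_ENCRYPT_KEY = re.compile(rb"/Encrypt\s*(?:\d+\s+\d+\s+R|<<)")


def declares_encryption(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain an /Encrypt entry."""
    return _ENCRYPT_KEY.search(pdf_bytes) is not None


def file_declares_encryption(path: str) -> bool:
    """Convenience wrapper (hypothetical helper) that reads a file from disk."""
    with open(path, "rb") as fh:
        return declares_encryption(fh.read())
```

Running `file_declares_encryption` on each sample before and after JavaScript removal would show whether the `/Encrypt` entry itself survives. If it does, any difference is more likely to be inside the encryption dictionary than in the trailer.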

Here are my sample files and the program that I used for testing:
sample.zip (169.8 KB)

thanks

mudassir.fayyaz · May 21, 2021, 12:19pm

@zhilin39

I have noticed the same output for both files, 1.pdf and 3.pdf. Do you get different results on the console? If the output is different on your end, can you please share snapshots of your console output?

zhilin39 · May 24, 2021, 2:15am

Hi @mudassir.fayyaz ,

I think you didn’t get my question right. Just to emphasize, both samples should be protected, but the sample program using the Aspose PDF library shows that only 3.pdf is protected/encrypted and 1.pdf is not.

On top of that, I opened these files in a hex editor to compare their content:

Before removing javascript (1.pdf)
before_removing_js_1_pdf.PNG (56.4 KB)

After removing javascript (1.pdf)
after_removing_js_1_pdf.PNG (93.0 KB)

Before removing javascript (3.pdf)
before_removing_js_3_pdf.PNG (57.1 KB)

After removing javascript (3.pdf)
after_removing_js_3_pdf.PNG (87.6 KB)

scanning result (after removing javascript)
scanner_result.png (3.8 KB)

Based on the comparison above, it seems that removing JavaScript from the PDF has modified/damaged one of the encryption metadata objects. Hence my question for Aspose in the previous post.

I hope this information helps you understand my question and answer it. thanks

mudassir.fayyaz · May 24, 2021, 10:15am

@zhilin39

Thanks for the details and snapshots. They are helpful in understanding your scenario. However, can you please explain a bit more about the scanning engine you are referring to? Is it an internal system, or is it publicly accessible so that we can take a look at it while investigating this issue?

Please share some details about it so that we can help you further.

zhilin39 · May 25, 2021, 2:56am

Hi @mudassir.fayyaz ,

I don’t think the scanning engine is publicly accessible and I think I’m unable to share more details about it. Sorry about it. thanks

mudassir.fayyaz · May 25, 2021, 2:24pm

@zhilin39

Aspose.PDF mimics the behavior of Adobe Reader and follows its standards. We could not find any issue with the encryption of the 1.pdf file, so maybe the scanning engine needs to be checked. If there were any issue with the file, Adobe Reader would not show it as Protected/Encrypted.

zhilin39 · May 28, 2021, 3:26am

Hi,

May I know whether you have any issue analyzing 3.pdf (the one after removing JavaScript)? thanks

mudassir.fayyaz · May 28, 2021, 2:55pm

@zhilin39

A ticket with ID PDFNET-49978 has been created in our issue tracking system to further investigate the issue on our end. We will share our feedback soon.

zhilin39 · June 7, 2021, 7:58am

@mudassir.fayyaz,

In the sample project, I’m using 2 methods to remove JavaScript:

Function 1: Loop through JavaScript.Keys and remove each key
Function 2: Using PdfJavaScriptStripper.Strip

Is it possible to tell me how the PDF structure gets modified by these functions, and what the difference between them is?
thanks

mudassir.fayyaz · June 7, 2021, 4:21pm

@zhilin39

We will be able to share details once the issue is investigated. Please be patient and spare us some time.

zhilin39 · June 24, 2021, 2:21am

Hi @mudassir.fayyaz,

I have received additional information from the scanning vendor and hope it will be useful for your investigation:
The Dict/Value is a conceptual object inside the PDF’s metadata that represents a decryption key. (This refers to my previous post, where I opened the PDF in a hex editor.)

In case it is still unclear which part I’m referring to, I have provided one sample that indicates the Dict/Value (decryption key):
sample.png (9.7 KB)

The PDF spec describes several ways of storing encrypted data in order to protect the user’s data from rendering. The PDF’s data is indeed encrypted, but the PDF renderer decrypts it at runtime, so the document’s data is protected from editing (even though a save-as option is valid) but is still visible.

Comparing the PDF before and after removing JavaScript, some binary strings are re-encoded (probably converted to another encoding) during the cleansing process, so there is definitely a step that changes existing representations. This specific binary string is used for default decryption, which is probably what causes the issue in the several parsers they tried. The output is not valid under the PDF spec rules, since the string encoding is partial (some characters were not converted).

It would be good if removing JavaScript did not affect the encryption key. thanks
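To illustrate why a re-encoding on its own should be harmless while a partial one is not: the PDF format allows the same byte string to be written either as a literal string `(...)` with backslash escapes or as a hexadecimal string `<...>`, and both must decode to identical bytes. The Python sketch below decodes each form so that a value taken from the file before and after cleansing can be compared byte for byte. It assumes the string body has already been cut out of the file (for example, copied from the hex editor), and the function names are made up for illustration. It does not reproduce what Aspose.PDF does internally.

```python
# Sketch only: decode the two PDF string syntaxes to raw bytes so that
# differently written values can be compared. Function names are hypothetical.

_PDF_WHITESPACE = b"\x00\t\n\x0c\r "
_ESCAPES = {ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
            ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C}


def decode_literal(body: bytes) -> bytes:
    """Decode the inside of a literal string, i.e. the bytes between ( and )."""
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        c = body[i]
        i += 1
        if c == 0x0D:  # unescaped CR or CRLF is read as a single LF
            if i < n and body[i] == 0x0A:
                i += 1
            out.append(0x0A)
            continue
        if c != 0x5C:  # any other byte that is not a backslash is kept as-is
            out.append(c)
            continue
        if i >= n:  # lone trailing backslash: nothing left to escape
            break
        c = body[i]
        i += 1
        if c in _ESCAPES:
            out.append(_ESCAPES[c])
        elif 0x30 <= c <= 0x37:  # octal escape with 1 to 3 digits
            value, digits = c - 0x30, 1
            while digits < 3 and i < n and 0x30 <= body[i] <= 0x37:
                value = value * 8 + (body[i] - 0x30)
                i += 1
                digits += 1
            out.append(value & 0xFF)  # high-order overflow is ignored
        elif c == 0x0D:  # backslash + end of line = line continuation
            if i < n and body[i] == 0x0A:
                i += 1
        elif c == 0x0A:
            pass
        else:
            out.append(c)  # unknown escape: the backslash is dropped
    return bytes(out)


def decode_hex(body: bytes) -> bytes:
    """Decode the inside of a hexadecimal string, i.e. the bytes between < and >."""
    digits = body.translate(None, _PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits.decode("ascii"))
```

For example, `decode_literal(rb"\(\000\377")` and `decode_hex(b"2800FF")` yield the same three bytes. If the key-related strings in the encryption dictionary (for instance the `/O` and `/U` entries used by the standard security handler) decode to different bytes before and after JavaScript removal, the rewrite changed the value rather than just its notation.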

mudassir.fayyaz · June 24, 2021, 1:25pm

@zhilin39

We have noted your comments and will surely inform you as soon as the logged ticket is resolved.

zhilin39 · August 19, 2021, 3:30am

Hi @mudassir.fayyaz,

Any updates? This issue is affecting our users in production, and we hope it can be resolved as soon as possible.

thanks

mudassir.fayyaz · August 19, 2021, 9:37pm

@zhilin39

Please note that the ticket was logged under the free support model and will be investigated and resolved on a first come, first served basis. We will surely inform you as soon as we make some definite progress towards its resolution. Please be patient and spare us some time.