Detecting zero days in software supply chain with static and dynamic analysis

Posted by Ajin Abraham on Jan 25 2021

TL;DR

This post shares some ideas for detecting zero-days in the software supply chain before they get flagged by your typical Software Composition Analysis (SCA) or dependency checking tools. It also shares proof-of-concept code that detects malicious behavior using static and dynamic analysis of third-party dependencies before the build process in CI/CD pipelines.

Introduction

The recent SolarWinds fiasco showed us how tech giants with solid, mature security programs were targeted through a software supply chain attack. Supply chain attacks have a wide scope, but I will focus on the Application Security aspect of it, involving code and data. As DevOps matures, we see a lot of code being built and published automatically by CI/CD solutions, both on-premise and via SaaS. Most of the effort goes into protecting the production systems where this code runs, while build systems get little or moderate attention, or we offload that responsibility to the service provider. Within the build pipeline, we use security tools that perform Software Composition Analysis or dependency checking to detect outdated packages, abandoned packages, packages with known vulnerabilities, and so on. There is an OWASP category for this called A9:2017-Using Components with Known Vulnerabilities.

What about the packages with unknown vulnerabilities, the ones that got backdoored recently, or the ones with a zero-day? Most existing tools work by keeping a database of known vulnerabilities in packages, and your security depends on how up to date this database is. In the case of a zero-day, it is already too late to defend against the harm, as some of these databases are updated only after the information is publicly available. In this post, we will discuss some ideas to proactively detect previously unknown malicious behavior in third-party dependencies or malicious dependencies that take advantage of typosquatting.

Malicious Packages in CI/CD Pipelines

Build systems are usually dynamically created and destroyed during the process of building and deploying or publishing assets. An attacker's point of entry is when the build system tries to set up or install an attacker-controlled malicious package.

If the malicious package gets installed during the build, an attacker can perform some of these activities in the context of the build system:

  1.  Steal code and any hardcoded sensitive data along with it.
  2.  Plant a backdoor in code to be used after the code is deployed to the production environment.
  3.  Steal compute resources like CPU, RAM, etc. for activities like crypto mining.
  4.  Steal environment variables, sensitive files, credentials, certificates, etc.
  5.  Perform lateral movement and privilege escalation with the data collected. 

Since we cannot cover everything in this blog, I will focus on the environment variables aspect. All the code shared below is just a proof of concept and not production-ready.

Stealing Environment Variables

The ideas discussed here apply across platforms and programming languages, but the examples shared will focus on Python packages. I have created a malicious Python package called poc-rogue that tries various methods to steal environment variables and sends that data to localhost:1337 when you try to install it. Let's take a look at some of the approaches to get environment variables from a Python program:

Use python API os.environ

return os.environ

Run env command

subprocess.check_output(['env'])

Run shell's built-in set command

subprocess.check_output(['sh', '-c', 'set'])

Read environ from the proc pseudo file-system (/proc/<pid>/environ)

loc = Path('/proc') / str(os.getpid()) / 'environ'
return loc.read_text()

Read files that can contain environment variables

from pathlib import Path

data = []
commons = {
    '/etc/environment', '/etc/profile', '/etc/bashrc',
    '~/.bash_profile', '~/.bashrc', '~/.profile',
    '~/.cshrc', '~/.zshrc', '~/.tcshrc',
}
for i in commons:
    try:
        env = Path(i).expanduser().read_text()
    except OSError:
        # Not every file exists (or is readable) on every system
        continue
    data.append(env)

Access environ pointer from libc.so shared library

libc = ctypes.CDLL(None)
environ = ctypes.POINTER(ctypes.c_char_p).in_dll(libc, 'environ')
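
The environ pointer obtained above is a NULL-terminated array of KEY=value C strings. As a minimal sketch of how such a pointer could be walked (the function name environ_via_libc is my own for illustration, and it assumes a Linux-like system where ctypes.CDLL(None) exposes the libc environ symbol):

```python
import ctypes
import os


def environ_via_libc():
    """Read environment variables by walking libc's `char **environ`."""
    libc = ctypes.CDLL(None)
    environ = ctypes.POINTER(ctypes.c_char_p).in_dll(libc, 'environ')
    result = {}
    index = 0
    # The array ends with a NULL pointer, which ctypes exposes as None.
    while environ[index] is not None:
        key, _, value = environ[index].partition(b'=')
        result[os.fsdecode(key)] = os.fsdecode(value)
        index += 1
    return result
```

Note that nothing here opens a file or spawns a process; the data is read straight out of the process's own memory.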

These are some of the common ways to access environment variables. There are other ways available as well but for simplicity let's stick with these.

Static Analysis to Detect Malicious Python Packages

Let's do some static analysis to detect malicious packages. We will use semgrep for static analysis. All the code used here is available at package_scan repo. The following are the semgrep static analysis rules to detect access of environment variables.
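
The actual semgrep rules live in the package_scan repo. To illustrate the kind of pattern such a rule matches, here is a rough stand-in written with Python's ast module rather than semgrep. It is only a sketch: the names SUSPICIOUS_CALLS, dotted_name, and find_env_access are hypothetical, and the list of flagged calls is my own choice, not the ruleset from the repo.

```python
import ast

# Hypothetical list of calls worth flagging; not taken from package_scan.
SUSPICIOUS_CALLS = {
    'subprocess.check_output', 'subprocess.call', 'subprocess.run',
    'subprocess.Popen', 'os.system', 'os.getenv',
}


def dotted_name(node):
    """Return 'a.b.c' for nested Name/Attribute nodes, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return '.'.join(reversed(parts))
    return None


def find_env_access(source):
    """Return sorted (line, description) pairs for suspicious code."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and dotted_name(node) == 'os.environ':
            findings.append((node.lineno, 'os.environ access'))
        elif isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, f'call to {name}'))
    return sorted(findings)
```

A matcher this simple is easy to evade, for example with `from os import environ` or by building the attribute name at runtime, which is exactly the kind of limitation static analysis runs into.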

Semgrep rule syntax is very easy to follow as it uses the programming language's syntax to define rules. To better understand semgrep syntax, take a look at https://semgrep.dev/docs/. The package_scan repo has a file requirements.txt that defines the packages against which we need to run the static analysis.

rsa>=4.7
biplist>=1.0.3
bs4>=0.0.1
colorlog>=4.7.2
shelljob>=0.6
-e git://github.com/ajinabraham/poc-rogue.git#egg=rogue

Among other python packages, we have our poc-rogue package in the list as well. Let's use the python script static_analysis.py to perform static analysis.

Neat, we see issues flagged only from our poc-rogue package. This Python script downloads all the dependencies defined in requirements.txt, along with their sub-dependencies, as source code and runs semgrep against that code with the ruleset we have provided. Static analysis can help us detect the low-hanging fruit with ease. But static analysis has its limitations against obfuscated code, and there are many permutations of code and APIs that achieve the same logic, so writing detection rules for everything might not scale in the long run. Hence we also need dynamic analysis to be more precise with our detections.

Dynamic Analysis of Malicious Packages

For dynamic analysis, we can make use of syscalls, as they work at a lower level and are the same irrespective of your high-level programming language, be it Python, Node.js, etc. There are multiple ways to trace system calls, such as extended Berkeley Packet Filter (eBPF) probes, seccomp-bpf filters that use classic BPF, the ptrace API, etc. To keep things simple, we use the strace utility with the --seccomp-bpf flag, which uses seccomp-bpf filters to collect the system calls and the ptrace API to deeply inspect their arguments. Let's try to understand how syscalls can help us with dynamic analysis. If you look into the malicious code, some of it generates syscalls.

Executing Commands

Consider malicious code that executes certain commands to collect environment variables.

$ strace -f -e trace=execve -o strace python -c 'import subprocess;subprocess.call(["env"])'
$ cat strace
431765 execve("/home/ajin/package_scan/venv/bin/python", ["python", "-c", "import subprocess;subprocess.cal"...], 0x7ffee0ac8c48 /* 28 vars */) = 0
431766 execve("/home/ajin/package_scan/venv/bin/env", ["env"], 0x7fff8fa0b308 /* 28 vars */) = -1 ENOENT (No such file or directory)
431766 execve("/home/ajin/.local/bin/env", ["env"], 0x7fff8fa0b308 /* 28 vars */) = -1 ENOENT (No such file or directory)
431766 execve("/usr/local/sbin/env", ["env"], 0x7fff8fa0b308 /* 28 vars */) = -1 ENOENT (No such file or directory)
431766 execve("/usr/local/bin/env", ["env"], 0x7fff8fa0b308 /* 28 vars */) = -1 ENOENT (No such file or directory)
431766 execve("/usr/sbin/env", ["env"], 0x7fff8fa0b308 /* 28 vars */) = -1 ENOENT (No such file or directory)
431766 execve("/usr/bin/env", ["env"], 0x7fff8fa0b308 /* 28 vars */) = 0
431766 +++ exited with 0 +++
431765 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=431766, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
431765 +++ exited with 0 +++

I will explain the strace arguments later. But for now, whatever python API you use for command execution, it has to finally reach the execve() family of syscalls for executing the commands.
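
To make use of a trace like the one above, we need to pull the executed binaries out of it. Here is a minimal sketch, assuming the pid-prefixed line format that strace writes with -f and -o as shown above. The names EXECVE_RE and executed_binaries are hypothetical, and the regular expression is a simplification, not a full parser for strace output.

```python
import re

# <pid> execve("<path>", [...], <envp>) = <return code>
EXECVE_RE = re.compile(
    r'^(\d+)\s+execve\("((?:[^"\\]|\\.)*)",.*\)\s+=\s+(-?\d+)'
)


def executed_binaries(trace_lines):
    """Return (pid, path) for every execve() call that succeeded."""
    hits = []
    for line in trace_lines:
        match = EXECVE_RE.match(line)
        # Failed lookups along PATH return -1 ENOENT and are skipped.
        if match and int(match.group(3)) == 0:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Only successful calls are kept, so the failed PATH lookups for env in the trace above drop out and we are left with the python interpreter and /usr/bin/env.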

Opening Files

Consider the reading of sensitive files that hold environment variables.

$ strace -f -e trace=open,openat -o strace python -c 'from pathlib import Path; Path("~/.bashrc").expanduser().read_text()'
$ cat strace | grep bashrc
432709 openat(AT_FDCWD, "/home/ajin/.bashrc", O_RDONLY|O_CLOEXEC) = 3

Any file opening operation from standard python API is handled by the openat() or open() family of syscalls in the kernel.
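
Instead of grepping for one file name, the trace can be checked against a list of sensitive locations. The sketch below reuses the file names from the earlier section plus /proc/<pid>/environ. The names OPEN_RE, SENSITIVE_PATHS, SENSITIVE_NAMES, and sensitive_opens are hypothetical, and the line format assumed is the pid-prefixed one shown above.

```python
import re
from pathlib import PurePosixPath

# Matches both open("<path>", ...) and openat(<dirfd>, "<path>", ...)
OPEN_RE = re.compile(
    r'^(\d+)\s+open(?:at)?\((?:[^,"]+,\s+)?"((?:[^"\\]|\\.)*)"'
)
SENSITIVE_PATHS = {'/etc/environment', '/etc/profile', '/etc/bashrc'}
SENSITIVE_NAMES = {
    '.bash_profile', '.bashrc', '.profile', '.cshrc', '.zshrc', '.tcshrc',
}
PROC_ENVIRON_RE = re.compile(r'^/proc/(\d+|self)/environ$')


def sensitive_opens(trace_lines):
    """Return (pid, path) for opens of files likely to hold env vars."""
    hits = []
    for line in trace_lines:
        match = OPEN_RE.match(line)
        if not match:
            continue
        path = match.group(2)
        if (path in SENSITIVE_PATHS
                or PurePosixPath(path).name in SENSITIVE_NAMES
                or PROC_ENVIRON_RE.match(path)):
            hits.append((int(match.group(1)), path))
    return hits
```

Keep in mind that a shell spawned during installation may read some of these files for legitimate reasons, so a hit is a signal to investigate rather than proof of malice.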

Network Connections

Another example is to use syscalls to identify outbound network connections used by attackers to exfiltrate data.

$ strace -f -e trace=connect -o strace python -c 'import urllib.request;urllib.request.urlopen("http://python.org/")'
$ cat strace | grep 'htons(80)'
435764 connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("45.55.99.72")}, 16) = 0

You can see that for connecting to python.org, the connect() syscall was used. 
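
The destination address and port can be extracted from such lines in the same spirit. This is a minimal sketch that only understands IPv4 (AF_INET) connect() lines in the format shown above; AF_INET6 and AF_UNIX sockets are printed differently by strace and are simply ignored here. CONNECT_RE and outbound_connections are hypothetical names.

```python
import re

CONNECT_RE = re.compile(
    r'^(\d+)\s+connect\(\d+,\s+\{sa_family=AF_INET,\s+'
    r'sin_port=htons\((\d+)\),\s+sin_addr=inet_addr\("([0-9.]+)"\)\}'
)


def outbound_connections(trace_lines):
    """Return (pid, ip, port) for every IPv4 connect() in the trace."""
    hits = []
    for line in trace_lines:
        match = CONNECT_RE.match(line)
        if match:
            pid, port, ip = match.groups()
            hits.append((int(pid), ip, int(port)))
    return hits
```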

Implementing Dynamic Analysis

So with this knowledge, we can use strace to capture all the sensitive system calls invoked when a package is installed and look for patterns that correspond to malicious behavior. For dynamic analysis, we can use the strace command:

strace -s 2000 -fqqe trace=openat,execve,connect --seccomp-bpf <cmd>

where the arguments are explained below:

-s 2000
To increase the limit of printed strings to 2000 characters.
-f
To trace syscalls from forks.
-qq
Suppress messages about attaching, detaching, and process exit status.
-e trace=openat,execve,connect
Trace only openat, execve, and connect system calls.
--seccomp-bpf
Enable seccomp-bpf filtering to improve performance.


The <cmd> can be pip install <package_name>, npm install <pkg_name>, etc. or any other commands that perform a package installation. 
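
As a rough illustration of how the strace command could be wrapped around an install command from Python, consider the sketch below. It is not the code from dynamic_analysis.py; the names STRACE_FLAGS, build_strace_cmd, and trace_install are hypothetical, and I add -o here so the trace lands in a file instead of on stderr.

```python
import shutil
import subprocess

# Same flags as the strace command discussed above.
STRACE_FLAGS = [
    '-s', '2000', '-f', '-qq',
    '-e', 'trace=openat,execve,connect',
    '--seccomp-bpf',
]


def build_strace_cmd(install_cmd, out_file):
    """Prefix an install command (list of args) with the strace call."""
    return ['strace', *STRACE_FLAGS, '-o', out_file, *install_cmd]


def trace_install(install_cmd, out_file='strace.log'):
    """Run the install under strace and return the trace lines."""
    if shutil.which('strace') is None:
        raise RuntimeError('strace is not installed')
    # A failing install is still worth inspecting, so do not raise on it.
    subprocess.run(build_strace_cmd(install_cmd, out_file), check=False)
    with open(out_file) as handle:
        return handle.read().splitlines()
```

For example, trace_install(['pip', 'install', '<package_name>']) would return the raw trace lines, which can then be searched for suspicious execve, openat, and connect calls.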

You can run the Python script dynamic_analysis.py to perform dynamic analysis of the packages listed in the requirements.txt file. Please note that we perform dynamic analysis when the packages are installed and before they are actually used. It is fine to run this proof of concept locally, but when you perform dynamic analysis in the real world, it should only be done on an isolated virtual machine.

You can see that our package poc-rogue performed some malicious operations during installation and we have detected it with dynamic analysis using strace.

Some Caveats

Not everything in Python code generates a useful syscall. For example, accessing char **environ from libc through a foreign function interface like ctypes does not involve any syscall that we can easily trace with strace; it reads the environment variables straight from a memory location. To precisely detect those, you can use LD_PRELOAD to trace libc symbols and functions. Also, when you use pip to install a package, it connects to PyPI or GitHub servers to fetch the package, so you will have to whitelist those connections. In the real world, you need to be careful with whitelisting, as some of these entries can be abused for exfiltrating data.
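
One way the whitelisting of connections could be sketched is as a check of each destination against a set of allowed network ranges and ports. Everything concrete below is an assumption for illustration: the ranges are reserved documentation blocks used as placeholders, not real PyPI or GitHub addresses, and ALLOWED_NETWORKS, ALLOWED_PORTS, is_allowed, and unexpected_connections are hypothetical names.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real
# PyPI/GitHub addresses. Replace with ranges you have verified.
ALLOWED_NETWORKS = [
    ipaddress.ip_network('192.0.2.0/24'),
    ipaddress.ip_network('198.51.100.0/24'),
]
ALLOWED_PORTS = {443}


def is_allowed(ip, port):
    """True if the destination is inside an allowed range and port."""
    address = ipaddress.ip_address(ip)
    return port in ALLOWED_PORTS and any(
        address in network for network in ALLOWED_NETWORKS
    )


def unexpected_connections(connections):
    """Filter (ip, port) pairs down to the ones not on the whitelist."""
    return [(ip, port) for ip, port in connections if not is_allowed(ip, port)]
```

The narrower the ranges and ports, the less room an attacker has to hide exfiltration behind a whitelisted entry.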

Conclusion

We need to implement suitable security controls and processes in build systems with the same effort we put into production environments. The credentials available within these automated build systems do not have two-factor authentication enabled, which is by design, and that makes them a lucrative target for attackers. Real-world attacks are not always sophisticated; often they are simple hacks that target the weakest links in your environment. The intention of this post is to spread some awareness and share some ideas about the proactive security processes and tooling we need in this space.

Credits

Marc Tardif - An ex-colleague and Linux Internals expert for his insights on accessing environment variables from Memory.


  • Tags: 
  • supply chain attacks
  • supply chain
  • zero days
  • sca
  • software composition analysis
  • securing build pipeline
  • semgrep
  • strace
  • static analysis
  • dynamic analysis

Ajin Abraham


Ajin Abraham is a Security Engineer with 10+ years of experience in Application Security, Research and Engineering. He is passionate about building and maintaining open source security tools and communities. Some of his contributions to Hacker's arsenal include Mobile Security Framework (MobSF), nodejsscan, OWASP Xenotix, etc. Areas of interest include runtime security instrumentation, offensive security, web and mobile application security, code and architectural reviews, cloud-native runtime security, security tool development, security automation, breaking and fixing security products, reverse engineering, and exploit development.