Overview of BPF in Seccomp
This article aims to give a comprehensive overview of BPF programs in seccomp and what they can do, as a companion to existing documentation. Seccomp (Secure Computing) is a somewhat popular security feature in the Linux kernel that allows a thread to restrict which syscalls it is allowed to make. For this article, I will assume you are familiar with the basic usage of and justification for seccomp; if you are not, I recommend the manpage and the kernel docs. In addition, there are already many good articles about how to use (and not use) seccomp for security, so this article primarily covers how seccomp filters themselves run. I will also not be discussing the security implications of BPF running in the kernel.
Operations dealing with seccomp are handled through a system call of its own, SYS_seccomp (nr. 0x13D on x86-64). The implementation lives in a single C file, seccomp.c. I will link to source lines in this file during the article as needed. The function run by the kernel when a syscall is made is __secure_computing. The entrypoint for this is here.
Seccomp Operations
The valid operations for the seccomp syscall are defined in this header. The SECCOMP_SET_MODE_STRICT and SECCOMP_SET_MODE_FILTER operations set the seccomp mode of the thread. Once a thread is set to strict mode, filters are not run and the thread cannot be changed into a more permissive state. Once installed, filters cannot be removed.
It is also possible to use prctl to set the seccomp mode. Since the implementation is effectively a thin wrapper around the same handler as the SYS_seccomp syscall, I will not discuss it further.
Strict Mode
Strict mode is the “original” seccomp mode, and restricts syscalls to read, write, exit, and sigreturn (possibly other arch-specific syscalls if CONFIG_COMPAT is defined, e.g. 32-bit equivalents). The arguments to the syscall are not checked. If the syscall is not in the above allowlist, the thread is killed.
Calling the seccomp syscall with SECCOMP_SET_MODE_STRICT and empty flags/arguments immediately* turns on strict mode. Here is an example of a C program using strict mode:
#include <unistd.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    printf("Installing seccomp\n");
    syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL);
    // write is on the strict mode allowlist
    printf("Writing to stdout is still okay...\n");
    struct stat unused;
    // fstat is not on the allowlist, so the thread is killed here
    fstat(0, &unused);
    return 0;
}
* unless PT_SUSPEND_SECCOMP is set
Filter Mode
Filters are classic BPF programs that can be installed with the SECCOMP_SET_MODE_FILTER operation. See the Filters section below for all the details. Note that each call to SECCOMP_SET_MODE_FILTER installs a new filter.
Querying feature info
There are two operations for querying the support of seccomp actions and struct sizes, SECCOMP_GET_ACTION_AVAIL and SECCOMP_GET_NOTIF_SIZES. Seccomp actions are potential return values of a seccomp filter that correspond to different kernel actions. These operations can be used by a program to determine if it can install a desired seccomp filter and have the return values behave as expected, and to determine the size of seccomp notification structures. (The details of the seccomp user notification feature and actions are in other areas of this document.)
Here is an example C program showing usage of SECCOMP_GET_ACTION_AVAIL (the defined actions are taken from here):
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif
const unsigned int actions[] = { SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_KILL_THREAD, SECCOMP_RET_KILL, SECCOMP_RET_TRAP, SECCOMP_RET_ERRNO, SECCOMP_RET_USER_NOTIF, SECCOMP_RET_TRACE, SECCOMP_RET_LOG, SECCOMP_RET_ALLOW, 0xdeadbeef };
const char* action_names[] = { "SECCOMP_RET_KILL_PROCESS", "SECCOMP_RET_KILL_THREAD", "SECCOMP_RET_KILL", "SECCOMP_RET_TRAP", "SECCOMP_RET_ERRNO", "SECCOMP_RET_USER_NOTIF", "SECCOMP_RET_TRACE", "SECCOMP_RET_LOG", "SECCOMP_RET_ALLOW", "Invalid action 0xdeadbeef" };
const size_t action_count = sizeof(actions) / sizeof(actions[0]);
int main(int argc, char* argv[]) {
    unsigned int action = 0;
    for(size_t i = 0; i < action_count; i++) {
        printf("%s: ", action_names[i]);
        action = actions[i];
        // Succeeds if the kernel supports the action, fails with EOPNOTSUPP if not
        int val = syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action);
        if(val < 0) {
            if(errno == EOPNOTSUPP) {
                printf("No\n");
            } else {
                printf("(unknown error)\n");
            }
        } else printf("Yes\n");
    }
}
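The other query operation, SECCOMP_GET_NOTIF_SIZES, can be exercised in a similar way. The following is a minimal sketch rather than a reference implementation: the struct name notif_sizes_view and the helper query_notif_sizes are made up for illustration, and the struct is assumed to mirror the three 16-bit fields of the kernel's struct seccomp_notif_sizes. The fallback define is only there in case the installed headers are too old.

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Fallback for old userspace headers (value taken from the kernel UAPI header). */
#ifndef SECCOMP_GET_NOTIF_SIZES
#define SECCOMP_GET_NOTIF_SIZES 3
#endif

/* Hypothetical local mirror of struct seccomp_notif_sizes (assumption:
 * three consecutive 16-bit fields, in this order). */
struct notif_sizes_view {
    uint16_t seccomp_notif;      /* sizeof(struct seccomp_notif) */
    uint16_t seccomp_notif_resp; /* sizeof(struct seccomp_notif_resp) */
    uint16_t seccomp_data;       /* sizeof(struct seccomp_data) */
};

/* Asks the running kernel for its notification struct sizes.
 * Returns 0 on success, or -1 with errno set (for example on an old kernel
 * or inside a sandbox that blocks the seccomp syscall). */
int query_notif_sizes(struct notif_sizes_view *out) {
    return (int)syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, out);
}
```

A caller would typically use the returned sizes to allocate its notification buffers instead of relying on sizeof from its own headers.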
Filters
Seccomp filters are generally used just to implement whitelists or blacklists of allowed syscalls, but as BPF programs they are able to perform quite powerful computation on input they are given. However, seccomp filters are not able to access the same functionality as modern eBPF (extended BPF) networking filters. They are allowed a subset of classic BPF (henceforth just BPF) instructions.
Filters are actually a tree structure where each filter is a node and a thread traverses its most recent filter (leaf) upwards through the tree. This means that new threads inherit filters from their creators but do not share new ones without the TSYNC flag.
Installing filters
To install a filter, pass the SECCOMP_SET_MODE_FILTER operation to the seccomp syscall along with flags and a struct sock_fprog program. The seccomp manpage describes some flags well, so I will not repeat them here, but there are some undocumented ones. The caller must either have the CAP_SYS_ADMIN capability or set NO_NEW_PRIVS before trying to install a filter, in order to avoid bypasses. The kernel has some checks to ensure a thread cannot get around this restriction.
The steps taken when installing a filter are:
- Verify the filter flag combinations make sense
- Prepare the filter, which involves:
  - Ensure NO_NEW_PRIVS or CAP_SYS_ADMIN
  - Copy the filter into kernel memory from userspace (allocating a new buffer)
  - Create the actual BPF program from the filter, using bpf_prog_create_from_user, with a custom handler. The kernel JIT-s the filter in this step.
- If the SECCOMP_FILTER_FLAG_NEW_LISTENER flag is set, allocate a new listener FD
- Try to attach the filter, which does the following
  - Assign the new seccomp mode - note that it appears to not be able to overwrite strict mode, as the mode is per thread and strict mode disallows
Undocumented filter flags
There are two filter flags not documented in the syscall manpage:
SECCOMP_FILTER_FLAG_TSYNC_ESRCH tells the kernel to return ESRCH on a TSYNC (thread sync) error to avoid conflicts with a returned file descriptor from SECCOMP_FILTER_FLAG_NEW_LISTENER.
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV is documented in the kernel API docs and has to do with the user notifications feature:
Alternatively, at filter installation time, the SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV flag can be set. This flag makes it such that when a user notification is received by the supervisor, the notifying process will ignore non-fatal signals until the response is sent. Signals that are sent prior to the notification being received by userspace are handled normally.
In other words, this helps prevent notifications from being sent and then becoming invalid before a response is sent.
Filter limits
There are two limits on the size of the filters that can be installed on a thread. The first is the BPF_MAXINSNS limit defined in general for untrusted BPF programs. Each filter can contain this many instructions at maximum. The second is MAX_INSNS_PER_PATH, defined as the maximum sum of all instructions on a thread’s path in its filter tree. As the manpage mentions, an additional 4 instructions per filter are counted towards this limit; however, this penalty is not applied to the filter being attached. Separately, every filter carries a 4 instruction overhead due to a prologue added by the cBPF filter load code. In other words, each already-installed filter counts as its length plus 8, and the filter being attached counts as its length plus 4.
Here is a demonstration of the per-filter limit:
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
struct sock_filter filter_prog[4097] = { 0 };
int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    // Fill the filter with 4095 copies of "ld [0]" (0x20)
    for(size_t i = 0; i < 4095; i++) {
        filter_prog[i].code = 0x20;
        filter_prog[i].jt = 0;
        filter_prog[i].jf = 0;
        filter_prog[i].k = 0;
    }
    // Finish with "ret #0x7fff0000" (0x06, allow) for 4096 instructions in total
    filter_prog[4095].code = 0x06;
    filter_prog[4095].jt = 0;
    filter_prog[4095].jf = 0;
    filter_prog[4095].k = 0x7fff0000U;
    struct sock_fprog filter = { 4096, filter_prog };
    printf("4096: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));
    // Grow the filter by one instruction and try again
    filter_prog[4095].code = 0x20;
    filter_prog[4095].k = 0;
    filter.len = 4097;
    filter_prog[4096].code = 0x06;
    filter_prog[4096].jt = 0;
    filter_prog[4096].jf = 0;
    filter_prog[4096].k = 0x7fff0000U;
    long int ret = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);
    printf("4097: %ld (%s)\n", ret, strerror(errno));
    return 0;
}
And a demo of the per-path limit (((4096+4+4)*7)+4036+4 == 32768):
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
struct sock_filter filter_prog[4097] = { 0 };
int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    // 4095 copies of "ld [0]" (0x20) followed by "ret #0x7fff0000" (0x06, allow)
    for(size_t i = 0; i < 4095; i++) {
        filter_prog[i].code = 0x20;
        filter_prog[i].jt = 0;
        filter_prog[i].jf = 0;
        filter_prog[i].k = 0;
    }
    filter_prog[4095].code = 0x06;
    filter_prog[4095].jt = 0;
    filter_prog[4095].jf = 0;
    filter_prog[4095].k = 0x7fff0000U;
    struct sock_fprog filter = { 4096, filter_prog };
    // Install seven maximum-size filters
    for(size_t i = 0; i < 7; i++) {
        printf("4096: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));
    }
    // A 4036-instruction filter fills the path up to exactly 32768
    filter_prog[4035].code = 0x06;
    filter_prog[4035].jt = 0;
    filter_prog[4035].jf = 0;
    filter_prog[4035].k = 0x7fff0000U;
    filter.len = 4036;
    printf("4036: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));
    // Even a one-instruction filter no longer fits
    filter_prog[0].code = 0x06;
    filter_prog[0].jt = 0;
    filter_prog[0].jf = 0;
    filter_prog[0].k = 0x7fff0000U;
    filter.len = 1;
    printf("1: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));
    return 0;
}
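The arithmetic behind the per-path demo can be checked without touching the kernel. The sketch below follows the accounting described in this section (4 prologue instructions for every filter, plus a 4 instruction penalty for each filter that is already attached). The helper names path_total and fits_on_path are hypothetical, and the constants are assumptions taken from the text rather than read from kernel headers.

```c
#include <stddef.h>

enum {
    PATH_LIMIT       = 32768, /* MAX_INSNS_PER_PATH as used in the demo above */
    PROLOGUE_INSNS   = 4,     /* overhead added to every filter when loaded */
    ATTACHED_PENALTY = 4      /* extra charge per already-attached filter */
};

/* Total charged against the per-path limit when attaching a filter of
 * new_len instructions on top of n already-attached filters. */
size_t path_total(const size_t *attached_lens, size_t n, size_t new_len) {
    size_t total = new_len + PROLOGUE_INSNS; /* no penalty for the new filter */
    for (size_t i = 0; i < n; i++)
        total += attached_lens[i] + PROLOGUE_INSNS + ATTACHED_PENALTY;
    return total;
}

/* Nonzero if the new filter would still fit on the path. */
int fits_on_path(const size_t *attached_lens, size_t n, size_t new_len) {
    return path_total(attached_lens, n, new_len) <= PATH_LIMIT;
}
```

With seven attached filters of 4096 instructions, path_total for a 4036-instruction filter comes out at exactly 32768, which is why that install still succeeds while the following one-instruction filter does not.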
Filter cache
When installing a filter, seccomp will attempt to discover and cache syscalls if possible. The cache is a bitmap where each bit records whether the filter always allows the corresponding syscall number. The caching works by emulating the BPF program for each syscall number it wants to cache, and the cache starts out either allowing everything or as a copy of the bitmap from the previous filter in the tree.
For each syscall number, if the bitmap already says that the syscall cannot be cached, it is skipped. Otherwise, the filter is emulated, and if the emulator cannot show that the filter always allows the syscall, the bitmap entry is set to uncacheable.
The emulation only understands a few BPF instructions, but it will follow jumps, including conditional ones. The only arithmetic instruction it understands is bitwise AND on the A register. If the filter executes any instruction other than a jump, a bitwise AND, a load of the syscall number/arch into the A register, or the final constant return, the emulator reports that the filter cannot be cached.
This filter is cacheable (see here for ASM syntax):
ld [0]
and #0xffff
jne #0x1, ok
ret #0x0
ok:
ret #0x7fff0000
On the other hand, this one is not because it uses an instruction the emulator doesn’t know despite being constant:
ld [0]
add #0x0
jne #0x1, ok
ret #0x0
ok:
ret #0x7fff0000
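To make the emulation rules concrete, here is a rough userspace sketch of them. It is written from the description in this section and is not the kernel's code; the function name is_const_allow is hypothetical, and it assumes the opcode macros from linux/filter.h and the struct seccomp_data layout from linux/seccomp.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_* opcode macros */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_ALLOW */

/* Returns true only if the filter can be shown to return ALLOW for the given
 * syscall number and arch using nothing but the instructions the emulator
 * understands. Any other instruction makes the result "not cacheable". */
bool is_const_allow(const struct sock_filter *prog, size_t len,
                    uint32_t nr, uint32_t arch) {
    uint32_t a = 0; /* emulated A register */
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];
        bool taken;
        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (insn->k == offsetof(struct seccomp_data, nr))
                a = nr;
            else if (insn->k == offsetof(struct seccomp_data, arch))
                a = arch;
            else
                return false; /* depends on arguments or instruction pointer */
            break;
        case BPF_ALU | BPF_AND | BPF_K:
            a &= insn->k;
            break;
        case BPF_JMP | BPF_JA:
            pc += insn->k;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            taken = (a == insn->k);
            pc += taken ? insn->jt : insn->jf;
            break;
        case BPF_JMP | BPF_JGE | BPF_K:
            taken = (a >= insn->k);
            pc += taken ? insn->jt : insn->jf;
            break;
        case BPF_JMP | BPF_JGT | BPF_K:
            taken = (a > insn->k);
            pc += taken ? insn->jt : insn->jf;
            break;
        case BPF_JMP | BPF_JSET | BPF_K:
            taken = (a & insn->k) != 0;
            pc += taken ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k == SECCOMP_RET_ALLOW;
        default:
            return false; /* unknown instruction, e.g. "add #0x0" */
        }
    }
    return false; /* ran off the end without returning */
}
```

Fed the first filter above, this sketch reports every syscall number except 1 as always allowed; fed the second, it gives up at the add instruction for every syscall number.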
Running filters
Whenever a system call is triggered and seccomp filters are installed, unless PTRACE_O_SUSPEND_SECCOMP is set or strict mode is enabled, filters are run on the system call. The process is as follows:
- Load argument data from registers as needed.
- Check the cache to see if the filter should return allow
- If so, use that as the return value
- Else, run all of the filters starting with the most recent (leaf node) and traversing up the tree. Take the return from each filter, and set the final return equal to that value if the action is a lower signed number. Do not consider the data part of the return value. Keep note of which filter is associated with the return (for the user notifications feature and logging).
- Handle the final return action. (Note that this list is not in order of priority.)
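The selection rule in the list above (lowest action wins, compared as a signed number, data ignored) can be sketched as follows. This is an illustrative model rather than kernel code: combine_filter_returns is a hypothetical name, the masks are assumed to match the action/data split of a filter return value, and the input array is assumed to be ordered from the most recent filter to the oldest.

```c
#include <stddef.h>
#include <stdint.h>

#define RET_ACTION_MASK 0xffff0000U /* action part of a filter return value */
#define RET_ALLOW       0x7fff0000U /* SECCOMP_RET_ALLOW */

/* rets[0] is the return of the most recently installed filter, rets[n-1]
 * that of the oldest. The result starts at ALLOW and is replaced only by a
 * strictly lower action, so on a tie the more recent filter's value (and
 * therefore its data) is kept. */
uint32_t combine_filter_returns(const uint32_t *rets, size_t n) {
    uint32_t best = RET_ALLOW;
    for (size_t i = 0; i < n; i++) {
        /* Signed comparison: actions with the top bit set, such as
         * SECCOMP_RET_KILL_PROCESS (0x80000000), sort below everything. */
        int32_t cur = (int32_t)(rets[i] & RET_ACTION_MASK);
        int32_t old = (int32_t)(best & RET_ACTION_MASK);
        if (cur < old)
            best = rets[i];
    }
    return best;
}
```

For example, if the two most recent filters return SECCOMP_RET_ERRNO with data 1 and 0 respectively, the combined result carries the data of the more recent one.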
Reading filters
A program can read the seccomp filters of a tracee by using the ptrace PTRACE_SECCOMP_GET_FILTER operation. This operation takes an argument specifying which seccomp filter to read, and a data argument pointing to where the filter should be stored. If the data argument is NULL, nothing is stored and only the number of instructions is returned. Here is an example program dumping the first filter of a given PID.
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
int main(int argc, char* argv[]) {
    if(argc < 2) exit(-1);
    pid_t pid = strtol(argv[1], NULL, 0);
    if(ptrace(PTRACE_ATTACH, pid, 0, 0) < 0) {
        printf("ptrace failed\n");
        printf("%s\n", strerror(errno));
        return -1;
    }
    // Wait for the tracee to stop before issuing further requests
    if(waitpid(pid, NULL, 0) < 0) {
        printf("waitpid failed\n");
        printf("%s\n", strerror(errno));
        return -1;
    }
    // With a NULL data argument, only the instruction count is returned
    int num_instructions = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
    if(num_instructions <= 0) {
        printf("Could not get filter #0\n");
        printf("%s\n", strerror(errno));
        return -1;
    }
    struct sock_filter* filter = calloc(num_instructions, sizeof(struct sock_filter));
    ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, filter);
    for(int i = 0; i < num_instructions; i++) {
        printf("code %x jt %x jf %x k %x\n", filter[i].code, filter[i].jt, filter[i].jf, filter[i].k);
    }
    free(filter);
    ptrace(PTRACE_DETACH, pid, 0, 0);
    return 0;
}
Seccomp BPF
- Any BPF assembly you see in this section is written using the syntax here to be assembled with bpf_asm.
As discussed, seccomp filters are classic BPF programs, but the return values (actions) of the program and inputs have special interpretations since the context is syscalls, not network packets.
Not every valid BPF program can be run in seccomp. Besides the limits on program size and the checks that ensure the program does not DoS the kernel or leak memory, only some instructions can be used in seccomp specifically.
A classic BPF program has a word size of 32 bits and operates on two registers, an accumulator A and register X, and 16 words of read-write memory along with read-only packet data. It returns a single word-sized action, and the kernel ensures the program will always return before allowing it to be run.
Instructions have 4 fields: the 2-byte opcode (only the lower 8 bits are used in practice), 1-byte jump targets jt and jf (condition true and condition false), and a 4-byte immediate value k.
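As a small illustration of this layout, the sketch below builds a four-instruction filter with the BPF_STMT and BPF_JUMP helper macros from linux/filter.h instead of raw numbers. The array name example_prog is made up for this example; the numeric opcodes in the comments are what the macros are expected to expand to.

```c
#include <linux/filter.h> /* struct sock_filter, BPF_STMT, BPF_JUMP, BPF_* */

/* Each entry is one 8-byte instruction: { code, jt, jf, k }. */
const struct sock_filter example_prog[] = {
    /* 0x20: ld [0]         load the syscall number into A */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 0),
    /* 0x15: jeq #0x1000    jt=0 (fall through if equal), jf=1 (skip one) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x1000, 0, 1),
    /* 0x06: ret #0x50000   SECCOMP_RET_ERRNO with data 0 */
    BPF_STMT(BPF_RET | BPF_K, 0x00050000),
    /* 0x06: ret #0x7fff0000 SECCOMP_RET_ALLOW */
    BPF_STMT(BPF_RET | BPF_K, 0x7fff0000),
};

/* Number of instructions, as would go into the len field of struct sock_fprog. */
const unsigned short example_prog_len =
    sizeof(example_prog) / sizeof(example_prog[0]);
```

Writing filters this way produces the same bytes as a raw initializer list, but keeps the opcode, jump offsets, and immediate visibly separate.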
Allowed instructions
The seccomp implementation provides strict restrictions on what BPF instruction can be used in a filter with a verifier pass that only allows a set allowlist of instruction codes.
Here is a list of allowed instructions with syntax/descriptions:
Code | Syntax | Meaning
0x20 | ld [off] | Load the word at byte offset off in the input into register A. Interestingly, the seccomp code rewrites this instruction into the otherwise invalid instruction ldx [off]. This indicates to the JIT compiler to emit special code for loading the seccomp packet data.
0x80 | ld len | Load the length of the arguments (sizeof(struct seccomp_data)) into the A register. Rewritten into ld #0x40.
0x81 | ldx len | Load the length of the arguments (sizeof(struct seccomp_data)) into the X register. Rewritten into ldx #0x40.
0x06 | ret #val | Return immediate value val.
0x16 | ret A | Return the value in register A.
0x04 | add #val | Add immediate value val to register A.
0x0C | add X | Add register X to register A.
0x14 | sub #val | Subtract immediate value val from register A.
0x1C | sub X | Subtract register X from register A.
0x24 | mul #val | Multiply register A by immediate value val.
0x2C | mul X | Multiply register A by register X.
0x34 | div #val | Divide register A by immediate value val.
0x3C | div X | Divide register A by register X.
0x44 | or #val | Bitwise OR register A with immediate value val.
0x4C | or X | Bitwise OR register A with register X.
0x54 | and #val | Bitwise AND register A with immediate value val.
0x5C | and X | Bitwise AND register A with register X.
0x64 | lsh #val | Left shift register A by immediate value val.
0x6C | lsh X | Left shift register A by register X.
0x74 | rsh #val | Right shift register A by immediate value val.
0x7C | rsh X | Right shift register A by register X.
0x84 | neg | Negate the value in register A (two's complement, A = -A).
0xA4 | xor #val | Bitwise XOR register A with immediate value val.
0xAC | xor X | Bitwise XOR register A with register X.
0x00 | ld #val | Load immediate value val into register A.
0x01 | ldx #val | Load immediate value val into register X.
0x07 | tax | Copy register A into register X.
0x87 | txa | Copy register X into register A.
0x60 | ld M[off] | Load memory at immediate address off into register A.
0x61 | ldx M[off] | Load memory at immediate address off into register X.
0x02 | st M[off] | Store register A into memory at immediate address off.
0x03 | stx M[off] | Store register X into memory at immediate address off.
0x05 | ja label | Jump to label label, with the offset stored in the k value (allowing for far jumps).
0x15 | jeq #val, label | Jump to label label if register A is equal to immediate value val.
0x1d | jeq %x, label | Jump to label label if register A is equal to register X.
0x25 | jgt #val, label | Jump to label label if register A is greater than immediate value val.
0x2d | jgt %x, label | Jump to label label if register A is greater than register X.
0x35 | jge #val, label | Jump to label label if register A is greater than or equal to immediate value val.
0x3d | jge %x, label | Jump to label label if register A is greater than or equal to register X.
0x45 | jset #val, label | Jump to label label if register A bitwise AND immediate value val != 0.
0x4d | jset %x, label | Jump to label label if register A bitwise AND register X != 0.
Jump instructions above are listed without a jf field, and are the same as having a jf of 0 targeting the normal next instruction. The syntax for having a false label for supported instructions is: jeq cond, true, false.
Some jump instructions in the assembly syntax (jneq, jne, jlt, jle) are compiled to the inverse condition with jf set instead of jt, and an implicit jt of 0.
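The inverse-condition encoding can be written out by hand. The following sketch assumes the assembler behaves as described above; the encode_* helper names are hypothetical, and skip is the number of instructions to jump over when the (negated) condition holds.

```c
#include <linux/filter.h> /* struct sock_filter, BPF_JUMP, BPF_* */

/* "jne #val, label": encoded as jeq with jt = 0 and jf = distance to label. */
struct sock_filter encode_jne(unsigned int val, unsigned char skip) {
    struct sock_filter insn = BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, val, 0, skip);
    return insn;
}

/* "jlt #val, label": A < val is the inverse of A >= val, so use jge. */
struct sock_filter encode_jlt(unsigned int val, unsigned char skip) {
    struct sock_filter insn = BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, val, 0, skip);
    return insn;
}

/* "jle #val, label": A <= val is the inverse of A > val, so use jgt. */
struct sock_filter encode_jle(unsigned int val, unsigned char skip) {
    struct sock_filter insn = BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, val, 0, skip);
    return insn;
}
```

So the line jne #0x1, ok from the cacheable filter earlier, with ok one instruction ahead, would come out as { 0x15, 0, 1, 0x1 }.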
Verification constraints
There is code in the kernel that ensures BPF code matches tight constraints, because it is JIT-ed in the kernel and runs there. The way the BPF verifier works is complex and outside the scope of this article, as plenty has already been written about it. The kernel docs also give a good description of it. For writing seccomp rules, there are a few things that one should keep in mind:
- Loops are disallowed: classic BPF jump offsets are unsigned, so every jump goes forward and control flow is effectively flat
- Memory cannot be read uninitialized
- There cannot be unreachable code
Memory
BPF programs have access to 16 32-bit words of memory addressed by M[#]. As above, the verifier makes sure that memory cannot be loaded before being written to at least once.
Cannot load because of verifier:
ld M[0]
ret #0
Out of bounds:
ld #0
st M[16]
ret #0
Storing A/X without explicitly loading them first is okay, since they are initialized by the prologue:
st M[0]
stx M[1]
ret #0
And this is valid as you might expect:
ld #0
st M[0]
ld M[0]
ret #0
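The memory rule can be modelled with a bitmask of slots that have been written so far. The sketch below is deliberately simplified and is not how the kernel implements it: the name scratch_use_ok is hypothetical, it only handles straight-line programs, and it conservatively rejects anything containing a jump rather than tracking state per branch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h> /* struct sock_filter, BPF_MEMWORDS, BPF_* */

/* Returns true if a straight-line program never loads an M[] slot before
 * storing to it and never addresses a slot outside 0..BPF_MEMWORDS-1. */
bool scratch_use_ok(const struct sock_filter *prog, size_t len) {
    uint32_t written = 0; /* bit i set => M[i] has been stored to */
    for (size_t pc = 0; pc < len; pc++) {
        uint16_t code = prog[pc].code;
        uint32_t k = prog[pc].k;
        switch (code) {
        case BPF_ST:  /* st M[k] */
        case BPF_STX: /* stx M[k] */
            if (k >= BPF_MEMWORDS)
                return false; /* out of bounds */
            written |= 1u << k;
            break;
        case BPF_LD | BPF_MEM:  /* ld M[k] */
        case BPF_LDX | BPF_MEM: /* ldx M[k] */
            if (k >= BPF_MEMWORDS || !(written & (1u << k)))
                return false; /* out of bounds or read before write */
            break;
        case BPF_RET | BPF_K:
        case BPF_RET | BPF_A:
            return true; /* reached a return without violations */
        default:
            if (BPF_CLASS(code) == BPF_JMP)
                return false; /* branches are not modelled in this sketch */
            break;
        }
    }
    return false; /* no return instruction */
}
```

Applied to the four snippets above, this model rejects the first two and accepts the last two.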
Arguments
The filter has access to 16 32-bit words of read-only data (enforced because BPF doesn’t even have store instructions for packet data). The data is byte-addressed, but can only be read at offsets aligned to 32-bit boundaries. Interestingly, you can only read this data into the A register. The layout of the data is defined by struct seccomp_data.
For reference, from the view of the BPF program, the arguments are as follows:
Offset | Value
0 | Syscall number
4 | Arch-dependent value
8 | Instruction pointer, lower 32 bits
12 | Instruction pointer, higher 32 bits
16 | Syscall argument 1, lower 32 bits
20 | Syscall argument 1, higher 32 bits
24 | Syscall argument 2, lower 32 bits
28 | Syscall argument 2, higher 32 bits
32 | Syscall argument 3, lower 32 bits
36 | Syscall argument 3, higher 32 bits
40 | Syscall argument 4, lower 32 bits
44 | Syscall argument 4, higher 32 bits
48 | Syscall argument 5, lower 32 bits
52 | Syscall argument 5, higher 32 bits
56 | Syscall argument 6, lower 32 bits
60 | Syscall argument 6, higher 32 bits
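Rather than hard-coding these offsets, they can be derived from struct seccomp_data with offsetof. The helpers below are a sketch with made-up names (arg_lo_offset, arg_hi_offset); they take a 0-based argument index and assume a little-endian machine, which is also what the lower/higher split in the table assumes.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h> /* struct seccomp_data */

/* Byte offset of the low 32 bits of syscall argument n (n = 0..5),
 * assuming little-endian byte order. */
size_t arg_lo_offset(unsigned int n) {
    return offsetof(struct seccomp_data, args) + n * sizeof(uint64_t);
}

/* Byte offset of the high 32 bits of syscall argument n (n = 0..5),
 * assuming little-endian byte order. */
size_t arg_hi_offset(unsigned int n) {
    return arg_lo_offset(n) + sizeof(uint32_t);
}
```

The results can be used directly as the k value of a ld [off] instruction, for example arg_lo_offset(0) for the low word of the first argument.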
Actions
The kernel will take different actions based on the result of running the filters. As mentioned above in the Filters section, the final result is the return value whose action is the lowest signed number; on a tie, the value from the most recently installed filter with that action is kept, including its data. For example, if several filters return SECCOMP_RET_ERRNO with different error numbers, only the most recent filter's error number is actually used.
There is an accurate list of actions in the kernel docs, so I will not cover them here. One interesting behaviour is that returning SECCOMP_RET_ERRNO with a data of 0 causes the syscall to silently return 0; with any other data value it returns -1, as one might expect with an error.
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
// Syscall 0x1000 returns SECCOMP_RET_ERRNO with data 0; everything else is allowed
struct sock_filter filter_prog1[] = {
    { 0x20, 0, 0, 0x00000000 },
    { 0x15, 0, 1, 0x00001000 },
    { 0x06, 0, 0, 0x00050000 },
    { 0x06, 0, 0, 0x7fff0000 },
};
// Syscall 0x1001 returns SECCOMP_RET_ERRNO with data 1; everything else is allowed
struct sock_filter filter_prog2[] = {
    { 0x20, 0, 0, 0x00000000 },
    { 0x15, 0, 1, 0x00001001 },
    { 0x06, 0, 0, 0x00050001 },
    { 0x06, 0, 0, 0x7fff0000 },
};
int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_fprog filter = { 4, filter_prog1 };
    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);
    filter.filter = filter_prog2;
    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);
    printf("errno=0: %ld\n", syscall(0x1000));
    printf("errno=1: %ld\n", syscall(0x1001));
    return 0;
}
In addition, a return value of SECCOMP_RET_TRACE can allow the tracer to change the syscall that is made. If this happens, the filters will be run one more time, but cannot trigger another (recursive) trace event.
vsyscalls
vsyscalls are a (relatively obsolete) kernel feature that ran certain syscalls in userspace. The only thing about them that particularly matters for seccomp nowadays is that emulated vsyscalls are handled somewhat separately from normal syscalls, and a tracer cannot change the syscall number (if it does, the kernel kills the process with SIGSYS, which is better than continuing silently). vDSO, the more modern equivalent, does not do this.
Special actions
There are some actions that have more complex behaviour than killing or allowing a system call to run.
Tracing
SECCOMP_RET_TRACE sends a ptrace event PTRACE_EVENT_SECCOMP to a tracer. The tracer can then change the registers so that a different syscall is triggered instead, or so that the arguments are changed. If no tracer exists, the syscall fails with ENOSYS.
User Notifications
SECCOMP_RET_USER_NOTIF causes seccomp to send a message to userspace over a file descriptor and wait for a response. The userspace handler can either cause the syscall to be dropped with an error number, or let it continue as normal. The userspace handler uses ioctls to read and reply to messages. When using this feature, note that trying to filter syscalls by dereferencing argument pointers is dangerous, as a program can change the data at that location between the time it is read by the handler and the time it is read by the syscall.
Here is an example of a C program using this to log syscalls matching a certain syscall number (note that you need seccomp.h here since userspace headers may be too old to compile this):
#include <unistd.h>
#include "seccomp.h"
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <pthread.h>
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
struct sock_filter filter_prog[] = {
{ 0x20, 0, 0, 0000000000},
{ 0x15, 0, 1, 0x0000beef},
{ 0x06, 0, 0, 0x7fc0dead},
{ 0x06, 0, 0, 0x7fff0000},
};
void* watcher_f(void* arg) {
    int fd = (int)(long)arg;
    struct seccomp_notif req = {};
    struct seccomp_notif_resp resp = {};
    while(1) {
        // Block until a filtered syscall raises a notification
        if(ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
            return 0;
        }
        printf("received a syscall! arg[0] = %llx\n", req.data.args[0]);
        // Reply to the notification identified by req.id
        resp.id = req.id;
        resp.flags = 0;
        resp.error = -1;
        resp.val = -1;
        if(ioctl(fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0) {
            printf("resp ioctl failed\n");
            return 0;
        }
    }
}
int main(int argc, char* argv[]) {
    pthread_t watcher;
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_fprog filter = { 4, filter_prog };
    // With NEW_LISTENER set, a successful install returns the listener FD
    int fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &filter);
    pthread_create(&watcher, NULL, watcher_f, (void*)(long)fd);
    syscall(0xbeef, 0x12345678, 1, 2, 3, 4, 5);
    unsigned long ret;
    pthread_join(watcher, (void**)(&ret));
    return 0;
}
One of the more interesting features in this system is that, using the SECCOMP_IOCTL_NOTIF_ADDFD ioctl, the user notification handler can add a new file descriptor into the file descriptor table of the seccomp-ed thread by duplicating one of the handler’s.
Logs
Returning SECCOMP_RET_LOG causes a syscall to be logged to the audit log and then executed as normal. Here is an example C program showing what that looks like:
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
// A single "ret" returning SECCOMP_RET_LOG (0x7ffc0000) with data 0xdead
struct sock_filter filter_prog[] = {
    { 0x06, 0, 0, 0x7ffcdead },
};
int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_fprog filter = { 1, filter_prog };
    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);
    syscall(0xbeef, 0, 1, 2, 3, 4, 5);
    printf("Logged syscall 0xbeef...\n");
    return 0;
}
And in the logs:
[203837.418474] audit: type=1326 audit(1673903108.170:3112): auid=1000 uid=1000 gid=1000 ses=2 pid=1226 comm="log" exe="/path/to/exe" sig=0 arch=c000003e syscall=48879 compat=0 ip=0x7f86a4ec9539 code=0x7ffc0000
Note that arguments and the data part of the return are not logged anywhere.
Tools
Here are some useful tools for working with seccomp programs.
bpf_asm
An assembler provided in the kernel source tree. It translates the syntax as given in the kernel documentation into either a raw string of numbers or a C initializer list.
bpf_dbg
A BPF debugger that supports breakpoints and, in theory, mimics the kernel. It is designed for debugging with network packets and not syscall arguments, though.
seccomp-tools
A favorite of CTF players, seccomp-tools can dump, disassemble, and emulate seccomp BPF filters. It can disassemble raw 8-byte seccomp instructions, and assemble with its own (pretty friendly) syntax. It dumps rules by using ptrace after executing the program. You need to specify the -l flag to have it dump more than one filter.
Overview of BPF in Seccomp
This article aims to give a comprehensive overview of BPF programs in seccomp and what they can do, as a companion to existing documentation. Seccomp (Secure Computing) is a somewhat popular security feature in the Linux kernel that allows a thread to restrict what syscalls it is allowed to make. For this article, I will assume you are familiar with the basic usage and justification for seccomp. I reccomend the manpage and the kernel docs if you are not familiar with seccomp. In addition, there are already many good articles already published about how to use/not use seccomp for security, so this article aims to primarily cover how seccomp filters themselves run. I will also not be discussing the security implications of BPF running in the kernel.
Operations dealing with seccomp are handled through a system call of its own,
SYS_seccomp
(nr. 0x13D). The implementation is handled by a single C file,seccomp.c
. I will link to source lines in this file during the article as needed. The function ran by the kernel when a syscall is made is__secure_computing
. The entrypoint for this is here.Seccomp Operations
The valid operations for the seccomp syscall are defined in this header. The
SECCOMP_SET_MODE_STRICT
andSECCOMP_SET_MODE_FILTER
operations set the seccomp mode of the thread. Once a thread is set to strict mode, filters are not ran and the thread cannot be changed into a more permissive state. Once installed, filters cannot be removed.It is also possible to use
prctl
to set the seccomp mode. Since the implementation is effectively a thin wrapper around the same handler of theSYS_seccomp
syscall, I will not discuss it further.Strict Mode
Strict mode is the “original” seccomp mode, and restricts syscalls to
read
,write
,exit
, andsigreturn
(possibly other arch-specific syscalls ifCONFIG_COMPAT
is defined, e.g. 32-bit equivalents). The arguments to the syscall are not checked. If the syscall is not in the above allowlist, the thread is killed.Calling the seccomp syscall with
SECCOMP_SET_MODE_STRICT
and empty flags/arguments immediately* turns on strict mode. Here is an example of a C program using strict mode:* unless PT_SUSPEND_SECCOMP is set
Filter Mode
Filters are classic BPF programs that can be installed with the
SECCOMP_SET_MODE_STRICT
operation. See the next Filters section for all the details. Note that each call toSECCOMP_SET_MODE_FILTER
installs a new filter.Querying feature info
There are 2 operations for querying the support of seccomp actions and struct sizes,
SECCOMP_GET_ACTION_AVAIL
andSECCOMP_GET_NOTIF_SIZES
. Seccomp actions are potential return values of a seccomp filter that correspond to different kernel actions. These can be used by a program to determine if it can install a desired seccomp filter and have the return values behave as expected, and to determine the size of seccomp notification structures. (The details of the seccomp user notification feature and actions are in other areas of this document.)Here is an example C program showing usage of
SECCOMP_GET_ACTION_AVAIL
(the defined actions are taken from here):Filters
Seccomp filters are generally used just to implement whitelists or blacklists of allowed syscalls, but as BPF programs they are able to perform quite powerful computation on input they are given. However, seccomp filters are not able to access the same functionality as modern eBPF (extended BPF) networking filters. They are allowed a subset of classic BPF (henceforth just BPF) instructions.
Filters are actually a tree structure where each filter is a node and a thread traverses its most recent filter (leaf) upwards through the tree. This means that new threads inherit filters from their creators but do not share new ones without the TSYNC flag.
Installing filters
To install a filter, use the
SECCOMP_SET_MODE_FILTER
operation to the seccomp syscall and pass flags and astruct sock_fprog
program. The seccomp manpage describes some flags well, so I will not repeat them here, but there are some undocumented ones. The caller must have theCAP_SYS_ADMIN
capability or setNO_NEW_PRIVS
before trying to install a filter in order to avoid bypasses. The kernel has some checks to ensure a thread cannot get around this restriction.The steps taken when installing a filter are:
NO_NEW_PRIVS
orCAP_SYS_ADMIN
bpf_prog_create_from_user
, with a custom handler. The kernel JIT-s the filter in this step.SECCOMP_FILTER_FLAG_NEW_LISTENER
flag is set, allocate a new listener FDUndocumented filter flags
There are two filter flags not documented in the syscall manpage:
SECCOMP_FILTER_FLAG_TSYNC_ESRCH
tells the kernel to return ESRCH on a TSYNC (thread sync) error to avoid conflicts with a returned file descriptor fromSECCOMP_FILTER_FLAG_NEW_LISTENER
.SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
is documented in the kernel API docs and has to do with the user notifications feature:In other words, this helps prevent notifications from being sent and then becoming invalid before a response is sent.
Filter limits
There are two limits on how much filter code can be installed on a thread. The first is BPF_MAXINSNS (4096), the general limit for untrusted BPF programs: a single filter can contain at most this many instructions. The second is MAX_INSNS_PER_PATH (32768), the maximum sum of all instructions on a thread's path in its filter tree. As the manpage mentions, 4 extra instructions are counted towards this limit for every filter already on the path, but not for the filter being attached. Separately, every filter, including the one being attached, carries a 4-instruction overhead from a prologue added by the cBPF filter load code.
Here is a demonstration of the per-filter limit:
And a demo of the per-path limit (((4096+4+4)*7)+4036+4 == 32768):
Filter cache
When installing a filter, seccomp tries to work out which syscalls the filter always allows and caches the result. The cache is a bitmap where each bit records whether the filter always allows the corresponding syscall number. The bitmap starts out either allowing everything or as a copy of the bitmap of the previous filter in the tree, and the kernel then emulates the BPF program for each syscall number it wants to cache.
For each syscall number, if the bitmap already says that the syscall cannot be cached, it is skipped. Otherwise the filter is emulated, and if the emulator cannot show that the filter always allows the syscall, the bitmap entry is set to uncacheable.
The emulator only understands a few BPF instructions, but it does follow jumps, including conditional ones. The only arithmetic instruction it understands is bitwise AND on the A register. If it encounters any instruction other than a jump, a bitwise AND, a return, or a load of the syscall number or arch into the A register, it reports that the filter cannot be cached.
This filter is cacheable (see here for ASM syntax):
    ld [0]              ; A = syscall nr
    and #0xffff         ; A &= 0xffff
    jne #0x1, ok
    ret #0x0            ; terminate
    ok: ret #0x7fff0000 ; allow
On the other hand, this one is not because it uses an instruction the emulator doesn’t know despite being constant:
    ld [0]              ; A = syscall nr
    add #0x0            ; A += 0
    jne #0x1, ok
    ret #0x0            ; terminate
    ok: ret #0x7fff0000 ; allow
Running filters
Whenever a system call is made and seccomp filters are installed, the filters are run on the system call, unless PTRACE_O_SUSPEND_SECCOMP is set or strict mode is enabled. The process is as follows:
Reading filters
A program can read the seccomp filters of a tracee by using the ptrace PTRACE_SECCOMP_GET_FILTER operation. This operation takes an argument specifying which seccomp filter to read, and stores the filter at the location given by the data ptrace argument if that argument is not NULL. Here is an example program dumping the first filter of a given PID.
Seccomp BPF
As discussed, seccomp filters are classic BPF programs, but the return values (actions) of the program and inputs have special interpretations since the context is syscalls, not network packets.
Not every valid BPF program can be run in seccomp. Besides the limits on program size and the checks that ensure a program cannot DoS the kernel or leak memory, only some instructions can be used in seccomp specifically.
A classic BPF program has a word size of 32 bits and operates on two registers, an accumulator A and register X, plus 16 words of read-write memory and the read-only packet data. It returns a single word-sized action, and the kernel ensures that the program always returns before allowing it to be run.
Instructions have 4 fields: the 2-byte opcode (only the lower 8 bits are used in practice), the 1-byte jump targets jt and jf (condition true and condition false), and the 4-byte immediate value k.
Allowed instructions
The seccomp implementation strictly restricts which BPF instructions can be used in a filter: a verifier pass accepts only a fixed allowlist of instruction codes.
Here is a list of allowed instructions with syntax/descriptions:
- ld [off]: Load the 32-bit word at offset off in the input into register A. Interestingly, the seccomp code rewrites this instruction into an invalid instruction, ldx [off]. This indicates to the JIT compiler to emit special code for loading the seccomp packet data.
- ld len: Load the length of the input (sizeof(struct seccomp_data)) into the A register. Rewritten into ld #0x40.
- ldx len: Load the length of the input (sizeof(struct seccomp_data)) into the X register. Rewritten into ldx #0x40.
- ret #val: Return immediate value val.
- ret A: Return the value of register A.
- add #val: Add immediate value val to register A.
- add X: Add register X to register A.
- sub #val: Subtract immediate value val from register A.
- sub X: Subtract register X from register A.
- mul #val: Multiply register A by immediate value val.
- mul X: Multiply register A by register X.
- div #val: Divide register A by immediate value val.
- div X: Divide register A by register X.
- or #val: Bitwise OR register A by immediate value val.
- or X: Bitwise OR register A by register X.
- and #val: Bitwise AND register A by immediate value val.
- and X: Bitwise AND register A by register X.
- lsh #val: Left shift register A by immediate value val.
- lsh X: Left shift register A by register X.
- rsh #val: Right shift register A by immediate value val.
- rsh X: Right shift register A by register X.
- neg: Negate register A (two's complement, i.e. A = -A).
- xor #val: Bitwise XOR register A by immediate value val.
- xor X: Bitwise XOR register A by register X.
- ld #val: Load immediate value val into register A.
- ldx #val: Load immediate value val into register X.
- tax: Copy register A into register X.
- txa: Copy register X into register A.
- ld M[off]: Load the memory word at immediate address off into register A.
- ldx M[off]: Load the memory word at immediate address off into register X.
- st M[off]: Store register A into memory at immediate address off.
- stx M[off]: Store register X into memory at immediate address off.
- ja label: Jump unconditionally to label, with the address in the k value (allowing for far jumps).
- jeq #val, label: Jump to label if register A is equal to immediate value val.
- jeq %x, label: Jump to label if register A is equal to register X.
- jgt #val, label: Jump to label if register A is greater than immediate value val.
- jgt %x, label: Jump to label if register A is greater than register X.
- jge #val, label: Jump to label if register A is greater than or equal to immediate value val.
- jge %x, label: Jump to label if register A is greater than or equal to register X.
- jset #val, label: Jump to label if register A bitwise AND immediate value val != 0.
- jset %x, label: Jump to label if register A bitwise AND X != 0.
Jump instructions above are listed without a jf field, and are the same as having a jf of 0 targeting the normal next instruction. The syntax for having a false label for supported instructions is: jeq cond, true, false.
Some jump instructions in the assembly syntax, jneq, jne, jlt, jle, are compiled to the inverse condition with jf set instead of jt and an implicit jt of 0.
Verification constraints
There is code in the kernel that ensures BPF code matches tight constraints, because it is JIT-ed in the kernel and runs there. The way the BPF verifier works is complex and outside the scope of this article, as plenty has already been written about it. The kernel docs also give a good description of it. For writing seccomp rules, there are a few things that one should keep in mind:
Memory
BPF programs have access to 16 32-bit words of memory addressed by M[#]. As above, the verifier makes sure that memory cannot be loaded before being written to at least once.
Cannot load because of verifier:
    ld M[0]  ; not stored so invalid
    ret #0
Out of bounds:
    ld #0
    st M[16] ; max is 15
    ret #0
Storing A/X without explicitly loading them first is okay since they are initialized by the prologue:
    st M[0]
    stx M[1]
    ret #0
And this is valid as you might expect:
    ld #0
    st M[0]  ; store properly done first
    ld M[0]
    ret #0
Arguments
The filter has access to 16 32-bit words of read-only data (enforced because BPF doesn't even have store instructions for packet data). The data is byte-addressed, but may only be read at offsets aligned to 32-bit boundaries. Interestingly, you can only read this data into the A register. The data is laid out as defined in struct seccomp_data.
For posterity, from the view of the BPF program, the arguments are as follows:
Actions
The kernel will take different actions based on the result of running the filters. As mentioned above in the Filters section, the result that takes effect is the return value with the lowest action (the action bits are compared as a signed value); if several filters return that action, the data value of the most recently installed one is used. For example, if filters return different error numbers, only the most recent filter's will actually be used.
There is an accurate list of actions in the kernel docs, so I will not cover them here. One interesting behaviour is that returning SECCOMP_RET_ERRNO with a data of 0 causes the syscall to return 0 silently; with any other data it returns -1, as one might expect with an error.
In addition, a return value of SECCOMP_RET_TRACE can allow the tracer to change the syscall that is made. If this happens, the filters will be run one more time, but cannot trigger another (recursive) trace event.
vsyscalls
vsyscalls are a relatively old (and largely obsolete) kernel feature that ran certain syscalls in userspace. The only thing about them that particularly matters for seccomp nowadays is that emulated vsyscalls are handled somewhat separately from normal syscalls, and a tracer cannot change the syscall number (if it does, the kernel kills the process with SIGSYS, which is better than continuing silently). vDSO, the more modern equivalent, does not do this.
Special actions
There are some actions that have more complex behaviour than killing or allowing a system call to run.
Tracing
SECCOMP_RET_TRACE sends a PTRACE_EVENT_SECCOMP ptrace event to a tracer. If no tracer exists, the syscall fails with ENOSYS. The tracer can then change the registers, so that a different syscall is triggered instead or the arguments are changed.
User Notifications
SECCOMP_RET_USER_NOTIF causes seccomp to send a message to userspace over a file descriptor and wait for a response. The userspace handler can either cause the syscall to be dropped with an error number, or let it continue as normal. The handler uses ioctls to read and reply to messages. When using this feature, note that trying to filter syscalls by dereferencing argument pointers is dangerous, as a program can change the data behind the pointer between the time it is read by the handler and the time it is read by the syscall.
Here is an example of a C program using this to log syscalls matching a certain syscall number (note that you need seccomp.h here since userspace headers may be too old to compile this):
One of the more interesting features in this system is that, using the SECCOMP_IOCTL_NOTIF_ADDFD ioctl, the user notification handler can add a new file descriptor into the file descriptor table of the seccomp-ed thread by duplicating one of the handler's.
Logs
Returning SECCOMP_RET_LOG causes a syscall to be logged to the audit log after being executed. Here is an example C program showing what that looks like:
And in the logs:
Note that arguments and the data part of the return are not logged anywhere.
Tools
Here are some useful tools to work with seccomp programs.
bpf_asm
An assembler provided in the kernel source tree. It translates the syntax as given in the kernel documentation into either a raw string of numbers or a C initializer list.
bpf_dbg
A BPF debugger that supports breakpoints and, in theory, mimics the kernel. It is designed for debugging with network packets and not syscall arguments, though.
seccomp-tools
A favorite of CTF players, seccomp-tools can dump, disassemble, and emulate seccomp BPF filters. It can disassemble raw 8-byte seccomp instructions, and assemble with its own (pretty friendly) syntax. It dumps rules by using ptrace after executing the program. You need to specify the -l flag to have it dump more than one filter.