Overview of BPF in Seccomp

This article aims to give a comprehensive overview of BPF programs in seccomp and what they can do, as a companion to existing documentation. Seccomp (Secure Computing) is a somewhat popular security feature in the Linux kernel that allows a thread to restrict which syscalls it is allowed to make. I will assume you are familiar with the basic usage of and justification for seccomp; if you are not, I recommend the manpage and the kernel docs. There are also many good articles already published about how to use (or not use) seccomp for security, so this article primarily covers how seccomp filters themselves run. I will not be discussing the security implications of BPF running in the kernel.

Operations dealing with seccomp are handled through a system call of its own, SYS_seccomp (nr. 0x13D on x86-64). The implementation lives in a single C file, seccomp.c. I will link to source lines in this file during the article as needed. The function run by the kernel when a syscall is made is __secure_computing. The entrypoint for this is here.

Seccomp Operations

The valid operations for the seccomp syscall are defined in this header. The SECCOMP_SET_MODE_STRICT and SECCOMP_SET_MODE_FILTER operations set the seccomp mode of the thread. Once a thread is set to strict mode, filters are not run and the thread cannot be changed into a more permissive state. Once installed, filters cannot be removed.

It is also possible to use prctl to set the seccomp mode. Since the implementation is effectively a thin wrapper around the same handler of the SYS_seccomp syscall, I will not discuss it further.

Strict Mode

Strict mode is the “original” seccomp mode, and restricts syscalls to read, write, exit, and sigreturn (possibly other arch-specific syscalls if CONFIG_COMPAT is defined, e.g. 32-bit equivalents). The arguments to the syscall are not checked. If the syscall is not in the above allowlist, the thread is killed.

Calling the seccomp syscall with SECCOMP_SET_MODE_STRICT and empty flags/arguments immediately* turns on strict mode. Here is an example of a C program using strict mode:

#include <unistd.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    printf("Installing seccomp\n");

    // Note we do not need to set NO_NEW_PRIVS with strict mode
    syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL);

    printf("Writing to stdout is still okay...\n");

    // ...but not anything else
    struct stat unused;
    fstat(0, &unused); // gets sigkill-ed

    return 0;
}

* unless PT_SUSPEND_SECCOMP is set

Filter Mode

Filters are classic BPF programs that can be installed with the SECCOMP_SET_MODE_FILTER operation. See the Filters section below for all the details. Note that each call to SECCOMP_SET_MODE_FILTER installs a new filter in addition to any already installed.

Querying feature info

There are two operations for querying seccomp support: SECCOMP_GET_ACTION_AVAIL reports whether a given action is supported, and SECCOMP_GET_NOTIF_SIZES reports the sizes of the seccomp notification structures. Seccomp actions are the possible return values of a seccomp filter, each corresponding to a different kernel action. A program can use these operations to determine whether it can install a desired filter and have its return values behave as expected, and to determine how much space the notification structures need. (The details of the seccomp user notification feature and of actions are in other areas of this document.)

Here is an example C program showing usage of SECCOMP_GET_ACTION_AVAIL (the defined actions are taken from here):

#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>

#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

const unsigned int actions[] = { SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_KILL_THREAD, SECCOMP_RET_KILL, SECCOMP_RET_TRAP, SECCOMP_RET_ERRNO, SECCOMP_RET_USER_NOTIF, SECCOMP_RET_TRACE, SECCOMP_RET_LOG, SECCOMP_RET_ALLOW, 0xdeadbeef };
const char* action_names[] = { "SECCOMP_RET_KILL_PROCESS", "SECCOMP_RET_KILL_THREAD", "SECCOMP_RET_KILL", "SECCOMP_RET_TRAP", "SECCOMP_RET_ERRNO", "SECCOMP_RET_USER_NOTIF", "SECCOMP_RET_TRACE", "SECCOMP_RET_LOG", "SECCOMP_RET_ALLOW", "Invalid action 0xdeadbeef" };

const size_t action_count = sizeof(actions) / sizeof(actions[0]);

int main(int argc, char* argv[]) {
    unsigned int action = 0;
    for(size_t i = 0; i < action_count; i++) {
        printf("%s: ", action_names[i]);
        action = actions[i];
        int val = syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action);
        if(val < 0) {
            if(errno == EOPNOTSUPP) {
                printf("No\n");
            } else {
                printf("(unknown error)\n");
            }
        } else printf("Yes\n");
    }
}

Filters

Seccomp filters are generally used just to implement whitelists or blacklists of allowed syscalls, but as BPF programs they are able to perform quite powerful computation on input they are given. However, seccomp filters are not able to access the same functionality as modern eBPF (extended BPF) networking filters. They are allowed a subset of classic BPF (henceforth just BPF) instructions.

Filters are actually a tree structure where each filter is a node and a thread traverses its most recent filter (leaf) upwards through the tree. This means that new threads inherit filters from their creators but do not share new ones without the TSYNC flag.

Installing filters

To install a filter, use the SECCOMP_SET_MODE_FILTER operation to the seccomp syscall and pass flags and a struct sock_fprog program. The seccomp manpage describes some flags well, so I will not repeat them here, but there are some undocumented ones. The caller must have the CAP_SYS_ADMIN capability or set NO_NEW_PRIVS before trying to install a filter in order to avoid bypasses. The kernel has some checks to ensure a thread cannot get around this restriction.

The steps taken when installing a filter are:

Verify the filter flag combinations make sense
Prepare the filter, which involves:
- Ensure NO_NEW_PRIVS or CAP_SYS_ADMIN
- Copy the filter into kernel memory from userspace (allocating a new buffer)
- Create the actual BPF program from the filter, using bpf_prog_create_from_user, with a custom handler. The kernel JIT-s the filter in this step.
If the SECCOMP_FILTER_FLAG_NEW_LISTENER flag is set, allocate a new listener FD
Try to attach the filter, which does the following:
- Check filter limits
- Try to prepare the filter cache
- Attach the filter to the tree and sync threads if needed
Assign the new seccomp mode. Note that this does not appear to be able to overwrite strict mode, as the mode is per thread and strict mode disallows the syscalls needed to change it

Undocumented filter flags

There are two filter flags not documented in the syscall manpage:

SECCOMP_FILTER_FLAG_TSYNC_ESRCH tells the kernel to return ESRCH on a TSYNC (thread sync) error to avoid conflicts with a returned file descriptor from SECCOMP_FILTER_FLAG_NEW_LISTENER.

SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV is documented in the kernel API docs and has to do with the user notifications feature:

Alternatively, at filter installation time, the SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV flag can be set. This flag makes it such that when a user notification is received by the supervisor, the notifying process will ignore non-fatal signals until the response is sent. Signals that are sent prior to the notification being received by userspace are handled normally.

In other words, this helps prevent notifications from being sent and then becoming invalid before a response is sent.

Filter limits

There are two limits on the number of filter instructions that can be installed on a thread. The first is BPF_MAXINSNS (4096), the general limit for untrusted BPF programs: each filter can contain at most this many instructions. The second is MAX_INSNS_PER_PATH (32768), the maximum sum of all instructions on a thread’s path in its filter tree. As the manpage mentions, each filter is counted with an additional 4 instructions towards this limit; however, this penalty is not applied to the filter being attached. Note that there is a further 4-instruction overhead per filter due to a prologue added by the cBPF filter load code.

Here is a demonstration of the per-filter limit:

#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

struct sock_filter filter_prog[4097] = { 0 };

int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    // Initialize a filter with 4095 no-ops (0x20 is `ld [0]`, whose result is never used) and a ret
    for(size_t i = 0; i < 4095; i++) {
        filter_prog[i].code = 0x20; // nop
        filter_prog[i].jt = 0;
        filter_prog[i].jf = 0;
        filter_prog[i].k = 0;
    }
    filter_prog[4095].code = 0x06; // ret
    filter_prog[4095].jt = 0;
    filter_prog[4095].jf = 0;
    filter_prog[4095].k = 0x7fff0000U; // allow
    struct sock_fprog filter = { 4096, filter_prog };
    printf("4096: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));
    // --> should be ok
    //
    filter_prog[4095].code = 0x20; // nop
    filter_prog[4095].k = 0;
    filter.len = 4097;
    filter_prog[4096].code = 0x06; // ret
    filter_prog[4096].jt = 0;
    filter_prog[4096].jf = 0;
    filter_prog[4096].k = 0x7fff0000U; // allow
    
    long int ret = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);
    printf("4097: %ld (%s)\n", ret, strerror(errno));
    // --> should fail

    return 0;
}

And a demo of the per-path limit (((4096+4+4)*7)+4036+4 == 32768):

#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

struct sock_filter filter_prog[4097] = { 0 };

int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    // Initialize a filter with 4095 no-ops (0x20 is `ld [0]`, whose result is never used) and a ret
    for(size_t i = 0; i < 4095; i++) {
        filter_prog[i].code = 0x20; // nop
        filter_prog[i].jt = 0;
        filter_prog[i].jf = 0;
        filter_prog[i].k = 0;
    }
    filter_prog[4095].code = 0x06; // ret
    filter_prog[4095].jt = 0;
    filter_prog[4095].jf = 0;
    filter_prog[4095].k = 0x7fff0000U; // allow
    struct sock_fprog filter = { 4096, filter_prog };

    for(size_t i = 0; i < 7; i++) {
        printf("4096: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));
    }

    filter_prog[4035].code = 0x06; // ret
    filter_prog[4035].jt = 0;
    filter_prog[4035].jf = 0;
    filter_prog[4035].k = 0x7fff0000U; // allow
    filter.len = 4036;
    printf("4036: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));

    filter_prog[0].code = 0x06; // ret
    filter_prog[0].jt = 0;
    filter_prog[0].jf = 0;
    filter_prog[0].k = 0x7fff0000U; // allow
    filter.len = 1;
    printf("1: %ld\n", syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter));

    return 0;
}

Filter cache

When installing a filter, seccomp will attempt to determine which syscalls the filter always allows and cache that result. The cache is a bitmap where each bit records whether the filter always allows the corresponding syscall number. The caching works by emulating the BPF program for each syscall number it wants to cache; the bitmap starts out either allowing everything or as a copy of the bitmap from the previous filter in the tree.

For each syscall number, if the bitmap already says that the syscall cannot be cached, then it will skip it. Otherwise, the filter will be emulated, and if the code doesn’t find the filter always allows the syscall then it sets the bitmap entry to uncacheable.

The emulation only understands a few BPF instructions, but it will follow jumps, including conditional ones. The only arithmetic instruction it understands is bitwise AND on the A register. If the filter executes any instruction other than a jump, a return, a bitwise AND, or a load of the syscall number or arch into the A register, the emulator reports that the syscall cannot be cached.

This filter is cacheable (see here for ASM syntax):

ld [0] ; A = syscall nr
and #0xffff ; A &= 0xffff
jne #0x1, ok
ret #0x0 ; terminate
ok:
ret #0x7fff0000 ; allow

On the other hand, this one is not because it uses an instruction the emulator doesn’t know despite being constant:

ld [0] ; A = syscall nr
add #0x0 ; A += 0
jne #0x1, ok
ret #0x0 ; terminate
ok:
ret #0x7fff0000 ; allow

Running filters

Whenever a system call is triggered and seccomp filters are installed, unless PTRACE_O_SUSPEND_SECCOMP is set or strict mode is enabled, filters are run on the system call. The process is as follows:

Load argument data from registers as needed.
Check the cache to see if the filter should return allow
- If so, use that as the return value
- Else, run all of the filters, starting with the most recent (leaf node) and traversing up the tree. Compare the action part of each filter’s return value, as a signed number, against the current final return value; if it is lower, that filter’s return value becomes the new final value. The data part of the return value is not considered in this comparison. Keep note of which filter is associated with the final return (for the user notifications feature and logging).
Handle the final return action. (Note that this list is not in order of priority.)

Reading filters

A program can read the seccomp filters of a tracee by using the ptrace PTRACE_SECCOMP_GET_FILTER operation. The addr argument is the index of the filter to read, and the data argument is either a pointer to a buffer in which to store the filter or NULL. On success, the call returns the number of instructions in the filter, so calling it with NULL first gives the buffer size needed. Here is an example program dumping the filter at index 0 (the most recently installed one) of a given PID.

#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(int argc, char* argv[]) {
    if(argc < 2) exit(-1);
    pid_t pid = strtol(argv[1], NULL, 0);
    
    if(ptrace(PTRACE_ATTACH, pid, 0, 0) < 0) {
        printf("ptrace failed\n");
        printf("%s\n", strerror(errno));
        return -1;
    }
    if(waitpid(pid, NULL, 0) < 0) {
        printf("waitpid failed\n");
        printf("%s\n", strerror(errno));
        return -1;
    }
    int num_instructions = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0 /* index */, NULL);
    if(num_instructions <= 0) {
        /* handle error - ENOENT if idx >= number of filters */ 
        printf("Could not get filter #0\n");
        printf("%s\n", strerror(errno));
        return -1;
    }
    struct sock_filter* filter = calloc(num_instructions, sizeof(struct sock_filter));
    if(filter == NULL || ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, filter) < 0) {
        printf("Could not read filter #0\n");
        return -1;
    }
    /* now do something with the filter */
    for(int i = 0; i < num_instructions; i++) {
        printf("code %x jt %x jf %x k %x\n", filter[i].code, filter[i].jt, filter[i].jf, filter[i].k);
    }
    free(filter);
    ptrace(PTRACE_DETACH, pid, 0, 0); /* let the tracee continue */
    return 0;
}

Seccomp BPF

Any BPF assembly you see in this section is written using the syntax here to be assembled with bpf_asm.

As discussed, seccomp filters are classic BPF programs, but the return values (actions) of the program and inputs have special interpretations since the context is syscalls, not network packets.

Not every valid BPF program can be run in seccomp. Besides the limits on program size and the checks that ensure a program cannot DoS the kernel or leak memory, only some instructions can be used in seccomp specifically.

A classic BPF program has a word size of 32 bits and operates on two registers, an accumulator A and an index register X, plus 16 words of read-write memory and read-only packet data. It returns a single word-sized action, and the kernel ensures the program will always return before allowing it to be run.

Instructions have 4 fields: the 2-byte opcode (only lower 8 bits are used in practice), 1-byte jump targets jt and jf (condition true and condition false), and 4-byte immediate value k.

Allowed instructions

The seccomp implementation strictly restricts which BPF instructions can be used in a filter, using a verifier pass that only accepts an allowlist of instruction codes.

Here is a list of allowed instructions with syntax/descriptions:

Code	Syntax	Meaning
0x20	`ld [off]`	Load a word at byte offset `off` in the input. Interestingly, the seccomp code rewrites this instruction into an invalid instruction `ldx [off]`. This indicates to the JIT compiler to emit special code for loading the seccomp packet data.
0x80	`ld len`	Load the length of the arguments (`sizeof(struct seccomp_data)`) into the `A` register. Rewritten into `ld #0x40`.
0x81	`ldx len`	Load the length of the arguments (`sizeof(struct seccomp_data)`) into the `X` register. Rewritten into `ldx #0x40`.
0x06	`ret #val`	Return immediate value `val`.
0x16	`ret A`	Return the value in register `A`.
0x04	`add #val`	Add immediate value `val` to register `A`.
0x0C	`add X`	Add the register `X` to register `A`.
0x14	`sub #val`	Subtract immediate value `val` from register `A`.
0x1C	`sub X`	Subtract the register `X` from register `A`.
0x24	`mul #val`	Multiply register `A` by immediate value `val`.
0x2C	`mul X`	Multiply the register `A` by register `X`.
0x34	`div #val`	Divide register `A` by immediate value `val`.
0x3C	`div X`	Divide the register `A` by register `X`.
0x44	`or #val`	Bitwise OR the register `A` by immediate value `val`.
0x4C	`or X`	Bitwise OR the register `A` by register `X`.
0x54	`and #val`	Bitwise AND register `A` by immediate value `val`.
0x5C	`and X`	Bitwise AND the register `A` by register `X`.
0x64	`lsh #val`	Bitwise left shift register `A` by immediate value `val`.
0x6C	`lsh X`	Bitwise left shift the register `A` by register `X`.
0x74	`rsh #val`	Bitwise right shift register `A` by immediate value `val`.
0x7C	`rsh X`	Bitwise right shift the register `A` by register `X`.
0x84	`neg`	Negate the value in register `A` (`A = -A`, in 32-bit two's complement).
0xA4	`xor #val`	Bitwise XOR register `A` by immediate value `val`.
0xAC	`xor X`	Bitwise XOR the register `A` by register `X`.
0x00	`ld #val`	Load immediate value `val` into register `A`.
0x01	`ldx #val`	Load immediate value `val` into register `X`.
0x07	`tax`	Copy register `A` into register `X`.
0x87	`txa`	Copy register `X` into register `A`.
0x60	`ld M[off]`	Load memory at immediate address `off` into register `A`.
0x61	`ldx M[off]`	Load memory at immediate address `off` into register `X`.
0x02	`st M[off]`	Store register `A` into memory at immediate address `off`.
0x03	`stx M[off]`	Store register `X` into memory at immediate address `off`.
0x05	`ja label`	Jump to label `label` with address in the `k` value (allowing for far jumps).
0x15	`jeq #val, label`	Jump to label `label` if register `A` is equal to immediate value `val`.
0x1d	`jeq %x, label`	Jump to label `label` if register `A` is equal to register `X`.
0x25	`jgt #val, label`	Jump to label `label` if register `A` is greater than immediate value `val`.
0x2d	`jgt %x, label`	Jump to label `label` if register `A` is greater than register `X`.
0x35	`jge #val, label`	Jump to label `label` if register `A` is greater than or equal to immediate value `val`.
0x3d	`jge %x, label`	Jump to label `label` if register `A` is greater than or equal to register `X`.
0x45	`jset #val, label`	Jump to label `label` if register `A` bitwise AND immediate value `val` != 0.
0x4d	`jset %x, label`	Jump to label `label` if register `A` bitwise AND `X` != 0.

Jump instructions above are listed without a jf field, and are the same as having a jf of 0 targeting the normal next instruction. The syntax for having a false label for supported instructions is: jeq cond, true, false.

Some jump instructions in the assembly syntax, jneq, jne, jlt, jle, are compiled to the inverse condition with jf set instead of jt and an implicit jt of 0.

Verification constraints

There is code in the kernel that ensures BPF code matches tight constraints, because it is JIT-ed in the kernel and runs there. The way the BPF verifier works is complex and outside the scope of this article as plenty have already been written about it. The kernel docs also give a good description of it. For writing seccomp rules, there are a few things that one should keep in mind:

Loops are disallowed: classic BPF jump offsets are unsigned, so jumps can only go forward
Memory cannot be read uninitialized
There cannot be unreachable code

Memory

BPF programs have access to 16 32-bit words of memory addressed by M[#]. As above, the verifier makes sure that memory cannot be loaded before being written to at least once.

Cannot load because of verifier:

ld M[0] ; not stored so invalid
ret #0

Out of bounds:

ld #0
st M[16] ; max is 15
ret #0

Storing A/X without explicitly loading them first is okay, since they are initialized by the prologue:

st M[0]
stx M[1]
ret #0

And this is valid as you might expect:

ld #0
st M[0] ; store properly done first
ld M[0]
ret #0

Arguments

The filter has access to 16 32-bit words of read-only data (enforced because BPF doesn’t even have store instructions for packet data). The data is byte-addressed but can only be read at offsets aligned to 32-bit boundaries. Interestingly, you can only read this data into the A register. The layout is that of struct seccomp_data.

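For posterity, from the view of the BPF program, the arguments are as follows (the lower/higher halves are shown in little-endian order; on a big-endian machine the two halves of each 64-bit value are swapped):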

Offset	Value
0	Syscall number
4	Arch dependent value
8	Instruction pointer, lower 32 bits
12	Instruction pointer, higher 32 bits
16	Syscall argument 1, lower 32 bits
20	Syscall argument 1, higher 32 bits
24	Syscall argument 2, lower 32 bits
28	Syscall argument 2, higher 32 bits
32	Syscall argument 3, lower 32 bits
36	Syscall argument 3, higher 32 bits
40	Syscall argument 4, lower 32 bits
44	Syscall argument 4, higher 32 bits
48	Syscall argument 5, lower 32 bits
52	Syscall argument 5, higher 32 bits
56	Syscall argument 6, lower 32 bits
60	Syscall argument 6, higher 32 bits

Actions

The kernel will take different actions based on the result of running the filters. As mentioned above in the Filters section, the result that takes effect is the lowest action value returned by any filter (compared as a signed number), together with the data value from the most recently installed filter that returned that action. For example, if several filters return SECCOMP_RET_ERRNO with different error numbers, only the most recent filter’s error number is actually used.

There is an accurate list of actions in the kernel docs, so I will not cover them here. One interesting behaviour is that returning SECCOMP_RET_ERRNO with a data of 0 causes the syscall to return 0 silently, otherwise -1 as one might expect with an error.

#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

struct sock_filter filter_prog1[] = {
    { 0x20,  0,  0, 0000000000 }, // A = syscall nr
    { 0x15,  0,  1, 0x00001000 }, // if A != 0x1000, jump 2
    { 0x06,  0,  0, 0x00050000 }, // action = errno = 0
    { 0x06,  0,  0, 0x7fff0000 }, // action = allow
};

struct sock_filter filter_prog2[] = {
    { 0x20,  0,  0, 0000000000 }, // A = syscall nr
    { 0x15,  0,  1, 0x00001001 }, // if A != 0x1001, jump 2
    { 0x06,  0,  0, 0x00050001 }, // action = errno = 1
    { 0x06,  0,  0, 0x7fff0000 }, // action = allow
};

int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_fprog filter = { 4, filter_prog1 };
    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);
    filter.filter = filter_prog2;
    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);

    printf("errno=0: %ld\n", syscall(0x1000));
    printf("errno=1: %ld\n", syscall(0x1001));

    return 0;
}

In addition, a return value of SECCOMP_RET_TRACE lets the tracer change the syscall that is about to run. If this happens, the filters are run one more time, but they cannot trigger another (recursive) trace event.

vsyscalls

vsyscalls are a (relatively obsolete) kernel feature that ran certain syscalls in userspace. The only thing about them that particularly matters for seccomp nowadays is that emulated vsyscalls are handled somewhat separately from normal syscalls, and a tracer cannot change the syscall number (if it does, the kernel kills the process with SIGSYS, which is better than continuing silently). vDSO, the more modern equivalent, does not do this.

Special actions

There are some actions that have more complex behaviour than killing or allowing a system call to run.

Tracing

SECCOMP_RET_TRACE sends a ptrace event PTRACE_EVENT_SECCOMP to a tracer. If no tracer exists, the syscall fails with ENOSYS. One can then change the registers and a new syscall can be triggered instead, or the arguments can be changed.

User Notifications

SECCOMP_RET_USER_NOTIF causes seccomp to send a message to userspace over a file descriptor and wait for a response. The userspace handler can either cause the syscall to be dropped with an error number, or continue as normal. The userspace handler can use ioctls to read and reply to messages. When using this feature, note that trying to filter syscalls by dereferencing argument pointers is dangerous, as a program can change the value located at that data between the time it is read by the handler and by the syscall.

Here is an example of a C program using this to log syscalls matching a certain syscall number (note that you need seccomp.h here since userspace headers may be too old to compile this):

#include <unistd.h>
#include "seccomp.h"
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <pthread.h>

#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
#endif

struct sock_filter filter_prog[] = {
    { 0x20,  0,  0, 0000000000}, // A = syscall nr
    { 0x15,  0,  1, 0x0000beef}, // jump 2 if A != 0xbeef
    { 0x06,  0,  0, 0x7fc0dead}, // action = notif
    { 0x06,  0,  0, 0x7fff0000}, // action = allow
};

void* watcher_f(void* arg) {
    int fd = (int)(long)arg; // the fd was smuggled through the void* argument
    struct seccomp_notif req = {};
    struct seccomp_notif_resp resp = {};
    while(1) {
        if(ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
            return 0;
        }

        printf("received a syscall! arg[0] = %llx\n", req.data.args[0]);

        resp.id = req.id;
        resp.flags = 0;
        resp.error = -1;
        resp.val = -1;

        if(ioctl(fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0) {
            printf("resp ioctl failed\n");
            return 0;
        }
    }
}

int main(int argc, char* argv[]) {
    pthread_t watcher;

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_fprog filter = { 4, filter_prog };
    int fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &filter);
    pthread_create(&watcher, NULL, watcher_f, (void*)(long)fd);

    syscall(0xbeef, 0x12345678, 1, 2, 3, 4, 5);

    unsigned long ret;
    pthread_join(watcher, (void**)(&ret));

    return 0;
}

One of the more interesting features in this system is that using the SECCOMP_IOCTL_NOTIF_ADDFD ioctl, the user notification handler can add a new file descriptor into the file descriptor table of the seccomp-ed thread, by duplicating one of the handler’s.

Logs

Returning SECCOMP_RET_LOG causes a syscall to be logged to the audit log and then executed. Here is an example C program showing what that looks like:

#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stdio.h>

struct sock_filter filter_prog[] = {
    { 0x06,  0,  0, 0x7ffcdead }, // action = log
};

int main(int argc, char* argv[]) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_fprog filter = { 1, filter_prog };
    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &filter);

    syscall(0xbeef, 0, 1, 2, 3, 4, 5);
    printf("Logged syscall 0xbeef...\n");

    return 0;
}

And in the logs:

[203837.418474] audit: type=1326 audit(1673903108.170:3112): auid=1000 uid=1000 gid=1000 ses=2 pid=1226 comm="log" exe="/path/to/exe" sig=0 arch=c000003e syscall=48879 compat=0 ip=0x7f86a4ec9539 code=0x7ffc0000

Note that arguments and the data part of the return are not logged anywhere.

Tools

Here are some useful tools to work with seccomp programs.

bpf_asm

An assembler provided in the kernel source tree. It translates the syntax as given in the kernel documentation into either a raw string of numbers or a C initializer list.

bpf_dbg

A BPF debugger that supports breakpoints and, in theory, mimics the kernel. It is designed for debugging with network packets and not syscall arguments, though.

seccomp-tools

A favorite of CTF players, seccomp-tools can dump, disassemble, and emulate seccomp BPF filters. It can disassemble raw 8-byte seccomp instructions, and assemble with its own (pretty friendly) syntax. It dumps rules by using ptrace after executing the program. You need to specify the -l flag to have it dump more than one filter.