barco: Linux Containers From Scratch in C.
- Luca Cavallin
barco is a project I worked on to learn more about Linux containers and the Linux kernel. It's a simple implementation of a container runtime in C, which I wrote from scratch (based on other guides on the Internet) using just C, libseccomp for seccomp filters, libcap for container capabilities, libcunit1 for unit tests with CUnit, argtable for handling the CLI, and another third-party library for logging. It's not meant to be used in production, but rather as a learning tool.
Linux containers are made up of a set of Linux kernel features:
- namespaces: used to group kernel objects into different sets that can be accessed by specific process trees. There are different types of namespaces; for example, the PID namespace is used to isolate the process tree, while the network namespace is used to isolate the network stack.
- seccomp: a Linux kernel mechanism used to limit the system calls that a process can make (handled via syscalls)
- capabilities: a Linux kernel mechanism used to set limits on what the root user can do in the container (handled via syscalls)
- cgroups: used to limit the resources (e.g. memory, disk I/O, CPU time) that a process can use (handled via cgroupfs)
barco uses all of these features to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work.
barco can be used to run /bin/sh from the / directory as root (-u 0) with the following command (add the optional -v flag for verbose output):
$ sudo ./bin/barco -u 0 -m / -c /bin/sh -a . [-v]
22:08:41 INFO ./src/barco.c:96: initializing socket pair...
22:08:41 INFO ./src/barco.c:103: setting socket flags...
22:08:41 INFO ./src/barco.c:112: initializing container stack...
22:08:41 INFO ./src/barco.c:120: initializing container...
22:08:41 INFO ./src/barco.c:131: initializing cgroups...
22:08:41 INFO ./src/cgroups.c:73: setting memory.max to 1G...
22:08:41 INFO ./src/cgroups.c:73: setting cpu.weight to 256...
22:08:41 INFO ./src/cgroups.c:73: setting pids.max to 64...
22:08:41 INFO ./src/cgroups.c:73: setting cgroup.procs to 1458...
22:08:41 INFO ./src/barco.c:139: configuring user namespace...
22:08:41 INFO ./src/barco.c:147: waiting for container to exit...
22:08:41 INFO ./src/container.c:43: ### BARCONTAINER STARTING - type 'exit' to quit ###
# ls
bin home lib32 media root sys vmlinuz
boot initrd.img lib64 mnt run tmp vmlinuz.old
dev initrd.img.old libx32 opt sbin usr
etc lib lost+found proc srv var
# echo "i am a container"
i am a container
# exit
22:08:55 INFO ./src/barco.c:153: freeing resources...
22:08:55 INFO ./src/barco.c:168: so long and thanks for all the fish
Development
If you want to build and run barco from source, you can do so by cloning the repository and following the instructions in the README file. In short, barco provides configuration for development on GitHub Codespaces using Visual Studio Code, and the included Makefile can be used to build and run the project as follows:
$ sudo apt install -y make
$ make build
There are also other targets in the Makefile that can be used to run the unit tests, build the documentation, run the formatter and the linter, etc. (most of these tools come with the Clang toolchain). Furthermore, while working on barco I investigated best practices for structuring C projects and adopted the following layout:
├── .devcontainer - configuration for GitHub Codespaces
├── .github - configuration for GitHub Actions and other GitHub features
├── .vscode - configuration for Visual Studio Code
├── bin - the executable (created by make)
├── build - intermediate build files e.g. *.o (created by make)
├── docs - documentation
├── include - header files
├── lib - third-party libraries
├── scripts - scripts for setup and other tasks
├── src - C source files
│ ├── barco.c - (main) Entry point for the CLI
│ └── *.c
├── tests - contains tests
├── .clang-format - configuration for the formatter
├── .clang-tidy - configuration for the linter
├── .gitignore
├── LICENSE
├── Makefile
└── README.md
How does it work?
The barco executable is the entry point for the CLI. It's responsible for parsing the CLI arguments, setting up the container and running the container process. It all starts at barco.c, where I used argtable to parse the CLI arguments and where the container and other resources are set up, started and ultimately cleaned up. The first two steps towards running a container are creating a pair of sockets (to communicate with the container process) and initializing the container stack (the memory that the container process will use as its stack).
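The socket pair step can be sketched roughly as follows. This is a minimal illustration, not barco's actual code: the helper name channel_init, the choice of AF_UNIX with SOCK_SEQPACKET, and the FD_CLOEXEC flag are all assumptions made for the example.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not barco's code): create the parent/child channel.
 * sockets[0] is meant for the parent, sockets[1] for the container. */
int channel_init(int sockets[2]) {
    /* SOCK_SEQPACKET keeps message boundaries, handy for small requests. */
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sockets) != 0)
        return -1;
    /* Assumption: close the parent's end automatically on exec, so the
     * program started inside the container does not inherit it. */
    if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC) != 0) {
        close(sockets[0]);
        close(sockets[1]);
        return -1;
    }
    return 0;
}
```

Both ends stay open across clone, so each process keeps using its own end and closes it when done.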
The Call to container_init
After the initial setup, the container_init function, defined in container.c, is called to start the container process. The function is relatively simple: all it does is call the clone system call with a function to run (container_start), the stack configuration, and the appropriate flags to create the container process. The flags CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS place the new process in fresh mount, cgroup, PID, IPC, network and UTS namespaces, which allows some control over mounts, cgroups, pids, IPC data structures, network devices and the hostname.
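A call to clone along these lines might look like the sketch below. It is not barco's container_init: the names spawn_child, demo_child and CONTAINER_NS_FLAGS and the 1 MiB stack size are hypothetical. Creating the namespaces listed above requires root, so the namespace flags are a parameter and can be left at 0 to try the sketch unprivileged.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024) /* assumption: 1 MiB child stack */

/* The namespace flags named in the article. */
#define CONTAINER_NS_FLAGS                                        \
    (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | \
     CLONE_NEWNET | CLONE_NEWUTS)

/* Stand-in for container_start: returns the exit code passed in. */
int demo_child(void *arg) {
    return *(int *)arg;
}

/* Hypothetical helper: run fn(arg) in a child created with clone.
 * On success returns the child's PID and stores the stack in *stack_out
 * so the caller can free it once the child has exited. */
pid_t spawn_child(int (*fn)(void *), void *arg, int ns_flags, char **stack_out) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;
    /* The stack grows downwards on most architectures, so clone is given
     * a pointer to the top of the allocation. SIGCHLD lets the parent
     * wait for the child with waitpid. */
    pid_t pid = clone(fn, stack + STACK_SIZE, ns_flags | SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    *stack_out = stack;
    return pid;
}
```

Passing CONTAINER_NS_FLAGS as ns_flags (as root) would give the child the new namespaces; passing 0 creates an ordinary child process.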
The container_start Function
The resulting process is a child of the barco process, and it's the one that will run the container. The container_start function is the entry point for the container process, and it's defined in container.c. It's responsible for setting up the container, and it does so by:
- setting the hostname
- setting the root directory (mount namespace) to the one specified by the user (via the -m flag)
- setting the user namespace
- setting capabilities and seccomp filters (for security)
The Mount Namespace
The mount namespace is used to isolate the filesystem mount points seen by the container process. The namespace itself was already created by the CLONE_NEWNS flag passed to clone; the mount_set function is responsible for setting the root directory for the container process inside it. It does so by calling the mount system call with the appropriate flags (MS_BIND | MS_REC | MS_PRIVATE) to keep mount changes private to the container and bind-mount the directory specified by the user as the new root. The result is that the container process sees the directory specified by the user as its root directory, and it is able to access the files in that directory and its subdirectories.
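One way to perform this kind of root switch is sketched below. It is not barco's mount_set: the helper name set_root is hypothetical, and the use of pivot_root followed by a lazy unmount is an assumption (the article only mentions the mount flags). It has to run as root inside a new mount namespace.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper, not barco's mount_set. Must run as root inside a
 * fresh mount namespace (CLONE_NEWNS). Returns 0 on success, -1 on error. */
int set_root(const char *new_root) {
    struct stat st;
    /* Refuse anything that is not an existing directory before mounting. */
    if (stat(new_root, &st) != 0 || !S_ISDIR(st.st_mode))
        return -1;
    /* Make every mount private so changes do not propagate to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -1;
    /* Bind-mount the directory onto itself to turn it into a mount point. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0)
        return -1;
    /* Assumption: switch roots with pivot_root(".", "."), which stacks the
     * old root on top of the new one, then detach the old root. */
    if (chdir(new_root) != 0 || syscall(SYS_pivot_root, ".", ".") != 0)
        return -1;
    if (umount2(".", MNT_DETACH) != 0)
        return -1;
    return chdir("/");
}
```

The directory check comes first so that an invalid path fails before any mount is touched.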
The User Namespace
The user namespace is used to isolate the user and group IDs seen by the container process. The user_namespace_init function is responsible for setting the user namespace, and it does so by calling the unshare system call with the appropriate flag (CLONE_NEWUSER) to create a new user namespace. user_namespace_init relies on user_namespace_prepare_mappings, a function called in barco.c and used by the parent process (barco) to wait for the child process (the container) to request that its uid and gid be set, before updating the uid_map and gid_map for the container. The uid_map and gid_map files (found under /proc/<pid>/) are a Linux kernel mechanism for mapping uids and gids between a user namespace and its parent namespace. The result is that the container process sees the user and group IDs specified by the user (via the -u flag).
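Writing such a mapping could look like the sketch below. The helper names format_id_map and write_id_map are hypothetical, and the example IDs in the usage note are made up for illustration; barco's own user_namespace_prepare_mappings may differ.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers, not barco's user_namespace_prepare_mappings. */

/* Build one map line: "<id inside ns> <id outside ns> <count>\n".
 * Returns the line length, or -1 if buf is too small. */
int format_id_map(char *buf, size_t len, unsigned inside, unsigned outside,
                  unsigned count) {
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write the line to /proc/<pid>/<file>, where file is "uid_map" or
 * "gid_map". A map can only be written once, so it is sent with a
 * single write(). */
int write_id_map(pid_t pid, const char *file, const char *line) {
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%d/%s", (int)pid, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(line);
    ssize_t written = write(fd, line, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}
```

For example, the line "0 10000 2000" would make IDs 0 to 1999 inside the container correspond to IDs 10000 to 11999 outside it; the parent would write it to both uid_map and gid_map of the container's PID.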
Capabilities and Seccomp Filters
The container process is running as root (uid 0), but it's not a full root user: it's a root user with limited capabilities and seccomp filters. The sec_set_caps function uses libcap to set the capabilities for the container process, and it does so by calling the cap_set_proc function with the appropriate flags (e.g. CAP_MAC_OVERRIDE, CAP_MKNOD, CAP_SETFCAP, CAP_SYSLOG...). As a result, the container process can perform only the privileged actions that its remaining capabilities allow.
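To give a feel for how capabilities are restricted, here is a small sketch that uses prctl directly instead of libcap, which is a deliberate substitution to keep the example free of external libraries. It only touches the capability bounding set, whereas libcap's cap_set_proc works on the effective, permitted and inheritable sets. The helper names are hypothetical.

```c
#include <linux/capability.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Hypothetical helpers; barco itself uses libcap, not raw prctl calls. */

/* Returns 1 if cap is in the bounding set, 0 if not, -1 on error. */
int cap_in_bounding_set(int cap) {
    return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}

/* Remove each capability from the bounding set. This needs CAP_SETPCAP
 * and cannot be undone for the calling process and its descendants. */
int drop_bounding_caps(const int *caps, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (prctl(PR_CAPBSET_DROP, caps[i], 0, 0, 0) != 0)
            return -1;
    }
    return 0;
}
```

A caller could pass an array such as { CAP_MKNOD, CAP_SYSLOG } to drop_bounding_caps before starting the container's program.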
Seccomp filters, on the other hand, are used to limit the system calls that a process (in my case, the container) can make. The sec_set_seccomp function blocks sensitive system calls based on Docker's default seccomp profile, as well as other obsolete or dangerous system calls. It does so by calling the seccomp_rule_add function with the appropriate parameters to specify how calls to a system call should be handled. Let's use the following as an example:
seccomp_rule_add(ctx, SEC_SCMP_FAIL, SCMP_SYS(fchmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
This instruction tells the Linux kernel to block calls to fchmod if the S_ISGID bit is set in the mode parameter. The mode parameter is the second parameter of the fchmod system call (hence SCMP_A1, since arguments are numbered from zero), and it's used to set the permissions of a file. The S_ISGID bit is the setgid bit, which sets the group ID on execution. SCMP_CMP_MASKED_EQ compares the argument after applying a mask, so the rule matches whenever (mode & S_ISGID) == S_ISGID, regardless of the other permission bits. For example, a mode of S_ISGID | S_IRWXU | S_IRWXG | S_IRWXO matches and the call is blocked, while a mode of S_IRWXU alone does not match and the call is allowed. This is a good example of how seccomp filters can be used to block dangerous uses of a system call without blocking the system call entirely.
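The masked comparison can be modelled in plain C to see which modes the rule would catch. This is only an illustration of the comparison, not libseccomp code, and the function names are hypothetical.

```c
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Plain-C model of SCMP_CMP_MASKED_EQ: match when (arg & mask) == datum. */
int masked_eq(uint64_t arg, uint64_t mask, uint64_t datum) {
    return (arg & mask) == datum;
}

/* Would the fchmod rule above match this mode (and so block the call)? */
int fchmod_rule_matches(mode_t mode) {
    return masked_eq(mode, S_ISGID, S_ISGID);
}
```

With this model, fchmod_rule_matches(S_ISGID | 0755) is non-zero, while fchmod_rule_matches(0644) and fchmod_rule_matches(S_ISUID | 0755) are zero, because only the setgid bit is inspected.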
Cgroups
As the container is starting, the parent process (barco) sets up cgroups for the child process. The cgroups_init function is responsible for setting up cgroups (version 2), and it does so by creating the /sys/fs/cgroup/barcontainer directory (barcontainer is the hostname for the container) for the new cgroup and then writing cgroups settings to the appropriate files in the cgroupfs. The cgroups settings are:
- memory.max: the maximum amount of memory that the container process can use
- cpu.weight: the relative weight of the container process compared to other processes
- pids.max: the maximum number of processes that the container process can spawn
By setting these values, the container process will be limited in the amount of memory, CPU time and processes that it can use.
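Since cgroup v2 is configured through ordinary file operations, the setup can be sketched with mkdir and write. The helper names cgroup_create and cgroup_write are hypothetical and this is not barco's cgroups_init; the directory is a parameter so the sketch can also be tried against a scratch directory.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helpers, not barco's cgroups_init. */

/* Create the cgroup directory, e.g. /sys/fs/cgroup/barcontainer. */
int cgroup_create(const char *cgroup_dir) {
    return mkdir(cgroup_dir, 0755);
}

/* Write one setting, e.g. ("memory.max", "1G"), with a single write(). */
int cgroup_write(const char *cgroup_dir, const char *file, const char *value) {
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    /* cgroupfs creates these files itself; O_CREAT only lets the sketch
     * be tried against an ordinary scratch directory. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}
```

Using the values from the log output shown earlier, the calls would be cgroup_write(dir, "memory.max", "1G"), cgroup_write(dir, "cpu.weight", "256") and cgroup_write(dir, "pids.max", "64"), followed by writing the container's PID to cgroup.procs.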
Waiting for the Container to Exit
The container is set up, and it's now running. The parent process (barco) waits for the container process to exit by calling the waitpid system call with the PID (process ID) of the container. The container I used as an example in the README.md file is running /bin/sh and waiting for user input. The user can type commands, and the container process will execute them. For example, the user can type ls and the container process will execute it, listing the contents of the current directory. The container process will continue to run until the user types exit, which will cause the container process to exit. The parent process (barco) will then clean up and exit as well.
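The waiting step can be sketched as a small wrapper around waitpid. The name wait_for_exit and the convention of returning 128 plus the signal number are assumptions for the example, not barco's code.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block until the child exits and return its exit
 * code, 128 + signal number if it was killed by a signal, or -1 on error. */
int wait_for_exit(pid_t pid) {
    int status = 0;
    /* Retry if the wait is interrupted by a signal handler. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

This works for children created with clone as long as SIGCHLD was given as the termination signal.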
Cleaning up
As the container process exits, the parent process (barco) exits too, but first it's responsible for cleaning up the resources used by the container process. Starting at the cleanup label in barco.c, the parent process:
- frees the container stack
- closes the sockets used to communicate with the container process
- removes the cgroup it created
- frees the data structures used by argtable
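The single cleanup label is a common C idiom, and the steps above could be arranged roughly like this. The function name run_with_cleanup, the stack size and the overall layout are assumptions, not a copy of barco.c, and the argtable step is left out to keep the sketch dependency-free.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumption: 1 MiB container stack */

/* Hypothetical outline of a run that funnels every exit path through
 * one cleanup label. Returns 0 on success, -1 on any failure. */
int run_with_cleanup(const char *cgroup_dir) {
    int rc = -1;
    int sockets[2] = {-1, -1};
    char *stack = NULL;
    int cgroup_created = 0;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sockets) != 0)
        goto cleanup;
    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        goto cleanup;
    if (mkdir(cgroup_dir, 0755) != 0)
        goto cleanup;
    cgroup_created = 1;

    /* ... start the container and wait for it to exit here ... */
    rc = 0;

cleanup:
    /* Each step is safe even if the matching setup step never ran. */
    free(stack);
    if (sockets[0] >= 0)
        close(sockets[0]);
    if (sockets[1] >= 0)
        close(sockets[1]);
    if (cgroup_created && rmdir(cgroup_dir) != 0)
        rc = -1;
    return rc;
}
```

Initializing every resource to a "not acquired" value up front is what makes it possible to jump to the label from any point.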
Summary
Linux containers are made up of a set of Linux kernel features, and barco uses them to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work. I learned a lot while working on this project, and I hope that this post can be useful to others who want to learn more about Linux containers and the Linux kernel. If you have any questions, feedback or improvements you'd like to make, feel free to reach out to me directly or on the barco repository!