Building a Custom Container System from Scratch


Containerization has revolutionized how we build, package, and deploy applications. Platforms like Docker and Kubernetes have made containers mainstream, but have you ever wondered what it takes to build your own container system from scratch?

In this post, we'll explore the fundamental building blocks of containers and walk through implementing them in Linux, without relying on any existing container software. By the end, you'll have a deeper understanding of how tools like Docker work under the hood and gain the ability to build your own simple DIY container system.

Let's jump in and start from the ground up, using nothing more than what the Linux kernel provides.

The Building Blocks of Containers

While Docker and other platforms provide user-friendly tools and APIs for managing containers, at their core containers are made possible by a handful of key Linux kernel primitives:

  • Namespaces provide isolation between containers and the host
  • Control groups (cgroups) limit and account for resource usage
  • A union filesystem allows building container images as layers

Let's look at each one and how to leverage them directly.

Linux Namespaces

Namespaces are a feature of the Linux kernel that partitions kernel resources so that different sets of processes each see their own view of those resources. Modern kernels implement several kinds of namespaces; the seven most relevant for containers are:

  1. Mount (mnt) – Isolates filesystem mount points, giving each container its own rootfs
  2. Process ID (pid) – Isolates the process ID number space, letting containers have their own init process and process tree
  3. Network (net) – Isolates the network stack, giving each container its own IP, interfaces, ports, etc.
  4. Interprocess Communication (ipc) – Isolates System V IPC objects and POSIX message queues between containers
  5. UTS – Isolates host and domain names
  6. User ID (user) – Isolates user and group ID number spaces
  7. Control group (cgroup) – Isolates cgroup root directory

With namespaces, processes running inside a container only see the resources assigned to that container – they get their own "view" of the system. This is the foundation of container isolation.

To put a process in a namespace, you can use the unshare or nsenter commands. Here we create a new process in its own PID namespace:


$ sudo unshare --pid --fork bash
$ echo $$
1

Inside the namespace the process has PID 1, but if we look at it from outside, it has a different PID on the host:

  
$ ps aux | grep 'bash'
user    4029  0.0  0.0 11460 1504 pts/0    S    17:21   0:00 bash

Each namespace is exposed as a symbolic link under /proc/[pid]/ns/. Two processes that share a namespace will see the same inode number on the corresponding link. From the host, we can inspect the PID namespace of the container shell (PID 4029 above):


$ ls -l /proc/4029/ns/pid
lrwxrwxrwx 1 root root 0 Mar  9 17:21 /proc/4029/ns/pid -> pid:[4026533887]

Compare that to the PID namespace for the shell outside the container:


$ ls -l /proc/$$/ns/pid  
lrwxrwxrwx 1 root root 0 Mar  9 17:45 /proc/3926/ns/pid -> pid:[4026531836]

The inode numbers differ, confirming these processes are in separate namespaces. This is how Docker isolates containers – each container is just a set of namespaces with some isolated processes inside.
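
You can check this pairing programmatically as well. Here is a short Python sketch (assuming a Linux host with /proc mounted) that compares the PID namespace of the current process against its parent by stat-ing the namespace links:

```python
import os

# On Linux, /proc/[pid]/ns/pid is a symlink whose inode number identifies
# the PID namespace. Two processes in the same namespace see the same inode.
def pid_ns_inode(pid):
    return os.stat(f"/proc/{pid}/ns/pid").st_ino

mine = pid_ns_inode("self")
parents = pid_ns_inode(os.getppid())

# A process normally shares its parent's PID namespace, so these match
# unless the parent deliberately placed us in a new namespace.
print(mine == parents)
```

Run the same comparison against a process started under `unshare --pid` and the inodes will differ, just as the `ls -l` output above shows.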

Control Groups (Cgroups)

While namespaces provide isolation, control groups or cgroups are used to implement resource limiting. Cgroups limit the amount of resources like CPU, memory, disk I/O, and network bandwidth that a process or container can use.

Cgroups form a hierarchy, with each group represented by a directory under /sys/fs/cgroup. Inside each group directory are pseudo-files that let you configure limits and view current usage.

For example, to limit a group to 1GB of memory (these paths use the older cgroup v1 interface; on a cgroup v2 system the equivalent files are memory.max and cgroup.procs):


$ cd /sys/fs/cgroup/memory 
$ mkdir testgroup
$ echo 1G > testgroup/memory.limit_in_bytes

Then to add a process to this group:

  
$ echo 12345 > testgroup/tasks

Now the process with PID 12345 is limited to 1GB RAM. We can see its current usage:


$ cat testgroup/memory.usage_in_bytes
1216512

Docker uses cgroups extensively to enforce the resource constraints you specify when running a container. This is why a Docker container only uses the memory and CPU you allow it.
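
You can see which cgroups any process belongs to, without root, by reading /proc/[pid]/cgroup. A small Python sketch (assuming a Linux host; the exact lines differ between cgroup v1 and v2):

```python
# Each line of /proc/self/cgroup has the form "hierarchy-id:controllers:path".
# On a pure cgroup v2 system there is a single line of the form "0::/...".
with open("/proc/self/cgroup") as f:
    entries = [line.strip().split(":", 2) for line in f if line.strip()]

for hierarchy_id, controllers, path in entries:
    print(f"hierarchy={hierarchy_id} controllers={controllers or '(v2)'} path={path}")
```

Inspecting this file for a containerized process shows the per-container group Docker created for it.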

Union Filesystems

The final piece is the union filesystem, which allows Docker to efficiently store and build container images using layers.

A union mount lets you overlay multiple filesystems or directories into a single coherent file tree. With Docker, each layer of a container image is stored as a separate directory, and these are union mounted into a single rootfs when you start the container.

Typically Docker uses OverlayFS for its union filesystem. To create your own overlay mount of two directories:


$ mkdir dir1 dir2 merged
$ echo "from dir1" > dir1/foo  
$ echo "from dir2" > dir2/bar

$ mkdir work
$ mount -t overlay overlay -o lowerdir=dir1,upperdir=dir2,workdir=work merged
$ tree merged
merged
├── bar
└── foo

Now the merged directory contains the union of dir1 and dir2, with files from dir2 taking precedence. Docker uses this to build images efficiently – each layer only stores its own changes, and the overlay produces the final image.
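
The lookup rule is easy to model. This toy Python sketch (my own illustration, not real OverlayFS) resolves a file from the upper layer first and falls back to the lower layer, mirroring the precedence shown above:

```python
import tempfile
from pathlib import Path

def union_lookup(name, upper, lower):
    # Check the upper (writable) layer first, then fall back to the lower
    # (read-only) layer -- the same precedence OverlayFS applies.
    for layer in (upper, lower):
        candidate = layer / name
        if candidate.exists():
            return candidate.read_text()
    raise FileNotFoundError(name)

root = Path(tempfile.mkdtemp())
lower, upper = root / "dir1", root / "dir2"
lower.mkdir(); upper.mkdir()
(lower / "foo").write_text("from dir1")
(lower / "bar").write_text("lower copy")   # shadowed by the upper layer
(upper / "bar").write_text("from dir2")

print(union_lookup("foo", upper, lower))  # falls through to the lower layer
print(union_lookup("bar", upper, lower))  # the upper layer wins
```

Real OverlayFS adds copy-up on write and whiteout files for deletions, but the lookup precedence is the heart of how image layers compose.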

Putting It All Together

With namespaces, cgroups, and a union filesystem, we have the key ingredients to build a container system. Here are the high-level steps:

  1. Create the namespaces for the container with unshare
  2. Set up the rootfs for the container by union mounting the image layers
  3. Create and configure the cgroups to enforce resource limits
  4. Start the container process in the namespaces

Here's a simplified version in code:


child_pid = clone(CLONE_NEWPID | CLONE_NEWNS)
if child_pid == 0:
    # Child: overlay-mount the rootfs from the image layers
    mount("overlay", root_dir, "overlay", MS_NODEV,
          "lowerdir=images/img1,upperdir=images/img2,workdir=images/work")
    # Create the cgroups and add ourselves to them
    # (writing PID 0 to a cgroup's tasks file means "the current process")
    cg_memory = create_cgroup("memory", container_id)
    cg_cpu = create_cgroup("cpu", container_id)
    set_memory_limit(cg_memory, mem_limit)
    set_cpu_shares(cg_cpu, cpu_shares)
    add_task_to_cgroup(cg_memory, 0)
    add_task_to_cgroup(cg_cpu, 0)
    # Finally, replace ourselves with the container's entry point
    exec(["./container_cmd"])

The clone() system call creates the child in new PID and mount namespaces; the child then union mounts the container rootfs, creates and configures the control groups, adds itself to them, and finally execs the container process. This is a drastic simplification of what Docker does when you run a container.

Pros and Cons of a Custom Container System

So when might you want to implement your own container system instead of using Docker? Here are some potential reasons:

  • Learning and understanding – Building your own system is a great way to deeply learn the underlying container technologies
  • Customization – Roll your own container system if you need specific functionality or constraints not supported by existing tools
  • Minimalism – If you only need basic container features, a custom lightweight solution could reduce complexity
  • Tight integration – A custom system may make sense if containers need to integrate closely with other parts of your infrastructure

However, there are significant drawbacks to building your own container system from scratch:

  • Complexity – Correctly implementing all the components of a container runtime is difficult and error-prone
  • Maintenance – A custom system means you're on the hook for maintaining and debugging it over time
  • Lack of ecosystem – Existing platforms like Docker have a large ecosystem of tools, images, and support that a custom setup wouldn't have
  • Reinventing the wheel – Unless your use case is unique, you're likely duplicating effort that existing open source projects already solve

For most use cases, you're better off leveraging the work of the community and using a production-grade open source container runtime. But understanding how to build your own is still a valuable learning exercise.

Further Reading

Hopefully this post gave you a taste of the key components that go into a container system. If you want to go deeper, here are some additional resources:

Happy containerizing!
