
A Quick Journey Into the Linux Kernel

Luca Cavallin

I recently revisited my old operating systems coursework and realized how much of it felt too abstract: I'd learned about processes, scheduling, and memory management, but mostly in a theoretical sense. That's when I decided to pick up Robert Love's Linux Kernel Development book. Despite being written for the (now quite old) 2.6 kernel series, its pages still offer good insights into the fundamental ideas behind Linux internals. Reading it reminded me that while specific APIs and data structures evolve over time, the core design principles remain quite consistent.

I wanted a concrete understanding of how "real world" operating systems solve everyday problems - things like scheduling processes fairly, dispatching interrupts promptly, and managing memory without wasting CPU cycles. Love's book was a great starting point, and it prompted me to dig even deeper into current kernel sources and documentation to see what had changed (and, quite often, what had stayed the same). In this blog post, I'll walk you through some of the biggest takeaways from my kernel deep dive, with some personal commentary alongside the factual details.

Getting Started in Kernel Land

Developing inside the kernel is vastly different from everyday userspace programming. For one, you have to let go of many familiar comforts. The standard C library is off-limits, though the kernel does provide simplified variants of some libc functions in its own lib/ directory. You can't rely on all those headers you normally include in user space, and you need to be aware that the kernel has its own usage conventions for GNU C extensions.

Equally important: memory protection is not the same as it is in user space. In user space, if you accidentally dereference a pointer that doesn't belong to you, you'll likely trigger a segmentation fault and your process may crash. In the kernel, a bad memory access triggers an "oops", which may kill the offending task or destabilize the entire system. You also can't casually do floating-point operations in the kernel - at least, not in the usual way - because the kernel avoids using the floating-point unit by default for performance and complexity reasons.

Then there's the stack. Instead of the (relatively) large, growable stack you're used to, the kernel provides a small, fixed-size stack (often just a few pages - typically 8-16 KB, depending on the architecture and kernel version). Stack overflows can happen silently, so you have to be disciplined about local variables. And while concurrency is a concern for any modern software project, the kernel is at another level entirely. Interrupts can arrive at any time, preemption may occur in unexpected places, and symmetric multiprocessing (SMP) means multiple CPUs might be running your kernel code simultaneously. It's a whole different game!
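
If you want to poke at this world yourself, the classic entry point is a loadable module. Here's a minimal sketch - roughly the "hello world" of kernel land, with illustrative names - that builds against the kernel headers and does nothing but log on load and unload:

    // hello.c - a minimal module skeleton (illustrative names)
    #include <linux/module.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
            pr_info("hello: module loaded\n");
            return 0;
    }

    static void __exit hello_exit(void)
    {
            pr_info("hello: module unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal example module");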

Processes: More Than Just "Tasks"

One of the first big surprises for me was the realization that, in Linux, processes and threads aren't truly distinct entities under the hood. Both are represented by the same fundamental data structure, the task_struct. When you use clone(), fork(), or pthread_create() in user space, you're really just creating variations on the same "task" theme. The arguments you pass to clone() essentially decide which resources - like memory, file descriptors, or signal handlers - get shared and which get duplicated.
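
As a rough user-space illustration, here's a sketch that calls clone() directly to create a child sharing the parent's address space and file table. It's a simplified stand-in for what pthread_create() does with a larger flag set (CLONE_THREAD, CLONE_SIGHAND, and so on); the stack size and flag choice here are arbitrary:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (64 * 1024)

    static int child_fn(void *arg)
    {
            printf("child: pid=%d\n", getpid());
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(STACK_SIZE);

            /* Share the address space and the open-file table with the child;
             * SIGCHLD lets the parent wait for it like a normal child. */
            int flags = CLONE_VM | CLONE_FILES | SIGCHLD;

            pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, NULL);
            waitpid(pid, NULL, 0);
            free(stack);
            return 0;
    }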

A neat trick Linux uses after it spawns a child process is to let the child run first. If the child calls exec(), it instantly replaces its address space with a new program, making copy-on-write optimizations very efficient. If the parent were allowed to run and write to memory first, the kernel would have to create extra copies of pages that might turn out to be useless if the child soon calls exec().

To access the current task, kernel code typically uses a macro named current. Its implementation is architecture dependent. On some architectures, current lives in a dedicated register; on x86 it was historically derived from the thread_info structure at the bottom of the kernel stack, and more recent kernels read it from a per-CPU variable.
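
In practice it reads like an ordinary pointer to the running task's task_struct. A tiny sketch (running in process context):

    #include <linux/sched.h>     /* struct task_struct, current */
    #include <linux/printk.h>

    static void who_am_i(void)
    {
            /* current points at the task_struct of whatever is running now. */
            pr_info("running as %s (pid %d)\n", current->comm, current->pid);
    }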

Finally, terminating a process doesn't mean it disappears right away. A defunct task becomes a "zombie" until its parent calls wait() (or the equivalent) to read its exit code. If a parent crashes, its children get "re-parented" to init (PID 1), and eventually init cleans them up. So if you ever see "zombie" processes around, that's exactly what's going on.

Scheduling: And Fairness for All… or Not

Scheduling is one of the kernel's core jobs: deciding which process (or thread) gets to run next. Linux uses scheduling classes to manage different policies, with the Completely Fair Scheduler (CFS) being the primary one for normal, non-real-time tasks. The top-level scheduling function, schedule(), checks each scheduling class in order of priority. Once it finds a class with a runnable process, it hands over scheduling decisions to that class.

CFS aims for fairness: it tries to give each runnable process a proportion of the CPU, guided by nice values (ranging from -20 for high priority to +19 for low priority). Under CFS, there's no fixed timeslice. Instead, the scheduler tracks something called virtual runtime (vruntime), which accumulates faster for low-priority tasks and more slowly for high-priority tasks. Data structures like a red-black tree help keep track of which process should run next (the one with the smallest vruntime).

For real-time tasks (with priorities from 0 to 99), the scheduler prioritizes them over normal "nice value" tasks. Linux also supports kernel preemption, meaning that even kernel-mode code can be forcibly interrupted to run a higher-priority task, as long as no spinlocks or other non-preemptible regions are held. This is all part of the kernel's drive to minimize latency and keep the system responsive.
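
From user space, you can ask for a real-time policy with sched_setscheduler(). A small sketch (it needs root or CAP_SYS_NICE, and the priority value of 50 is arbitrary):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            struct sched_param sp = { .sched_priority = 50 };

            /* Switch the calling process (pid 0 = self) to SCHED_FIFO. */
            if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
                    perror("sched_setscheduler");
                    return 1;
            }

            printf("now running under SCHED_FIFO\n");
            return 0;
    }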

Making Calls into the Kernel: System Calls

System calls form the boundary between user space and kernel space. Although we call them "functions", system calls are really special entry points triggered by architecture-specific instructions like syscall or int 0x80 on x86. Once in the kernel, the syscall handler routes you to the correct function via a system call table. Each architecture has its own table, so if you add a new system call, you have to add an entry for every architecture you want to support.

Because user space can pass in malicious or malformed pointers, system calls must carefully validate their parameters - hence functions like copy_from_user() and copy_to_user() which safely move data between user space and kernel space. These functions can block, and they'll return errors if the memory access is invalid.
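
On the kernel side, that validation usually looks something like the sketch below; the handler name and the fixed buffer size are made up for illustration:

    #include <linux/uaccess.h>   /* copy_from_user() */
    #include <linux/errno.h>

    /* Hypothetical handler receiving a buffer from user space. */
    static long my_handler(const char __user *ubuf, size_t len)
    {
            char kbuf[64];

            if (len > sizeof(kbuf))
                    return -EINVAL;

            /* copy_from_user() returns the number of bytes it could NOT copy. */
            if (copy_from_user(kbuf, ubuf, len))
                    return -EFAULT;

            /* ... safely operate on kbuf in kernel space ... */
            return 0;
    }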

In practice, modern kernel developers rarely add new system calls unless absolutely necessary. More commonly, you'd expose functionality via device files or the sysfs interface, letting user programs interact through read/write operations or specialized system files.

Kernel Data Structures: Lists, Trees, and More

No need to reinvent the wheel: the Linux kernel includes a bunch of built-in data structures. The linked list implementation, for instance, is a circular doubly linked list that you can embed directly into your own structures. There's a set of macros and helper functions to add, remove, and iterate over these lists without a fuss. You can build stacks, queues, or other patterns on top of them.
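
As a sketch - with invented struct and function names - embedding a list_head and walking the list looks like this:

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/printk.h>

    struct my_event {
            int id;
            struct list_head node;       /* embedded list hook */
    };

    static LIST_HEAD(event_list);        /* an empty circular list */

    static void add_event(int id)
    {
            struct my_event *ev = kmalloc(sizeof(*ev), GFP_KERNEL);

            if (!ev)
                    return;
            ev->id = id;
            list_add_tail(&ev->node, &event_list);
    }

    static void dump_events(void)
    {
            struct my_event *ev;

            list_for_each_entry(ev, &event_list, node)
                    pr_info("event %d\n", ev->id);
    }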

For maps, the kernel offers specialized structures that are often keyed by UIDs or other integer IDs, typically using red-black trees under the hood. There's also the kfifo interface for ring buffers (a classic FIFO queue approach) and macros for building your own specialized structures around them. The overall design is minimalistic but powerful, and it's all there to keep you from needing to write your own code for these common patterns.
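
A minimal kfifo sketch, assuming a reasonably recent kernel (the kfifo macros have changed shape over the years) and an invented fifo name:

    #include <linux/kfifo.h>
    #include <linux/printk.h>

    /* A statically allocated FIFO of 32 ints (size must be a power of two). */
    static DEFINE_KFIFO(my_fifo, int, 32);

    static void fifo_demo(void)
    {
            int v;

            kfifo_put(&my_fifo, 42);         /* enqueue; fails if full  */
            if (kfifo_get(&my_fifo, &v))     /* dequeue; 0 when empty   */
                    pr_info("got %d\n", v);
    }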

Interrupts and Interrupt Handlers

Interrupts are how hardware devices let the CPU and kernel know something needs attention. Each interrupt has a unique number and a corresponding Interrupt Service Routine (ISR) registered in the kernel. When an interrupt arrives, the CPU halts whatever it's doing, jumps to the ISR, and (ideally) returns after that code does the minimum necessary work.

Why just the minimum? Because while you're in the ISR's "top half", interrupts on that line are disabled, and system-wide performance can degrade if you linger there. The heavy lifting happens in a "bottom half", which the kernel implements using mechanisms like tasklets, softirqs, or workqueues. This deferred processing model helps keep interrupt latency low.

Every interrupt handler in Linux returns either IRQ_HANDLED to confirm "yep, that was for me," or IRQ_NONE if it wasn't actually generated by this particular device. In many modern systems, multiple devices share interrupts, so you need a quick way to detect whether you're the intended recipient.
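
A skeleton handler and its registration might look roughly like this; my_device and the "did our device raise this?" check are hypothetical stand-ins for real driver logic:

    #include <linux/interrupt.h>

    struct my_device { void __iomem *regs; };        /* hypothetical */

    static irqreturn_t my_irq_handler(int irq, void *dev_id)
    {
            struct my_device *dev = dev_id;

            if (!device_raised_irq(dev))     /* hypothetical status check */
                    return IRQ_NONE;         /* shared line, not ours     */

            /* Acknowledge the hardware, save minimal state, and defer the
             * heavy lifting to a bottom half. */
            return IRQ_HANDLED;
    }

    static int my_setup_irq(struct my_device *dev, int irq)
    {
            return request_irq(irq, my_irq_handler, IRQF_SHARED, "mydev", dev);
    }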

Bottom Halves and Deferring Work

To avoid spending too much time with interrupts disabled, Linux defers the bulk of the work to "bottom halves". These run with fewer restrictions and at more opportune moments. The kernel offers three main facilities for this:

  • Softirqs: statically allocated, and can run concurrently on multiple CPUs.
  • Tasklets: built on top of softirqs and created dynamically; two tasklets of the same type never run on two CPUs at once.
  • Workqueues: run in process context and can sleep/block; each CPU has its own kernel worker thread(s).

If your deferred routine needs to sleep (e.g., waiting for a resource), you must use a workqueue. Otherwise, you can stick with tasklets or softirqs for performance. Most driver authors end up using tasklets or workqueues, as softirqs require more complex concurrency handling and must be registered statically.
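
A workqueue sketch with invented names, showing the usual pattern of queuing deferred work from a top half:

    #include <linux/workqueue.h>

    static void my_deferred_fn(struct work_struct *work)
    {
            /* Runs in process context via a kernel worker thread, so it may
             * sleep: take a mutex, allocate with GFP_KERNEL, wait on I/O. */
    }

    static DECLARE_WORK(my_work, my_deferred_fn);

    /* Called from the interrupt handler (top half): just queue the work. */
    static void my_top_half_hook(void)
    {
            schedule_work(&my_work);
    }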

Kernel Synchronization: The Art of Staying Safe

Concurrency issues are everywhere in the kernel. You have interrupts, bottom halves, kernel preemption, SMP, sleeping, and more. To survive in this environment, you need to rely on various synchronization primitives and design patterns that ensure safe data access.

At the most basic level, the kernel provides atomic operations (atomic_t, atomic64_t) and bitwise operations that run atomically. Then there are locks:

  • Spinlocks: Busy-wait locks that can be used in interrupt context (because they never sleep). If a lock is also taken by an interrupt handler, any other code acquiring it must disable local interrupts while holding it, or it risks deadlocking against that handler (as sketched below).
  • Reader-writer spinlocks: Let multiple readers in but only one writer at a time. Readers can starve the writer, though.
  • Semaphores: Counting semaphores that allow you to block if the resource is unavailable. For a single "token," they act like a mutex.
  • Mutexes: Similar to semaphores, but only the thread that locked it can unlock it. No recursive locking allowed.
  • Completions: Lightweight signals that one thread can use to notify another that "hey, I'm done with my work."

There are also advanced constructs like sequence locks, which give preference to writers. And if you need to coordinate with interrupts, bottom halves, or kernel preemption, you can disable them in a localized way to protect your critical sections. It's best to assume everything is happening at once because, in kernel land, it often is.
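
As an example of the spinlock rule above, here's a sketch of process-context code protecting a counter it shares with an interrupt handler (names are illustrative):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(my_lock);
    static unsigned long shared_counter;

    void bump_counter(void)
    {
            unsigned long flags;

            /* Disable local interrupts and take the lock, so the interrupt
             * handler that also uses my_lock cannot deadlock against us. */
            spin_lock_irqsave(&my_lock, flags);
            shared_counter++;
            spin_unlock_irqrestore(&my_lock, flags);
    }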

Timers and Timekeeping

Another piece of kernel magic is how time is tracked and used. The hardware platform provides a system timer that ticks at a frequency represented by the HZ constant in the kernel. Each "tick" triggers a timer interrupt, and in response, the kernel updates the "jiffies" counter (the global variable holding the number of ticks since boot), maintains system uptime, updates the load average, and so on.

Timers in the kernel are either periodic or dynamic. Dynamic timers let you schedule a function to run after some delay. When a dynamic timer expires, the kernel's timer softirq runs the appropriate callback in bottom-half context. If you ever need to cancel a timer, make sure you use the "sync" versions of the cancellation functions - like del_timer_sync() - if there's any possibility the timer is running on another CPU.
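
A timer sketch using the current API (timer_setup() on recent kernels; older ones used setup_timer()/init_timer()), with invented names and an arbitrary 100 ms delay:

    #include <linux/timer.h>
    #include <linux/jiffies.h>
    #include <linux/printk.h>

    static struct timer_list my_timer;

    static void my_timer_fn(struct timer_list *t)
    {
            /* Runs in softirq (bottom-half) context: no sleeping here. */
            pr_info("timer fired, jiffies=%lu\n", jiffies);
    }

    static void arm_timer(void)
    {
            timer_setup(&my_timer, my_timer_fn, 0);
            mod_timer(&my_timer, jiffies + msecs_to_jiffies(100));
    }

    static void stop_timer(void)
    {
            /* Waits for a callback running on another CPU to finish first. */
            del_timer_sync(&my_timer);
    }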

For short delays where sleeping is not possible (like a microsecond-scale wait in driver code), you can call udelay(), ndelay(), or mdelay(). Otherwise, if you can afford to yield the CPU, schedule_timeout() is your friend.

Memory Management: Pages and Allocations

Inside the kernel, physical memory is broken down into pages. Each page is usually tracked by a struct page, describing who owns it (user processes, kernel allocations, etc.). Pages are grouped into zones such as ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. Each architecture might vary in how many zones it has.

If you need physically contiguous memory of size 2^order pages, you can use alloc_pages(). For most small allocations, you'll use kmalloc(), which returns a physically contiguous chunk of memory. There's also vmalloc() for allocations that need to be contiguous only in virtual address space, not physically, at a small performance cost (it has to set up extra page-table mappings).
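
A small sketch contrasting the two (sizes and names here are arbitrary):

    #include <linux/slab.h>        /* kmalloc(), kfree() */
    #include <linux/vmalloc.h>     /* vmalloc(), vfree() */

    static int alloc_demo(void)
    {
            /* Small, physically contiguous allocation: */
            void *buf = kmalloc(256, GFP_KERNEL);
            if (!buf)
                    return -ENOMEM;

            /* Large allocation that only needs to be virtually contiguous: */
            void *big = vmalloc(4 * 1024 * 1024);
            if (!big) {
                    kfree(buf);
                    return -ENOMEM;
            }

            /* ... use the memory ... */

            vfree(big);
            kfree(buf);
            return 0;
    }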

An entire subsystem called the "slab allocator" (or SLUB, depending on your kernel version) manages the creation of caches for commonly allocated objects. This helps to recycle memory efficiently. You can even create your own custom caches for structures you allocate frequently in your driver or subsystem.
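
Creating and using such a cache looks roughly like this; struct my_request here is a hypothetical, frequently allocated object:

    #include <linux/slab.h>

    struct my_request {              /* hypothetical */
            int id;
            char payload[120];
    };

    static struct kmem_cache *req_cache;

    static int cache_init(void)
    {
            req_cache = kmem_cache_create("my_request",
                                          sizeof(struct my_request),
                                          0, SLAB_HWCACHE_ALIGN, NULL);
            return req_cache ? 0 : -ENOMEM;
    }

    static void cache_demo(void)
    {
            struct my_request *req = kmem_cache_alloc(req_cache, GFP_KERNEL);

            if (!req)
                    return;
            /* ... fill in and use the request ... */
            kmem_cache_free(req_cache, req);
    }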

The kernel also provides dedicated per-CPU variables and APIs so you can store data separately on each CPU without needing to lock it. Access to those variables requires caution: it's usually a bad idea to read another CPU's per-CPU variable without locking.
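
A per-CPU counter sketch, with an invented variable name:

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(unsigned long, packet_count);

    static void count_packet(void)
    {
            /* get_cpu_var() disables preemption so we stay on this CPU while
             * touching our copy; put_cpu_var() re-enables preemption. */
            get_cpu_var(packet_count)++;
            put_cpu_var(packet_count);

            /* Equivalent shorthand: this_cpu_inc(packet_count); */
    }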

The Virtual Filesystem (VFS)

If you've ever used file I/O in Linux (and who hasn't?), you've indirectly interacted with the VFS. It presents a single, uniform interface for all filesystems - ext4, XFS, NFS, you name it. Under the hood, the VFS uses a set of common object types like the superblock, inode, dentry, and file structs. Each object has an associated table of function pointers - methods for reading, writing, looking up entries, etc.

An inode stores file metadata (ownership, permissions, timestamps, etc.). A dentry represents a directory entry for a file or directory component of a path. The kernel caches these dentries in the dcache to speed up path lookups. Meanwhile, the superblock object holds information about an entire mounted filesystem, and the file object represents an open file with specific access modes. The genius here is that when you open a file, you don't really need to care whether it's on an ext4 partition, a network share, or a ramdisk. It all just works, thanks to the VFS layer.
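
Those per-object operation tables are what make the abstraction work. As a sketch, a simple character driver or filesystem fills in a file_operations structure like this (names are invented, and the read method is just a stub):

    #include <linux/fs.h>
    #include <linux/module.h>

    static ssize_t mydev_read(struct file *filp, char __user *buf,
                              size_t count, loff_t *ppos)
    {
            /* Copy data to user space, advance *ppos, return bytes read. */
            return 0;
    }

    static const struct file_operations mydev_fops = {
            .owner = THIS_MODULE,
            .read  = mydev_read,
    };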

The Block I/O Layer

Storage devices are usually treated as block devices in Linux, meaning the kernel can read or write them in multiples of fixed-size sectors. Above that is the filesystem layer, which deals with blocks (multiples of sectors) up to a maximum of the system's page size.

Older kernels used a "buffer head" structure to represent each block in memory. Modern kernels often rely on bio structures, which allow for scatter-gather I/O (i.e., non-contiguous pages forming a single operation). When you initiate a read or write, the kernel builds a bio, sets up a list of pages, offsets, and lengths, and then queues the request in the block device's request queue. A specialized I/O scheduler then merges and sorts those requests to minimize disk seeks - crucial for spinning disks. On SSDs, a simpler scheduler - historically noop or deadline, "none" or "mq-deadline" in today's multi-queue block layer - might suffice because there's little penalty for random access.

Debugging the Kernel

Debugging kernel code can be intimidating. Thankfully, there's the printk() function, which is akin to printf() in user space but works reliably in almost any kernel context. If something goes really wrong, you may see an "oops" or a full kernel panic. An "oops" is an error from which the kernel attempts to recover; a panic means the kernel can't recover at all.

Sometimes the oops or panic message prints raw addresses. Tools like ksymoops (or building the kernel with CONFIG_KALLSYMS) can decode those addresses into function names and offsets. The kernel also supports kgdb, which enables source-level debugging over a serial or network connection, though it can be tricky to set up.

If you find yourself suspecting a bug, you can use BUG_ON() or panic() to intentionally crash when certain conditions are met. It sounds extreme, but purposeful crashing can sometimes be the quickest way to diagnose a fatal flaw - especially when dealing with race conditions or memory corruption.
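
A small sketch of both ideas, with an invented driver name and an arbitrary invariant:

    #include <linux/kernel.h>
    #include <linux/bug.h>

    static void check_request(int status, int pending)
    {
            /* printk() takes a log level; pr_err() and friends wrap it. */
            if (status < 0)
                    printk(KERN_ERR "mydrv: request failed, status=%d\n", status);

            /* Deliberately crash if this (arbitrary) invariant is violated. */
            BUG_ON(status == 0 && pending < 0);
    }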

Portability: Why Linux Runs on Everything

One of the strongest qualities of the Linux kernel is its portability. Whether it's running on a tiny embedded board, a standard x86 PC, or a massive mainframe, Linux keeps most of its code architecture-agnostic. Each platform implements its own low-level routines (like how to switch processes, handle exceptions, or manage the MMU), but the common kernel subsystems remain the same.

If you find yourself writing kernel code, never assume things like word size, endianness, or even the timer frequency. Use kernel macros and opaque types (u32, atomic_t, etc.) and rely on the appropriate APIs for alignment or converting between big-endian and little-endian. The kernel is packed with these kinds of helpers precisely because code must run cleanly across multiple architectures.
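
For example, converting a value to and from an explicitly little-endian on-disk or wire format is written the same way on every architecture (the value here is arbitrary):

    #include <linux/types.h>
    #include <asm/byteorder.h>
    #include <linux/printk.h>

    static void endian_demo(void)
    {
            /* Fixed little-endian representation, regardless of host CPU. */
            __le32 on_disk = cpu_to_le32(0x12345678);
            u32 native = le32_to_cpu(on_disk);

            pr_info("native value: 0x%08x\n", native);
    }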

Wrapping Up

Diving into kernel development can feel unfamiliar at first. The limited stack, concurrency challenges, and specialized memory management make it quite different from user-space programming. But once you start understanding how things fit together, it's rewarding - you gain direct control over hardware, design low-level data structures, and see the impact of your changes in real time.

That said, it's not always smooth sailing. Debugging is more difficult, mistakes can crash the entire system, and there are no safety nets like in user space. But for engineers interested in how computers really work, kernel development provides valuable insights into concurrency, performance, and the trade-offs behind a well-designed, portable system.

If you're interested, a good starting point is experimenting with kernel modules, exploring the scheduler under kernel/sched/ (a single kernel/sched.c in older kernels), or looking into how the block layer handles I/O. Even older resources like Robert Love's Linux Kernel Development offer useful context for understanding the principles that still shape modern kernels. There's always more to explore.