Introduction
The kernel heap is a lot more complicated and chaotic than the userspace heap. Let's break it down one step at a time. Here are some things to keep in mind before going forward:
- kmalloc and kfree are used to allocate and free memory respectively. They are the most commonly used heap API functions in the kernel codebase (a minimal usage sketch follows this list)
- When memory is requested through kmalloc, it is taken from a slab, which in turn consists of pages
- Pages are contiguous memory units aligned to 0x1000, on x86 at least. Pages are contiguous in virtual memory but need not be physically contiguous
- Slabs are of various orders starting from 0. A slab of order 0 consists of 2^0 = 1 page, a slab of order 1 consists of 2^1 = 2 pages
- There are separate allocators for pages and slabs
- There is only one page allocator in the kernel, known as the Binary Buddy Allocator. It exposes page allocation APIs such as alloc_pages and __free_pages
- There are 3 slab allocators: SLOB, SLAB and SLUB. They use the page allocator for obtaining slabs. They expose the kmalloc and kfree APIs
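A minimal kernel-module sketch of the kmalloc/kfree API. The module name, sizes and printed message are placeholder choices for illustration; GFP_KERNEL simply means the allocation may sleep.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/string.h>

static int __init kheap_demo_init(void)
{
    /* request 128 bytes; this lands in the kmalloc-128 general purpose cache */
    char *buf = kmalloc(128, GFP_KERNEL);

    if (!buf)
        return -ENOMEM;

    memset(buf, 0x41, 128);
    pr_info("kheap_demo: got a heap object at %px\n", buf);

    kfree(buf); /* return the object to its slab */
    return 0;
}

static void __exit kheap_demo_exit(void) { }

module_init(kheap_demo_init);
module_exit(kheap_demo_exit);
MODULE_LICENSE("GPL");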
Let's now look at each of these allocators one by one and how it all adds up. At the end of each section there are some links with more information; they are references I used or came across while learning about the kernel heap.
SLOB
Stands for Simple List of Blocks. This slab allocator is typically used in embedded systems or machines where memory is very limited. It was deprecated in v6.2 and has since been removed. Some features of SLOB are:
- Implemented in less code since it's simpler than the other allocators
- It is memory efficient, however doesn’t scale and suffers from fragmentation
- Stores the freelist in the object itself
- Has multiple freelists based on size namely: small, medium and large
- If no memory is available in the freelist, extends heap using page allocator
- Uses K&R style allocation (Section 8.7, The C Programming Language)
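SLOB's actual code is more involved, but the K&R idea it borrows is a first-fit walk over a free list embedded in the free blocks themselves. A heavily simplified userspace sketch of that idea (not SLOB's real structures; the header layout and names are made up):

#include <stddef.h>

struct block {                 /* header stored inside each free block */
    size_t size;               /* usable bytes in this block */
    struct block *next;        /* next free block */
};

static struct block *freelist; /* assume this is seeded from the page allocator */

/* First-fit: walk the free list and return the first block big enough. */
static void *kr_alloc(size_t size)
{
    struct block **prev = &freelist;

    for (struct block *b = freelist; b; prev = &b->next, b = b->next) {
        if (b->size >= size) {
            *prev = b->next;   /* unlink from the free list */
            return b + 1;      /* memory starts right after the header */
        }
    }
    return NULL;               /* a real allocator would extend the heap here */
}

static void kr_free(void *ptr)
{
    struct block *b = (struct block *)ptr - 1;

    b->next = freelist;        /* push the block back onto the free list */
    freelist = b;
}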
- LWN Article on SLOB
- Paper on SLOB exploitation
Before delving into the SLAB and SLUB allocators, we first have to understand kmalloc-caches.
kmalloc-caches
kmalloc-caches are “isolated” groups of slabs that are used to allocate memory of a particular size or to allocate specific objects. I put isolated in quotes since it is not always true; more about that in the upcoming sections. Caches are often initialized with some empty slabs to speed up allocations. By default, a lot of caches are created for objects of various sizes and types, some of them being:
- kmalloc-32
- kmalloc-192
- kmalloc-1k
- kmalloc-4k
- files_cache
- inode_cache
The caches starting with “kmalloc” are called general purpose caches. By default, an allocation ends up in one of those caches depending on the size requested. For example, a request of size 32 ends up in kmalloc-32; anything above that (up to 64 bytes) is allocated from kmalloc-64, the next cache. Certain caches are reserved for objects of a specific type, such as files_cache for struct file and inode_cache for struct inode. You can list the caches on your local machine by running sudo cat /proc/slabinfo in your shell.
Each CPU (hardware thread) maintains its own linked list of caches. This is why you might see exploits setting affinity to a particular CPU (a sketch of pinning to one CPU follows the diagram below). Each cache contains three more linked lists, namely:
- slabs_full: contains slabs that are completely full (no memory left)
- slabs_partial: contains slabs that are partially full (some memory available)
- slabs_free: contains slabs that are completely empty (can be re-used or returned to the page allocator)
+-----------+ +-------+ +-----------+
| lastcache | ----------------> | cache | ----------------> | nextcache |
+-----------+ +-------+ +-----------+
/ | \
_______________/ | \_________________
/ | \
/ | \
/ | \
+------------+ +---------------+ +------------+
| slabs_full | | slabs_partial | | slabs_free |
+------------+ +---------------+ +------------+
| | |
| | |
v v v
+-------+ +-------+ +-------+
| slabs | | slabs | | slabs |
+-------+ +-------+ +-------+
| | |
| | |
v v v
+-------+ +-------+ +-------+
| pages | | pages | | pages |
+-------+ +-------+ +-------+
/ \ / \ / \
/ \ / \ / \
/ \ / \ / \
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| obj | | obj | | obj | | obj | | obj | | obj |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+
This is the core mechanism used in the SLAB and SLUB allocators. It leads to less fragmentation and scales well with more memory. Since the slab allocator holds onto some empty slabs, allocation is fast and doesn't always have to fall back on the page allocator.
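Because the cache structures are per CPU, exploits usually pin their process to a single CPU so that every allocation and free hits the same per-CPU state. A minimal userspace sketch using sched_setaffinity(2):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* pin to CPU 0 */

    /* pid 0 == the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    printf("pinned to CPU 0, now running on CPU %d\n", sched_getcpu());
    /* ... heap grooming / exploit logic goes here ... */
    return 0;
}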
Some API functions
- kmem_cache_create() - creates a new cache for an object and adds it to the global list of caches
- kmem_cache_destroy() - removes a cache from the global list of caches and frees the slabs held by the cache
- kmem_cache_alloc() - allocates memory from a given cache
- kmem_cache_free() - frees memory back to the cache
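A minimal kernel-module sketch of a dedicated cache. The cache name, object type and lack of a constructor are arbitrary choices for illustration:

#include <linux/module.h>
#include <linux/slab.h>

struct demo_obj {
    unsigned long id;
    char name[32];
};

static struct kmem_cache *demo_cache;

static int __init demo_cache_init(void)
{
    struct demo_obj *obj;

    /* one cache dedicated to struct demo_obj, no constructor */
    demo_cache = kmem_cache_create("demo_obj_cache", sizeof(struct demo_obj),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    if (!demo_cache)
        return -ENOMEM;

    obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);  /* grab one object */
    if (obj) {
        obj->id = 1;
        kmem_cache_free(demo_cache, obj);            /* return it to the cache */
    }
    return 0;
}

static void __exit demo_cache_exit(void)
{
    kmem_cache_destroy(demo_cache);                  /* frees the slabs held by the cache */
}

module_init(demo_cache_init);
module_exit(demo_cache_exit);
MODULE_LICENSE("GPL");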
SLAB
Note: a slab is a structure consisting of page(s), whereas SLAB is a slab allocator.
The SLAB allocator was the first allocator to implement caching through kmalloc-caches. It speeds up (de)allocation of frequently used objects. The path for an allocation looks something like this:
- Traverse the current CPU's cache list and find the required cache
- Search the partial slabs list and get the first one
- If no slab is found in the partial list, search the free list
- If no slab is found in the free list, allocate a slab and add it to the free list
- Find the first free "slot" in the slab and return it
- Mark the "slot" as used
Although the SLAB allocator increased performance and reduced fragmentation, it had some pitfalls. On systems with a large number of CPUs, there exist queues per node per CPU. In such cases, the allocator uses up a lot of memory on startup.
- Paper on SLAB allocator
- Kernel docs on SLAB allocator
SLUB
SLUB was designed as a drop-in replacement for the SLAB allocator. It simplified the implementation, getting rid of complex queues and metadata. Another major change is that the object freelist is stored inline, in the freed objects themselves. SLUB also added support for merging slabs to reduce memory overhead. This is done implicitly: objects of similar sizes are cached together unless explicitly specified otherwise. This is known as cache aliasing, where a cache actually uses a general purpose cache to back its allocations. Other than these, I can't think of any other major changes between SLAB and SLUB from an exploitation point of view.
You can find cache aliases by running ls -l /sys/kernel/slab/. The symlinks indicate aliasing: every entry with the same symlink destination is backed by the same cache.
kmalloc() is now just a wrapper around kmem_cache_alloc_trace(). Similarly, kfree() uses kmem_cache_free() internally. The requested size is used to determine which general purpose cache the allocation is served from. kmem_cache_alloc() should be used when requesting an allocation from a particular cache.
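The size-to-cache rounding can be observed with ksize(), which returns the usable size of an allocation. A kernel-side sketch (the sizes are illustrative; the exact cache layout depends on the config):

#include <linux/slab.h>
#include <linux/printk.h>

static void kmalloc_rounding_demo(void)
{
    void *p = kmalloc(100, GFP_KERNEL);  /* request 100 bytes */

    if (!p)
        return;

    /* with the default caches this prints 128: the object came from kmalloc-128 */
    pr_info("requested 100 bytes, usable size is %zu\n", ksize(p));
    kfree(p);
}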
- Presentation on SLAB and SLUB
- Presentation on SLOB, SLAB and SLUB
- Structures in SLUB allocator
Buddy allocator
The buddy allocator, a.k.a. the binary buddy allocator, or simply the page allocator, takes on the role of allocating physical pages. Its API is used internally by the slab allocators and vmalloc(). It is the most low level memory allocator. The return value is usually a struct page * or the starting virtual address of the pages. Every physical page in use has an associated struct page to keep track of metadata. Pages allocated together as a higher-order block are tracked through the struct page of the first (head) page.
The allocation is based on the order specified, similar to the slab order, where order n gets you 2^n pages. On freeing pages, the allocator doesn't destroy the page table entry but rather keeps the page around for future allocations.
- Blocks of the same order are grouped into pairs, known as buddies
- When a block is freed, its buddy is checked. If the buddy has already been freed, they are merged into a block of higher order
- The merging is attempted continuously until a block of MAX_ORDER is reached
- During allocation, when a block of the requested order is present, it is returned as is
- If not, a block of higher order is split iteratively until the required order is reached. The remaining blocks from splitting are added to the freelist
Allocating pages of order 1 when only a block of order 3 is available:
order +=================+ Requesting <--+
| | 0 | process |
| +=================+ |
| | 1 |<-------------------+ |
| +=================+ | |
| | 2 |<---------+ +-----+ +-----+
| +=================+ | | 2^1 |<-+->| 2^1 |
| | 3 |--+ | +-----+ | +-----+
| +=================+ | +---------+ +---------+
| | . | | | 2^2 |<-+->| 2^2 |
| | . | | +---------+ | +---------+
| | . | | +-------------------+
| +=================+ +---->| 2^3 block |
| | MAX_ORDER | +-------------------+
v +=================+
Some API functions
- alloc_pages() - allocates 2^order pages and returns a struct page pointer
- get_free_page() - allocates a single page, zeroes it and returns its virtual address
- __free_pages() - frees 2^order pages starting from the given struct page
- free_page() - frees the page at the given virtual address
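A minimal kernel-side sketch of the page allocation API (order 2, i.e. four pages, chosen arbitrarily; note that the zeroed single-page variant is spelled get_zeroed_page() in current kernels):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/printk.h>

static void buddy_demo(void)
{
    /* ask the buddy allocator for 2^2 = 4 contiguous pages */
    struct page *pages = alloc_pages(GFP_KERNEL, 2);
    unsigned long addr;

    if (pages) {
        void *va = page_address(pages);   /* virtual address of the first page */
        pr_info("got 4 pages at %px\n", va);
        __free_pages(pages, 2);           /* give them back, order must match */
    }

    /* single zeroed page, returned as a virtual address */
    addr = get_zeroed_page(GFP_KERNEL);
    if (addr)
        free_page(addr);
}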
- Kernel docs for buddy allocator
- Paper on buddy allocator
Exploitation primitives & techniques
Heap overflow
This primitive allows overwriting neighboring objects in a slab. Usually in such cases, we aim to place an object with function pointers (or pointers in general) next to the vulnerable object, in hopes of overwriting them for stronger primitives such as RIP control or arbitrary read/write.
Use after free
This can also be caused by an incorrect-free primitive. It allows us to hold a reference to a freed object. Allocating a different object in the same cache then lets us cause a type confusion, since we have two references to the same memory through different contexts. Use the type confusion to control some member of the object and build stronger primitives from there.
Freelist poisoning
This technique can be used to hijack the object freelist in the SLUB allocator, since it stores the freelist inline. It is partially mitigated by CONFIG_SLAB_FREELIST_HARDENED, although that can be bypassed; more about that in the upcoming sections. This technique allows allocation of arbitrary kernel memory.
Spray and pray
Heap sprays are a common technique used to increase exploit reliability. It just means spamming a lot of allocations of a particular object to counteract randomness caused by mitigations and system noise (allocations beyond our control). In some cases, the spray has to be controlled to achieve maximum reliability. A concrete spray using msg_msg is sketched in the Objects section below.
Cross-cache
This is a somewhat recent technique where vulnerable objects can overflow onto objects in another cache, or UAF objects can be allocated in a different cache. This requires careful manipulation of the slab (and page) allocator. Essentially, we cause an entire slab containing our vulnerable object to be freed and re-used by another cache. When the cross-cache is to a cache of different order, it requires manipulation of the page allocator as well.
Some hardening features
CONFIG_SLAB_FREELIST_RANDOM
This config option randomizes the order of the slab freelist at runtime, so the order of allocations within a slab is non-deterministic. Exploits which rely on overflowing into neighboring objects, or which require an object at a specific position in a slab, lose reliability. Heap spraying can be used to counteract this.
CONFIG_SLAB_FREELIST_HARDENED
This config option requires the SLUB allocator. It encrypts the inline freelist pointers present in freed objects with a random 8-byte value generated at runtime. The encryption is a basic bitwise XOR operation. This prevents information leaks and hijacking of the freelist. It can be bypassed by getting a heap leak and an encrypted pointer leak and XORing both to recover the encryption key.
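A userspace illustration of the scheme, assuming the obfuscation done by freelist_ptr() in mm/slub.c of recent kernels, which XORs the pointer with the per-cache key and the byte-swapped address of the freelist slot. All values below are made up; the point is only that the key falls out of XORs once the attacker has the relevant leaks:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* byte-swap helper standing in for the kernel's swab() on a 64-bit value */
static uint64_t swab64(uint64_t x)
{
    return __builtin_bswap64(x);
}

/* obfuscation as done for hardened SLUB freelist pointers */
static uint64_t obfuscate(uint64_t next_free, uint64_t slot_addr, uint64_t key)
{
    return next_free ^ key ^ swab64(slot_addr);
}

int main(void)
{
    uint64_t key       = 0xdeadbeefcafebabe;  /* s->random, unknown to the attacker */
    uint64_t next_free = 0xffff888003abc100;  /* made-up heap pointer */
    uint64_t slot_addr = 0xffff888003abc000;  /* where the freelist pointer is stored */

    uint64_t leaked = obfuscate(next_free, slot_addr, key);

    /* with a heap leak (next_free, slot_addr) and the obfuscated value,
       the key is recovered with a simple XOR */
    uint64_t recovered = leaked ^ next_free ^ swab64(slot_addr);

    printf("recovered key: 0x%016" PRIx64 "\n", recovered);
    return 0;
}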
CONFIG_MEMCG_KMEM
Although strictly not a hardening option, it makes exploitation that much harder. It causes some commonly used objects to be put in kmalloc-cg-x caches instead of the usual kmalloc-x caches, because these objects are allocated with the SLAB_ACCOUNT flag. This can be overcome using cross-cache.
CONFIG_RANDOM_KMALLOC_CACHES
This config option creates multiple copies of the kmalloc-cg-x and kmalloc-x caches. The copy chosen for an allocation depends on a random seed and the return address of the kmalloc() call site. This reduces exploit reliability a lot. However, cross-cache can be used to overcome it.
CONFIG_SLAB_VIRTUAL
This config option hasn't been merged into mainline yet. It aims to prevent cross-cache attacks and UAFs in general by preventing re-use of the same virtual address for a slab across different caches. After looking at a bunch of mitigation-instance exploits for kernelctf, I don't see any bypasses for this yet. Please correct me if I'm wrong.
Objects
You can use tools such as pahole and kernel_obj_finder to enumerate the objects available in the build you're working with.
Elastic objects
These objects are called so because they can be allocated in various general purpose caches, although the introduction of CONFIG_MEMCG_KMEM moved most of them to the kmalloc-cg-x caches. These objects are useful because of their dynamic size and user controlled content.
struct msg_msg
- allocate: msgsnd()
- free: msgrcv()
- size: 0x40 - 0x1000
struct msg_msg {
struct list_head m_list;
long m_type;
size_t m_ts; /* message text size */
struct msg_msgseg *next;
void *security;
/* the actual message follows immediately */
};
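As an example of both using msg_msg and the spray technique from earlier, a userspace sketch that sprays fixed-size msg_msg objects. The 0x30-byte msg_msg header is part of the allocation, so a 0x10-byte payload lands each object in kmalloc-64 (or kmalloc-cg-64 with CONFIG_MEMCG_KMEM); the queue count and payload size are arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define NUM_SPRAY 256
#define PAYLOAD   0x10   /* 0x30 header + 0x10 data = 0x40 -> kmalloc(-cg)-64 */

struct spray_msg {
    long mtype;
    char mtext[PAYLOAD];
};

int main(void)
{
    struct spray_msg msg = { .mtype = 1 };
    int qids[NUM_SPRAY];

    memset(msg.mtext, 'A', sizeof(msg.mtext));

    for (int i = 0; i < NUM_SPRAY; i++) {
        /* every queue gets one message: one msg_msg allocation each */
        qids[i] = msgget(IPC_PRIVATE, IPC_CREAT | 0666);
        if (qids[i] < 0 || msgsnd(qids[i], &msg, PAYLOAD, 0) < 0) {
            perror("spray");
            exit(EXIT_FAILURE);
        }
    }

    /* receiving a message frees its msg_msg, useful for punching holes */
    msgrcv(qids[0], &msg, PAYLOAD, 0, IPC_NOWAIT);
    return 0;
}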
struct msg_msgseg
- allocated and freed using the same path as msg_msg
- allocation occurs when the message data length > 0xfc8
- the remaining bytes are put in a struct msg_msgseg
- size: 0x10 - 0x1000
struct msg_msgseg {
struct msg_msgseg *next;
/* the next part of the message follows immediately */
};
struct simple_xattr
- allocate: setxattr() on a tmpfs filesystem
- free: removexattr()
- size: 0x30 - ?
struct simple_xattr {
struct rb_node rb_node;
char *name;
size_t size;
char value[];
};
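A userspace sketch of allocating and freeing a simple_xattr on a tmpfs mount. The path and attribute name are placeholders; /dev/shm is commonly a tmpfs:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/shm/xattr_demo";   /* file on a tmpfs mount */
    char value[0x100];

    memset(value, 'B', sizeof(value));
    close(open(path, O_CREAT | O_RDWR, 0600));  /* make sure the file exists */

    /* allocates a simple_xattr: the header plus our value bytes */
    if (setxattr(path, "user.spray", value, sizeof(value), XATTR_CREATE) < 0)
        perror("setxattr");

    /* frees the simple_xattr again */
    if (removexattr(path, "user.spray") < 0)
        perror("removexattr");

    unlink(path);
    return 0;
}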
struct user_key_payload
- allocate: add_key()
- free: keyctl_revoke() + keyctl_unlink()
- size: 0x20 - ?
struct user_key_payload {
struct rcu_head rcu;
unsigned short datalen;
char data[] __aligned(__alignof__(u64));
};
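A userspace sketch of allocating and freeing a user_key_payload. Raw syscall numbers are used to avoid the libkeyutils dependency; the key description and payload size are arbitrary:

#include <linux/keyctl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    char payload[0x40];
    long key;

    memset(payload, 'C', sizeof(payload));

    /* allocates a user_key_payload: rcu_head + datalen + our data */
    key = syscall(__NR_add_key, "user", "spray_key_0", payload,
                  sizeof(payload), KEY_SPEC_PROCESS_KEYRING);
    if (key < 0) {
        perror("add_key");
        return 1;
    }

    /* revoke + unlink frees the payload (after an RCU grace period) */
    syscall(__NR_keyctl, KEYCTL_REVOKE, key);
    syscall(__NR_keyctl, KEYCTL_UNLINK, key, KEY_SPEC_PROCESS_KEYRING);
    return 0;
}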
Critical objects
Being able to overwrite or control these objects usually leads to privilege escalation. They typically contain fields that control permissions. Since they are critical, they are usually allocated in dedicated caches with the SLAB_NOMERGE and SLAB_ACCOUNT flags.
struct cred
- private cache: cred_jar
- allocate: fork(), clone()
- free: exit()
- size: 0xa0
- overwrite the uid, gid members to change the privileges of the process
struct file
- private cache: files_cache
- allocate: open()
- free: close()
- size: 0x300
- overwrite f_mode to change file access permissions
Some simple challenges
- Pawnyable (heap overflow, UAF)
- CISCN2017 - baby driver (UAF)
- CUCTF2019 - hotrod (userfaultfd race + UAF)
- bi0sCTF2022 - k32 (heap overflow) (shameless plug :D)
- backdoorCTF2023 - empdb (userfaultfd race + freelist poisoning)
The end
I started writing this for my own future reference, but it turned into a blog post. I will probably update it a few times until I'm satisfied with it. I would like to thank a few people for motivating and helping me with learning Linux Kernel Exploitation: Cyb0rG, 3agl3 & kylebot