martin carpenter


extracting openssl private keys from core dumps

2013/10/22

tags: core OpenSSL python RSA

This normally quiet page saw 6000 visits on the weekend after the Heartbleed OpenSSL bug announcement (April 12th-13th, 2014). Although the method described here could conceivably be used to retrieve private keys via Heartbleed, there are better approaches when working with a limited data leak: Rubin Xu's post presents a good summary via Coppersmith's attack.

Given a core dump from an application that has read an OpenSSL RSA private key, how easy is it to find and recover that private key? It turns out that it's not as hard as it might appear with a little python and some ELF-parsing magic courtesy of Eli Bendersky's pyelftools.

The steps were:

1. model the core dump's address space;
2. model OpenSSL's BIGNUM structure and scan the core for candidate values;
3. triage the candidates against the RSA key parameters.

Modelling the core

First off, I needed a suitable core dump from some application: I generated one from ssh-keysign(8) using gcore(1) as root. This came out at around 25MB. An ELF core file is composed of segments that reflect the virtual address space of the process when the core was dumped. You can list these with readelf(1) or objdump(1) (which presents the segments as sections). We are interested in load1, load2, ..., which are distinguished by a p_type of PT_LOAD. The VMA column is the Virtual Memory Address, and File off gives the byte offset into the core file for the data:

martin@linux:~$ objdump -h core | head -20

core:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 note0         00000558  0000000000000000  0000000000000000  000003f8  2**0
                  CONTENTS, READONLY
  1 .reg/7449     000000d8  0000000000000000  0000000000000000  0000047c  2**2
                  CONTENTS
  2 .reg          000000d8  0000000000000000  0000000000000000  0000047c  2**2
                  CONTENTS
  3 .auxv         00000130  0000000000000000  0000000000000000  0000060c  2**3
                  CONTENTS
  4 .reg2/7449    00000200  0000000000000000  0000000000000000  00000750  2**2
                  CONTENTS
  5 .reg2         00000200  0000000000000000  0000000000000000  00000750  2**2
                  CONTENTS
  6 load1         00000000  0000000000400000  0000000000000000  00001000  2**12
                  ALLOC, READONLY, CODE
  7 load2         00001000  0000000000600000  0000000000000000  00001000  2**12
martin@linux:~$

For 64 bit architectures this means that we will necessarily have a sparse array of data. I modelled this using an array of (non-sparse) contiguous Blocks of data. A Block is a simple object with start and end addresses and its associated data. Squashing adjacent blocks together reduced the number of them from 46 down to 9 which was good for performance.
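The Block class itself isn't reproduced in this post; a minimal sketch, assuming only the attributes used below (start, end, size, data and the from_segment constructor) and pyelftools' real segment API (header.p_vaddr, segment.data()), might look like:

```python
class Block(object):
    """A contiguous run of memory from the core: [start, end] plus its bytes."""

    def __init__(self, start, data):
        self.start = start
        self.data = data
        self.size = len(data)
        self.end = start + self.size - 1  # inclusive end address

    @classmethod
    def from_segment(cls, segment):
        # p_vaddr is the virtual address the segment was mapped at;
        # segment.data() returns the raw bytes stored in the core file.
        return cls(segment.header.p_vaddr, segment.data())
```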

Here's the Core factory method:

    @classmethod
    def from_file(cls, path):
        core = cls()
        with open(path, 'rb') as fd:
            core.elffile = ELFFile(fd)
            # Add a memory block for each loadable segment found.
            # (Do this inside the with block: pyelftools reads segment
            # data lazily from the open file.)
            for i in xrange(core.elffile.num_segments()):
                segment = core.elffile.get_segment(i)
                header = segment.header
                if header.p_type == 'PT_NOTE':
                    continue
                elif header.p_type == 'PT_LOAD':
                    if header.p_filesz > 0:
                        block = Block.from_segment(segment)
                        core.blocks.append(block)
                else:
                    raise RuntimeError('Unknown p_type %s' % header.p_type)
        # Sort memory blocks in place by start address
        core.blocks.sort(key=lambda block: block.start)
        # Squash adjacent blocks
        for i in xrange(len(core.blocks) - 2, -1, -1):
            if core.blocks[i].end + 1 == core.blocks[i+1].start:
                core.blocks[i].end = core.blocks[i+1].end
                core.blocks[i].size += core.blocks[i+1].size
                core.blocks[i].data += core.blocks[i+1].data
                del core.blocks[i+1]
        return core

I wrote an iterator each_addr to return addresses in order, using a small helper next_addr to return the next (possibly non-contiguous) address, and a peek function to return the appropriate slice of data.
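Those helpers aren't shown in the post; a sketch of how each_addr, contains_addr and peek might work over the sorted block list (method names from the description above, implementation details assumed; Block stands in for the class described earlier):

```python
from collections import namedtuple

# Minimal stand-in for the Block objects described above.
Block = namedtuple('Block', 'start end data')

class Core(object):
    def __init__(self):
        self.blocks = []  # sorted, non-overlapping, squashed Blocks

    def each_addr(self):
        """Yield every mapped address in ascending order; gaps between
        blocks are simply skipped."""
        for block in self.blocks:
            for addr in range(block.start, block.end + 1):
                yield addr

    def contains_addr(self, addr):
        """True if addr falls inside some block (linear scan is fine
        with only a handful of blocks)."""
        return any(b.start <= addr <= b.end for b in self.blocks)

    def peek(self, addr, length):
        """Return up to length bytes at addr; the read comes up short
        if it runs off the end of a block."""
        for b in self.blocks:
            if b.start <= addr <= b.end:
                off = addr - b.start
                return b.data[off:off + length]
        return b''
```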

Modelling OpenSSL BIGNUM

An OpenSSL RSA key is just a struct containing its parameters as OpenSSL BIGNUMs and some other stuff we don't care too much about. A BIGNUM looks like this:

    typedef struct bignum_st BIGNUM;
    struct bignum_st
        {
        BN_ULONG *d;    /* Pointer to an array of 'BN_BITS2' bit chunks. */
        int top;        /* Index of last used d +1. */
        /* The next are internal book keeping for bn_expand. */
        int dmax;       /* Size of the d array. */
        int neg;        /* one if the number is negative */
        int flags;
        };
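OpenSSL stores the value least significant word first: d[0] holds the lowest BN_BITS2-bit chunk. The post converts values via libcrypto's BN_bn2dec (below), but for reference the same number can be reassembled in pure python (assuming 64-bit BN_ULONGs):

```python
def bn_words_to_int(words):
    """Reassemble a BIGNUM's value from its 64-bit words, least
    significant word (d[0]) first."""
    value = 0
    for i, word in enumerate(words):
        value |= word << (64 * i)
    return value
```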

I modelled this using python's ctypes:

class Bn(Structure):
    """Class to represent OpenSSL BIGNUM."""

    _fields_ = [
            ('d', POINTER(c_ulong)),
            ('top', c_int),
            ('dmax', c_int),
            ('neg', c_int),
            ('flags', c_int),
            ]

    LIBCRYPTO = cdll.LoadLibrary('libcrypto.so')
    bn2dec = LIBCRYPTO.BN_bn2dec
    bn2dec.restype = c_char_p

    def to_long(self):
        return long(Bn.bn2dec(byref(self)))

    def __str__(self):
        return str(self.to_long())

BN_ULONG is an unsigned long (or unsigned long long), so 8 bytes on my system; an int is 4 bytes. Using this information I was able to write a second iterator with a small test for BIGNUM-ness (please forgive the excessive nesting!). Python's struct.unpack does the data translation (beware endianness if you try this snippet, since it assumes little-endian):

    def each_bn(self):
        """Iterator to return all possible BIGNUMs in the address space."""
        struct_len = 8 + 4*4 # 64 bit pointer + 4 x 32 bit ints (no padding)
        for addr in self.core.each_addr():
            if addr % 8 == 0: # 64 bit aligned
                struct_data = self.core.peek(addr, struct_len)
                if len(struct_data) == struct_len:
                    top = unpack('i', struct_data[8:12])[0] # int
                    dmax = unpack('i', struct_data[12:16])[0] # int
                    if ((top > 5) and (dmax > 5) and (dmax - top < 2)
                    and (dmax - top >= 0) and (dmax < 1000000)):
                        neg = unpack('i', struct_data[16:20])[0] # int
                        d = unpack('Q', struct_data[0:8])[0] # 64 bit pointer
                        if (neg == 0 or neg == 1) and self.core.contains_addr(d):
                            data_len = dmax * 8 # dmax unsigned long longs
                            data = self.core.peek(d, data_len)
                            if len(data) == data_len:
                                flags = unpack('i', struct_data[20:24])[0] # int
                                data_buf = create_string_buffer(data)
                                bn = Bn(d=cast(pointer(data_buf), POINTER(c_ulong)),
                                        top=c_int(top),
                                        dmax=c_int(dmax),
                                        neg=c_int(neg),
                                        flags=c_int(flags))
                                yield bn

I've done a little work to make this efficient but it could be tightened further (the upper bound on dmax, for one thing). Since the structs are 8-byte aligned we only need to test one in every eight bytes. If peek returns enough consecutive data then it is unpacked and the values are sanity-checked. The final check is for d, a pointer to another location which contains the actual value of the BIGNUM. Since checking this pointer is relatively expensive it is done last (with only 9 blocks to check, binary search was not a significant speed improvement). If all of those tests pass then this might be a real BIGNUM, so it is yielded to the triage routine.
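For a core with many more blocks, the pointer check could use a binary search over the sorted start addresses. A sketch (not from the original code; Block again stands in for the class described earlier):

```python
import bisect
from collections import namedtuple

Block = namedtuple('Block', 'start end data')

def contains_addr(blocks, addr):
    """Binary search: find the last block starting at or before addr,
    then check that addr falls inside it. blocks must be sorted by start."""
    starts = [b.start for b in blocks]
    i = bisect.bisect_right(starts, addr) - 1
    return i >= 0 and addr <= blocks[i].end
```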

RSA recap and triage strategy

RSA uses a public/private key pair, each containing some parameters. Skipping the details (how the parameters d and e are chosen, the extra values stored in the private key for performance reasons, ...) the scheme works approximately as follows: the modulus n is the product of two large secret primes p and q; the public key is (n, e); the private exponent d satisfies ed ≡ 1 (mod (p-1)(q-1)); encryption is c = m^e mod n and decryption is m = c^d mod n.

This gives us a straightforward algorithm: the public key (n, e) is known, so test each candidate BIGNUM to see whether it divides n. Any hit is one of the secret primes, and from p and q = n/p the rest of the private key follows.

Using just python's native arithmetic this ran in 30 seconds on my 25MB test core file (O(n) in both time and space), which makes it a very feasible attack.
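The triage can be sketched as follows (my reconstruction, not the original code; modinv is written out with the extended Euclidean algorithm since the post predates Python 3.8's pow(e, -1, m)):

```python
def _egcd(a, b):
    """Extended Euclid: returns (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    if a == 0:
        return b, 0, 1
    g, x, y = _egcd(b % a, a)
    return g, y - (b // a) * x, x

def modinv(a, m):
    """Modular inverse of a mod m."""
    g, x, _ = _egcd(a % m, m)
    if g != 1:
        raise ValueError('no inverse')
    return x % m

def triage(n, e, candidates):
    """Test each candidate BIGNUM value against the public modulus n.
    A divisor of n is one of its prime factors, which is enough to
    rebuild the private exponent d."""
    for x in candidates:
        if 1 < x < n and n % x == 0:
            p, q = x, n // x
            d = modinv(e, (p - 1) * (q - 1))
            return p, q, d
    return None
```

With toy numbers (n = 61 * 53 = 3233, e = 17), a candidate list containing 61 yields the factorization and d = 2753.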

Summary