Core dump and backtrace

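Our project uses core dumps to debug program crashes. In a diskless environment, however, the large core file often cannot be transferred out within the allotted time. Under Dr. Liu's guidance, we therefore also want to log a backtrace at the moment of the crash, so that even without a core file we can roughly locate where the problem occurred. I took this opportunity to learn properly how to debug with a core dump, and how to implement and analyze a backtrace.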

1. Core dump

On most Linux distributions core dump is disabled by default. The ulimit -c command shows the size limit for core files and can also be used to enable core dump.

$ulimit -c
0 #a limit of 0 means no core file can be created; core dump is disabled

1.1 Enabling core dump

Edit /etc/profile and find the following lines:

# No core files by default
ulimit -S -c 0 > /dev/null 2>&1

Change them to:

# No core files by default
ulimit -S -c unlimited > /dev/null 2>&1

You can also edit /etc/sysctl.conf to configure the core file name, its storage location, and related settings, as follows:

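kernel.core_uses_pid = 1 - Appends the process ID to the core file name, so the generated file is named core.PID.
fs.suid_dumpable = 2 - Ensures that core files are produced for programs that have called setuid. The possible values are (see reference 2):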

0 (default) - This provides the traditional behaviour. A core dump will not be produced for a process which has changed credentials (by calling seteuid(2), setgid(2), or similar, or by executing a set-user-ID or set-group-ID program) or whose binary does not have read permission enabled.
1 (“debug”) - All processes dump core when possible. The core dump is owned by the file system user ID of the dumping process and no security is applied. This is intended for system debugging situations only. Ptrace is unchecked.
2 (“suidsafe”) - Any binary which normally would not be dumped (see “0” above) is dumped readable by root only. This allows the user to remove the core dump file but not to read it. For security reasons core dumps in this mode will not overwrite one another or other files. This mode is appropriate when administrators are attempting to debug problems in a normal environment.

kernel.core_pattern = /tmp/core-%e-%s-%u-%g-%p-%t - When the application terminates abnormally, a core file should appear in /tmp. The kernel.core_pattern sysctl controls the exact location of the core file. The file name is defined by a template that can contain % specifiers, which are substituted with the following values when a core file is created:

%% - A single % character
%p - PID of dumped process
%u - real UID of dumped process
%g - real GID of dumped process
%s - number of signal causing dump
%t - time of dump (seconds since 0:00h, 1 Jan 1970)
%h - hostname (same as 'nodename' returned by uname(2))
%e - executable filename

1.2 Debugging with a core file

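Take the backtrace example program from section 2: it crashes with a segmentation fault (signal 11). With core dump enabled (and kernel.core_pattern not configured), a core.PID file such as core.4025 is generated in the program's current working directory.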

$gdb ./backtrace core.4025

...gdb version banner omitted...

Core was generated by './backtrace'.
Program terminated with signal 11, Segmentation fault.
#0 0x08048720 in ?? ()
(gdb) r
Starting program: /home/yanbaoc/test/backtrace
sigaction register ok
0
This is func_a
This is func_b

Program received signal SIGSEGV, Segmentation fault.
0x0804898f in func_b () at backtrace.c:59
59 printf("%d\n", *p);
(gdb)

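When the core file is loaded, gdb reports the frame in which signal 11 terminated the program, #0 0x08048720 in ?? (). This is the crash location, not a breakpoint. Re-running the program under gdb with run stops at the faulting statement on line 59, printf("%d\n", *p);. The crash is therefore most likely an invalid memory reference made while dereferencing this pointer, and a look at the code confirms it.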

2. backtrace

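Calling backtrace() fills an array of pointers with the return addresses of the frames on the current call stack, which gives the call path of the running process. backtrace_symbols() then translates the addresses obtained by backtrace() into strings. A complete example follows.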

/***************************************
* backtrace example
* by YYGCui
****************************************/

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <stdio.h>
#include <signal.h>
#include <execinfo.h>
#include <ucontext.h>

static void SignalHandler(int num, siginfo_t *info, void *ptr)
{
    FILE *fp;
    void *trace[16];
    char **messages = (char **)NULL;
    int trace_size = 0;
    ucontext_t *ucontext = (ucontext_t *)ptr;
    int i = 0;
    /*
    if ((fp = fopen("bt.txt","a")) == NULL)
    {
        printf("Error to open bt.txt, exit");
        fp = stderr;
    }
    */
    fp = stderr;
    
    fprintf(fp, "-----backtrace-----\n");
    
    /* display general registers, NGREG=19 */
    for (i = 0; i < NGREG; i++)
        fprintf(fp, "reg[%02d] = 0x%08x\n", i, ucontext->uc_mcontext.gregs[i]);

    fprintf(fp, "Segmentation Fault! info.si_code = %d, info.si_addr = %p\n", 
            info->si_code, info->si_addr);

    trace_size = backtrace(trace, 16);
    /* overwrite sigaction with caller's address */
    trace[1] = (void *)ucontext->uc_mcontext.gregs[REG_EIP];
    messages = (char **)backtrace_symbols(trace, trace_size);
    
    fprintf(fp, "[Catch Signal]\t[Execution path]\n");
    for (i = 1; i < trace_size; i++)
        fprintf(fp, "Catch Signal:\t%s\n", messages[i]);

    /* only close a log file we opened ourselves, never stderr */
    if (fp != stderr)
        fclose(fp);
}

int func_b(); /* forward declaration: func_a calls func_b before its definition */

int func_a()
{
    printf("This is func_a\n");
    func_b();
    return 0;
}

int func_b()
{
    printf("This is func_b\n");
    /* illegal pointer, si_code = 128 (send by kernel) */
    int *p = (int *)-1;
    printf("%d\n", *p);
}

int main(int argc, char *argv[])
{
    int retVal;
    int c; /* int rather than char, so that EOF can be told apart from input */
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = SignalHandler;
    sa.sa_flags = SA_RESETHAND | SA_SIGINFO;
    
    retVal = sigaction(SIGSEGV, &sa, NULL);
    
    if (retVal != 0)
    {
        fprintf(stderr, "sigaction register failed (%d).\n", retVal);
    }
    else
    {
        printf("sigaction register ok\n");
    }
    /* wait until the user types '0'; also stop at end of input */
    while ((c = getchar()) != '0' && c != EOF);
    func_a();
    
    return 0;
}

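In this example only SIGSEGV is handled. Compile with -g -rdynamic (-rdynamic makes the linker add all symbols to the dynamic symbol table, which backtrace_symbols uses to resolve function names). The difference shows up in the backtrace output: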

$./backtrace
sigaction register ok
0
This is func_a
This is func_b
-----backtrace-----
reg[00] = 0x00000033
reg[01] = 0x00000000
reg[02] = 0x0000002b
reg[03] = 0x0000002b
reg[04] = 0x00000001
reg[05] = 0xbfffd084
reg[06] = 0xbfffcf38
reg[07] = 0xbfffcf28
reg[08] = 0x002bde98
reg[09] = 0x0000000f
reg[10] = 0x002bc2a0
reg[11] = 0xffffffff
reg[12] = 0x0000000d
reg[13] = 0x00000000
reg[14] = 0x0804898f
reg[15] = 0x00000023
reg[16] = 0x00010296
reg[17] = 0xbfffcf28
reg[18] = 0x0000002b
Segmentation Fault! info.si_code = 128, info.si_addr = (nil)
[Catch Signal] [Execution path]
Catch Signal: ./backtrace(func_b+0x23) [0x804898f]
Catch Signal: ./backtrace(func_a+0x1b) [0x804896a]
Catch Signal: ./backtrace(main+0x98) [0x8048a38]
Catch Signal: /lib/tls/libc.so.6(__libc_start_main+0xed) [0x19e79d]
Catch Signal: ./backtrace(backtrace_symbols+0x31) [0x80487a9]
Segmentation fault

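Without -rdynamic the backtrace looks like this (partial output). The function names are wrong or missing, so the problem cannot be located from them.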

Segmentation Fault! info.si_code = 128, info.si_addr = (nil)
[Catch Signal] [Execution path]
Catch Signal: ./nodybk(backtrace_symbols+0x217) [0x8048663]
Catch Signal: ./nodybk(backtrace_symbols+0x1f2) [0x804863e]
Catch Signal: ./nodybk [0x804870c]
Catch Signal: /lib/tls/libc.so.6(__libc_start_main+0xed) [0x9d979d]
Catch Signal: ./nodybk(backtrace_symbols+0x31) [0x804847d]
Segmentation fault

There are two ways to debug from the backtrace:

2.1 Using the addr2line command

addr2line converts an address into the corresponding function name and line number, which points directly at the offending line:

$addr2line 0x804898f -e ./backtrace -f
func_b
/home/yanbaoc/test/backtrace.c:59

2.2 Debugging with GDB

The disassemble command shows the assembly code around the crash site. My assembly is too rusty to go into detail here.

$gdb ./backtrace
(gdb) disassemble func_b+0x23
Dump of assembler code for function func_b:
0x0804896c <func_b+0>: push %ebp
0x0804896d <func_b+1>: mov %esp,%ebp
0x0804896f <func_b+3>: sub $0x8,%esp
0x08048972 <func_b+6>: sub $0xc,%esp
0x08048975 <func_b+9>: push $0x8048bbf
0x0804897a <func_b+14>: call 0x8048748 <printf>
0x0804897f <func_b+19>: add $0x10,%esp
0x08048982 <func_b+22>: movl $0xffffffff,0xfffffffc(%ebp)
0x08048989 <func_b+29>: sub $0x8,%esp
0x0804898c <func_b+32>: mov 0xfffffffc(%ebp),%eax
0x0804898f <func_b+35>: pushl (%eax)
0x08048991 <func_b+37>: push $0x8048bcf
0x08048996 <func_b+42>: call 0x8048748 <printf>
0x0804899b <func_b+47>: add $0x10,%esp
0x0804899e <func_b+50>: leave
0x0804899f <func_b+51>: ret
End of assembler dump.
(gdb)

Alternatively, set a breakpoint at the crash address; gdb then reports which source line it belongs to:

(gdb) break *func_b+0x23
Breakpoint 1 at 0x804898f: file backtrace.c, line 59.
(gdb)

The list command prints the ten lines of source around the crash site:

(gdb) list *func_b+0x23
0x804898f is in func_b (backtrace.c:59).
54     int func_b()
55     {
56         printf("This is func_b\n");
57         /* illegal pointer, si_code = 128 (send by kernel) */
58         int *p = (int *)-1;
59         printf("%d\n", *p);
60     }
61
62     int main(int argc, char *argv[])
63     {
(gdb)

3. Ways to generate SIGSEGV

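SIGSEGV is raised when a process makes an invalid memory reference (a segmentation fault), so it can be produced either by sending SIGSEGV to the process directly or by performing an illegal memory access. There are several ways to generate SIGSEGV, and each produces a different signal code (si_code).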

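3.1 Use kill -11: signal code = 0 (SI_USER), reported when the signal is sent with kill. (On current glibc, raise is implemented with tgkill and reports SI_TKILL instead.)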

$kill -11 PID

3.2 Dereference an illegal pointer: signal code = 128 (SI_KERNEL). The access traps into the kernel, which sends the signal.

int *foo = (int *)-1;
printf("%d\n", *foo);

3.3 Define a null pointer and write through it: signal code = 1 (SEGV_MAPERR), address not mapped to an object.

int *foo = NULL;
*foo = 0;
printf("%d\n", *foo);

3.4 Allocate a block of memory and protect it: signal code = 2 (SEGV_ACCERR), invalid permissions for the mapped object.

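#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

char *foo;
void *mem;
posix_memalign(&mem, getpagesize(), getpagesize()); /* takes a void **, not a char ** */
foo = mem;
memset(foo, 'A', getpagesize());
mprotect(foo, getpagesize(), PROT_NONE); /* no access allowed from here on */
printf("%d\n", foo[0]);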
