Monthly Archives: October 2010

Getting the backtrace from a kernel panic

You may know the following situation. You arrive at the office in the morning, do what you always do, and check out the latest changes to the software you are working on. After a little compile time and the first coffee, you start the freshly built application. Boom, kernel panic. After rebooting and looking through the changes, you may have an idea what the reason could be: a colleague of yours is working on a fancy new feature that required changes to a kernel module. As you know almost nothing about this code, you ask him for help, and since the panic of course does not happen on his computer, he asks for a backtrace. You now have two problems. First, you need to see the panic yourself; second, it would be nice to have a copy of the backtrace to share in a bug tracker. In this post I will show how both aims can easily be achieved.

Let the kernel manage the graphical modes

As most people work under X11, they don’t see the output of a kernel panic. When a kernel panic happens, the kernel prints the reason for the panic and a kernel backtrace to the console and immediately stops its own execution. Nothing is written to a log file or anywhere else. As a consequence you cannot read the panic text, because the graphical mode is still active. Historically, mode setting was done by the graphics driver of the X11 system, so the kernel had no idea whether a graphics mode was in use, or which one. Fortunately the kernel hackers invented a new infrastructure that lets the kernel do the mode switch. This subsystem is called Kernel Mode Setting (KMS). As the kernel does the mode setting itself, it can switch back to the console on a panic, regardless of which graphical mode is currently configured.

Besides this, KMS brings other improvements such as fast user switching and flicker-free switching between text and graphics mode. On the other hand, it is highly hardware dependent, and even though it was introduced with version 2.6.29, not all hardware available today can make use of it. If you own an Intel graphics card you are in good shape. Radeon and NVidia cards have limited support through the in-kernel drivers radeon and nouveau. For an Intel i915 card you need to enable the following kernel options:

CONFIG_DRM_I915=y
Location:
-> Device Drivers
-> Graphics support
-> Direct Rendering Manager (XFree86 4.1.0 and higher DRI support) (DRM [=y])
-> Intel 830M, 845G, 852GM, 855GM, 865G ( [=y])

CONFIG_DRM_I915_KMS=y
Location:
-> Device Drivers
-> Graphics support
-> Direct Rendering Manager (XFree86 4.1.0 and higher DRI support) (DRM [=y])
-> Intel 830M, 845G, 852GM, 855GM, 865G ( [=y])
-> i915 driver (DRM_I915 [=y])
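Whether both options actually ended up in the kernel configuration can be checked before rebuilding. The following is a minimal sketch; the helper name kconfig_value and the default path /usr/src/linux/.config are assumptions for illustration, so adjust the path to wherever your kernel sources live.

```shell
#!/bin/sh
# kconfig_value OPTION [CONFIG_FILE]
# Prints y, m, or another assigned value for OPTION, or "n" if the option
# is not set. The default config path is an assumption (typical on Gentoo).
kconfig_value() {
    option="$1"
    config="${2:-/usr/src/linux/.config}"
    # Set options look like "CONFIG_FOO=y"; unset ones appear only as
    # a comment ("# CONFIG_FOO is not set") and therefore do not match.
    value=$(sed -n "s/^${option}=//p" "$config")
    echo "${value:-n}"
}
```

For example, kconfig_value CONFIG_DRM_I915_KMS should print y once the second option above is enabled.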

The kernel line in your favorite boot loader needs the following additional parameter:

i915.modeset=1

X11 should have this minimal configuration for the device section:

Section "Device"
 Identifier    "i915"
 Driver        "intel"
 Option        "DRI"   "true"
EndSection

Please note that you of course need a reasonably recent kernel, X11 version and Intel X11 driver to make this work. After compiling, installing and booting the new kernel, KMS should be in use. You will notice it because the boot messages are printed in a much higher resolution than the usual text mode provides. The next time a kernel panic occurs, the kernel will switch back to the console before the panic is printed. This allows you to read the output, and maybe it gives you a useful hint about the reason for the panic.
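Besides watching the boot messages, the state of the modeset parameter can be read back from sysfs once the i915 driver is loaded. This is only a sketch: the function name kms_state is made up for illustration, the file may be readable by root only, and the mapping of values other than 0 and 1 to "driver default" is an assumption.

```shell
#!/bin/sh
# kms_state [FILE]
# Reports the i915 modeset parameter. FILE defaults to the sysfs entry
# of the i915 module; passing another file is useful for testing.
kms_state() {
    param_file="${1:-/sys/module/i915/parameters/modeset}"
    if [ ! -r "$param_file" ]; then
        # Driver not loaded, or insufficient permissions.
        echo "unknown"
        return 1
    fi
    case "$(cat "$param_file")" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "driver default" ;;
    esac
}
```

Running kms_state as root after booting with i915.modeset=1 should report enabled.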

Post the panic

If you can’t use KMS or don’t want to transcribe the panic text by hand into the bug tracker, it would be nice if the text could be made available on another computer. Kernel hackers usually use the serial port for that. Unfortunately most modern computers don’t have a serial port anymore. You also need two hosts with a serial port, and the setup is complex (you have to know about baud rates, parity and the like). But there is a simpler solution: netconsole. Netconsole is a kernel module that sends kernel messages over the network using UDP. The setup is really simple. In the kernel configuration you need the following setting:

CONFIG_NETCONSOLE=m
Location:
-> Device Drivers
-> Network device support (NETDEVICES [=y])

I prefer to compile it as a module, which allows me to turn it on only when I need it. Load it with the following command:

modprobe netconsole netconsole=@/,@192.168.220.10/
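The parameter follows the pattern [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr] from the kernel's netconsole documentation; fields left empty fall back to the module defaults. A small helper can assemble the string. The function name netconsole_param is hypothetical, and this sketch only covers the target address, target port and local device.

```shell
#!/bin/sh
# netconsole_param TARGET_IP [TARGET_PORT] [LOCAL_DEV]
# Builds the netconsole= module parameter. Empty optional fields are
# left blank so that the kernel uses its own defaults for them.
netconsole_param() {
    tgt_ip="$1"
    tgt_port="${2:-}"
    dev="${3:-}"
    printf 'netconsole=@/%s,%s@%s/\n' "$dev" "$tgt_port" "$tgt_ip"
}

# Possible usage (as root):
#   modprobe netconsole "$(netconsole_param 192.168.220.10 6666 eth0)"
```

Called with only the address, the helper reproduces the parameter used in the command above.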

The IP address has to be replaced by that of your target computer. You can of course tune it much further, for example by setting source and target ports or even letting netconsole send the text to more than one host. On the receiving computer you need a network tool that can read from a socket and print what it reads to stdout. Netcat and nc are two tools that can do just that. The call for nc looks like the following:

nc -l -u 6666
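To keep a copy for the bug tracker, the listener output can be piped through a small filter that also writes each line to a file. The sketch below is an assumption-laden example, not part of netconsole: the function name stamp_lines and the log file name are made up, and some netcat variants expect the port as nc -l -u -p 6666 instead.

```shell
#!/bin/sh
# stamp_lines LOGFILE
# Copies stdin to stdout unchanged and appends every line, prefixed
# with the local receive time, to LOGFILE.
stamp_lines() {
    logfile="$1"
    # The "|| [ -n ... ]" part keeps a final line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line" >> "$logfile"
    done
}

# Possible usage on the receiving computer:
#   nc -l -u 6666 | stamp_lines panic.log
```

The timestamps reflect when a line arrived on the receiving host, not when the kernel emitted it.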

If a kernel panic happens now, you will see output like this:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] rb_erase+0x15c/0x320
PGD 6942f067 PUD a1e4067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/md1/dev
CPU 3
Modules linked in: vboxnetadp vboxnetflt vboxdrv netconsole ...

Pid: 18887, comm: VirtualBox Tainted: G        W   2.6.36-gentoo #4 DG33TL/
RIP: 0010:[]  [] rb_erase+0x15c/0x320
RSP: 0018:ffff8800b430db58  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880069557a68 RCX: 0000000000000001
RDX: ffff880069557a68 RSI: ffff880001d8ed58 RDI: 0000000000000000
RBP: ffff8800b430db68 R08: 0000000000000001 R09: 000000008edcb5d6
R10: 0000000000000000 R11: 0000000000000202 R12: ffff880001d8ed58
R13: 0000000000000000 R14: 000000000000ed00 R15: 0000000000000002
FS:  00007fffde457710(0000) GS:ffff880001d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000064f9000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process VirtualBox (pid: 18887, threadinfo ffff8800b430c000, task ffff880091e227f0)
Stack:
 ffff88000a03ba18 ffff880001d8ed48 ffff8800b430dba8 ffffffff8105bf06
<0> ffff8800b430dba8 ffffffff8105c97c ffff8800b430dbc8 ffff88000a03ba18
<0> 00004ff8a86ba455 ffff880001d8ed48 ffff8800b430dc48 ffffffff8105ce77
Call Trace:
 [] __remove_hrtimer+0x36/0xb0
 [] ? lock_hrtimer_base+0x2c/0x60
 [] __hrtimer_start_range_ns+0x2b7/0x3c0
 [] ? rtR0SemEventMultiLnxWait+0x250/0x3d0 [vboxdrv]
 [] ? RTLogLoggerExV+0x12f/0x180 [vboxdrv]
 [] hrtimer_start+0x13/0x20
 [] rtTimerLnxStartSubTimer+0x60/0x120 [vboxdrv]
 [] rtTimerLnxStartOnSpecificCpu+0x21/0x30 [vboxdrv]
 [] rtmpLinuxWrapper+0x23/0x30 [vboxdrv]
 [] RTMpOnSpecific+0x99/0xa0 [vboxdrv]
 [] ? rtTimerLnxStartOnSpecificCpu+0x0/0x30 [vboxdrv]
 [] RTTimerStart+0x2a6/0x2e0 [vboxdrv]
 [] ? g_abExecMemory+0x33665/0x180000 [vboxdrv]
 [] g_abExecMemory+0xc678/0x180000 [vboxdrv]
 [] g_abExecMemory+0x328d7/0x180000 [vboxdrv]
 [] supdrvIOCtlFast+0x6a/0x70 [vboxdrv]
 [] VBoxDrvLinuxIOCtl+0x47/0x1e0 [vboxdrv]
 [] ? pick_next_task_fair+0xde/0x150
 [] do_vfs_ioctl+0xa1/0x590
 [] ? sys_futex+0x76/0x170
 [] sys_ioctl+0x4a/0x80
 [] system_call_fastpath+0x16/0x1b
Code: 07 a8 01 75 9d eb 81 0f 1f 84 00 00 00 00 00 48 3b 78 10 0f 84 ...
RIP  [] rb_erase+0x15c/0x320
 RSP
CR2: 0000000000000000
---[ end trace 4eaa2a86a8e2da24 ]---

Normally only kernel panics are sent to the console. You can increase the verbosity level by executing dmesg -n 8 as root.
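The level that dmesg -n changes is the first of the four numbers in /proc/sys/kernel/printk, so it can be read back to confirm the setting. A minimal sketch, with console_loglevel being a name chosen here for illustration:

```shell
#!/bin/sh
# console_loglevel [FILE]
# Prints the current console log level, i.e. the first field of the
# printk sysctl file. FILE can be overridden for testing.
console_loglevel() {
    printk_file="${1:-/proc/sys/kernel/printk}"
    # Fields: current, default message, minimum and boot-time default level.
    read -r current _ < "$printk_file"
    echo "$current"
}
```

After dmesg -n 8 the function should print 8.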

Conclusion

To continue the story from the beginning: with the methods shown, you can hope your colleague gets enough information to find the reason for the kernel panic. To be more helpful, the next step would be to try to debug the problem yourself. Even though the KDB front end for KGDB was merged into the kernel in version 2.6.35, it is not really usable for me. The reason seems to be that kernel hackers usually have really old hardware which has a serial port, a PS/2 keyboard, or both. Otherwise I can’t find a reason why USB keyboards don’t work. I asked on the KGDB mailing list about the status of USB keyboard support, and I can only hope support will be integrated soon.

Using suppression files with Valgrind

Valgrind is one of the great tools in the long list of freely available applications for development. Besides several profiling tools, it also contains a memory checker. Leaking memory is one of the more common errors a programmer can run into. Basically it means forgetting to free memory (or, in a more general sense, any resource) a program has acquired. If you are a perfect developer, this will never happen to you. If you are a good developer, it may happen, and that’s where Valgrind will save you some trouble. As most developers out there are more or less good developers, their programs produce memory leaks, too ;). The right solution is of course to write a bug report. But there are times when this isn’t possible, or you are in a hurry and don’t want to see all the errors of a third-party library you link against.

In this post I will show how to suppress such unwanted error messages, which makes it much easier to analyze Valgrind’s output for your own application.

Installing Valgrind

On Mac OS X you can use MacPorts to install Valgrind. You have to use valgrind-devel if you are on Snow Leopard, because Snow Leopard is supported in the current development version only. It’s as simple as typing sudo port install valgrind-devel.

On Gentoo it can be a bit harder. The current stable version is 3.5 (as in MacPorts). If you try this version (at least on an unstable Gentoo like mine) with valgrind ls, you will get the following error:

valgrind:  Fatal error at startup: a function redirection
valgrind:  which is mandatory for this platform-tool combination
valgrind:  cannot be set up.  Details of the redirection are:
valgrind:
valgrind:  A must-be-redirected function
valgrind:  whose name matches the pattern:      strlen
valgrind:  in an object with soname matching:   ld-linux-x86-64.so.2
valgrind:  was not found whilst processing
valgrind:  symbols from the object with soname: ld-linux-x86-64.so.2
valgrind:
valgrind:  Possible fixes: (1, short term): install glibc's debuginfo
valgrind:  package on this machine.  (2, longer term): ask the packagers
valgrind:  for your Linux distribution to please in future ship a non-
valgrind:  stripped ld.so (or whatever the dynamic linker .so is called)
valgrind:  that exports the above-named function using the standard
valgrind:  calling conventions for this platform.
valgrind:
valgrind:  Cannot continue -- exiting now.  Sorry.

The reason is a stripped glibc. To work properly, Valgrind needs to override some of the system functions glibc provides. It does this by looking up the symbols by name in the library, which is of course not possible if all the symbol names have been removed. You can verify this by executing nm /lib/ld-linux-x86-64.so.2. Gentoo provides FEATURES=splitdebug, which installs separate debug files alongside the libraries. Unfortunately, setting this feature in /etc/make.conf means setting it globally. But Gentoo is known for being configurable like no other distribution, and of course we can set a feature for one package only. To do so, create a file called glibc in /etc/portage/env/sys-libs/ and add the following content to it.

FEATURES="splitdebug"

After a rebuild of glibc by executing emerge --oneshot glibc, we have a working Valgrind.

Like all programs, Valgrind isn’t perfect. Version 3.5 shows many false positives on my system, but fortunately development goes on. Currently there is no newer version available in the Gentoo tree. Still, you don’t have to build one yourself to get a more recent version. Using layman and Flameeyes’ overlay lets you integrate the development version of Valgrind seamlessly into your system. For a general how-to on layman, check out this Users’ guide. In short, something like the following should be sufficient:

layman -a flameeyes-overlay
layman -s flameeyes-overlay
echo "=dev-util/valgrind-9999 **" >> /etc/portage/package.keywords
emerge valgrind

Installing the development version of Valgrind is optional of course.

Know your tools

A typical Valgrind invocation could look like this:

valgrind --leak-check=full --leak-resolution=high ./VirtualBox

Besides other errors, it also shows this error message on my system:

==27174==    at 0x4C26C09: memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27174==    by 0x4C26CB9: posix_memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27174==    by 0xBA8967F: ??? (in /usr/lib64/libglib-2.0.so.0.2400.2)
==27174==    by 0xBA89E9D: g_slice_alloc (in /usr/lib64/libglib-2.0.so.0.2400.2)
==27174==    by 0xBA89F86: g_slice_alloc0 (in /usr/lib64/libglib-2.0.so.0.2400.2)
==27174==    by 0xC204847: g_type_create_instance (in /usr/lib64/libgobject-2.0.so.0.2400.2)
==27174==    by 0xC1EB8A5: ??? (in /usr/lib64/libgobject-2.0.so.0.2400.2)
==27174==    by 0xC1ECE5D: g_object_newv (in /usr/lib64/libgobject-2.0.so.0.2400.2)
==27174==    by 0xC1ED494: g_object_new (in /usr/lib64/libgobject-2.0.so.0.2400.2)
==27174==    by 0x72A495F: ??? (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x72A0D4F: ??? (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x7289264: QGtkStyle::QGtkStyle() (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x7215DB6: QStyleFactory::create(QString const&) (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x6F5B7FC: QApplication::style() (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x6F61DFF: QApplicationPrivate::initialize() (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x6F61E88: QApplicationPrivate::construct(_XDisplay*, unsigned long, unsigned long) (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x6F61FF3: QApplication::QApplication(_XDisplay*, int&, char**, unsigned long, unsigned long, int) (in /usr/lib64/qt4/libQtGui.so.4.6.3)
==27174==    by 0x44BC38: TrustedMain (main.cpp:371)
==27174==    by 0x44C649: main (main.cpp:651)

If you analyze the backtrace, you see that something in libQtGui is leaking memory. I don’t want to blame anyone for it or make a statement about whether this is right or wrong; I just want to get rid of it so that I can easily spot errors VirtualBox itself produces. To do so, add --gen-suppressions=all to the Valgrind call. This will produce something like this:

{
 <insert_a_suppression_name_here>
 Memcheck:Leak
 fun:memalign
 fun:posix_memalign
 obj:/usr/lib64/libglib-2.0.so.0.2400.2
 fun:g_slice_alloc
 fun:g_slice_alloc0
 fun:g_type_create_instance
 obj:/usr/lib64/libgobject-2.0.so.0.2400.2
 fun:g_object_newv
 fun:g_object_new
 obj:/usr/lib64/qt4/libQtGui.so.4.6.3
 obj:/usr/lib64/qt4/libQtGui.so.4.6.3
 fun:_ZN9QGtkStyleC1Ev
 fun:_ZN13QStyleFactory6createERK7QString
 fun:_ZN12QApplication5styleEv
 fun:_ZN19QApplicationPrivate10initializeEv
 fun:_ZN19QApplicationPrivate9constructEP9_XDisplaymm
 fun:_ZN12QApplicationC1EP9_XDisplayRiPPcmmi
 fun:TrustedMain
 fun:main
}
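Valgrind prints these blocks interleaved with its normal report, so with many errors it can be convenient to filter them out mechanically. The following sketch assumes the report was saved to a file first (for example with --log-file=valgrind.log); the function name extract_supps and the file names are illustrative only.

```shell
#!/bin/sh
# extract_supps
# Reads a Valgrind report on stdin and prints only the suppression
# blocks, i.e. everything from a line containing just "{" up to the
# matching line containing just "}".
extract_supps() {
    sed -n '/^{$/,/^}$/p'
}

# Possible usage:
#   valgrind --leak-check=full --gen-suppressions=all \
#            --log-file=valgrind.log ./VirtualBox
#   extract_supps < valgrind.log > vbox.supp
```

The result still contains one block per reported error, including duplicates, so it is worth reviewing before use.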

To let Valgrind ignore this error in the future, copy the text into a file vbox.supp and start Valgrind with --suppressions=vbox.supp. Voilà, this specific error isn’t shown anymore. The format is easy to understand, and you can of course tweak it much further. For example, you could replace some of the fun: entries with “...”. This is a placeholder that matches zero or more function calls with any name. Besides making suppression rules more general, you can of course add as many rules as you like. Giving each rule a descriptive name on its first line makes it easy to tell the rules apart. For all the possibilities, have a look at the documentation. Just for the curious: Valgrind uses such a file itself. Have a look at /usr/lib/valgrind/default.supp.

You may also have noticed that the function names in the normal error message differ from the ones in the suppression list. The former are in demangled form, the latter in mangled form. You can force Valgrind to print mangled function names by adding the --demangle=no parameter to the call. This comes in handy if you create suppression lists manually.
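As an illustration of a named rule with wildcards, the leak from the QGtkStyle constructor could be condensed as sketched below. The rule name qt-gtkstyle-gobject-leak and the helper write_gtkstyle_supp are made up for this example, and which frames to keep is a judgment call: fewer fixed frames make the rule match more call paths.

```shell
#!/bin/sh
# write_gtkstyle_supp FILE
# Writes a generalized version of the suppression to FILE. Each "..."
# line stands for any number of intermediate frames.
write_gtkstyle_supp() {
    cat > "$1" <<'EOF'
{
   qt-gtkstyle-gobject-leak
   Memcheck:Leak
   fun:memalign
   ...
   fun:g_object_new
   ...
   fun:_ZN9QGtkStyleC1Ev
}
EOF
}

# Possible usage:
#   write_gtkstyle_supp vbox.supp
#   valgrind --leak-check=full --suppressions=vbox.supp ./VirtualBox
```

Because the rule no longer names library versions or the frames above the constructor, it should survive minor updates of glib and Qt.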

Conclusion

By using suppression rules for your own application, unimportant errors can be eliminated from the output of Valgrind. With this in mind, there is no excuse anymore for memory leaks in self-developed applications. Besides memory leaks, Valgrind also finds places where uninitialized variables are used or where memory is accessed that wasn’t allocated by the application. These findings, too, can be filtered out by suppression rules.
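Rules for these other error kinds use the same layout; only the kind after Memcheck: changes (for example Cond for conditional jumps on uninitialized values, or Addr4 for invalid four-byte accesses). The helper below is a hypothetical convenience for appending such rules; its name, the rule name and the library pattern in the usage line are placeholders, not taken from a real report.

```shell
#!/bin/sh
# append_supp FILE NAME KIND FRAME...
# Appends one Memcheck suppression rule to FILE. Each FRAME is a
# complete line such as "fun:some_function", "obj:/path/to/lib.so" or
# "...". Frame patterns may contain the wildcards * and ?.
append_supp() {
    file="$1"
    name="$2"
    kind="$3"
    shift 3
    {
        echo "{"
        echo "   $name"
        echo "   Memcheck:$kind"
        for frame in "$@"; do
            echo "   $frame"
        done
        echo "}"
    } >> "$file"
}

# Possible usage (library name is a placeholder):
#   append_supp vbox.supp thirdparty-uninit Cond 'obj:*/libthirdparty.so*'
```

As with leaks, it is safer to let --gen-suppressions=all produce the first version of a rule and only then generalize it.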