Cc: Vegard Nossum <vegardno@xxxxxxxxxx> Cc: Pekka Enberg <penberg@xxxxxxxxxx> Signed-off-by: Jonathan Corbet <corbet@xxxxxxx> --- Documentation/dev-tools/kmemcheck.rst | 733 +++++++++++++++++++++++++++++++++ Documentation/dev-tools/tools.rst | 1 + Documentation/kmemcheck.txt | 754 ---------------------------------- MAINTAINERS | 2 +- 4 files changed, 735 insertions(+), 755 deletions(-) create mode 100644 Documentation/dev-tools/kmemcheck.rst delete mode 100644 Documentation/kmemcheck.txt diff --git a/Documentation/dev-tools/kmemcheck.rst b/Documentation/dev-tools/kmemcheck.rst new file mode 100644 index 0000000..778f515 --- /dev/null +++ b/Documentation/dev-tools/kmemcheck.rst @@ -0,0 +1,733 @@ +Getting started with kmemcheck +============================== + +Vegard Nossum <vegardno@xxxxxxxxxx> + + +Introduction +------------ + +kmemcheck is a debugging feature for the Linux Kernel. More specifically, it +is a dynamic checker that detects and warns about some uses of uninitialized +memory. + +Userspace programmers might be familiar with Valgrind's memcheck. The main +difference between memcheck and kmemcheck is that memcheck works for userspace +programs only, and kmemcheck works for the kernel only. The implementations +are of course vastly different. Because of this, kmemcheck is not as accurate +as memcheck, but it turns out to be good enough in practice to discover real +programmer errors that the compiler is not able to find through static +analysis. + +Enabling kmemcheck on a kernel will probably slow it down to the extent that +the machine will not be usable for normal workloads such as e.g. an +interactive desktop. kmemcheck will also cause the kernel to use about twice +as much memory as normal. For this reason, kmemcheck is strictly a debugging +feature. + + +Downloading +----------- + +As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. + + +Configuring and compiling +------------------------- + +kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of +configuration variables must have specific settings in order for the kmemcheck +menu to even appear in "menuconfig". These are: + +- ``CONFIG_CC_OPTIMIZE_FOR_SIZE=n`` + This option is located under "General setup" / "Optimize for size". + + Without this, gcc will use certain optimizations that usually lead to + false positive warnings from kmemcheck. An example of this is a 16-bit + field in a struct, where gcc may load 32 bits, then discard the upper + 16 bits. kmemcheck sees only the 32-bit load, and may trigger a + warning for the upper 16 bits (if they're uninitialized). + +- ``CONFIG_SLAB=y`` or ``CONFIG_SLUB=y`` + This option is located under "General setup" / "Choose SLAB + allocator". + +- ``CONFIG_FUNCTION_TRACER=n`` + This option is located under "Kernel hacking" / "Tracers" / "Kernel + Function Tracer" + + When function tracing is compiled in, gcc emits a call to another + function at the beginning of every function. This means that when the + page fault handler is called, the ftrace framework will be called + before kmemcheck has had a chance to handle the fault. If ftrace then + modifies memory that was tracked by kmemcheck, the result is an + endless recursive page fault. + +- ``CONFIG_DEBUG_PAGEALLOC=n`` + This option is located under "Kernel hacking" / "Memory Debugging" + / "Debug page memory allocations". + +In addition, I highly recommend turning on ``CONFIG_DEBUG_INFO=y``. This is also +located under "Kernel hacking". With this, you will be able to get line number +information from the kmemcheck warnings, which is extremely valuable in +debugging a problem. This option is not mandatory, however, because it slows +down the compilation process and produces a much bigger kernel image. + +Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory +Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows +a description of the kmemcheck configuration variables: + +- ``CONFIG_KMEMCHECK`` + This must be enabled in order to use kmemcheck at all... + +- ``CONFIG_KMEMCHECK_``[``DISABLED`` | ``ENABLED`` | ``ONESHOT``]``_BY_DEFAULT`` + This option controls the status of kmemcheck at boot-time. "Enabled" + will enable kmemcheck right from the start, "disabled" will boot the + kernel as normal (but with the kmemcheck code compiled in, so it can + be enabled at run-time after the kernel has booted), and "one-shot" is + a special mode which will turn kmemcheck off automatically after + detecting the first use of uninitialized memory. + + If you are using kmemcheck to actively debug a problem, then you + probably want to choose "enabled" here. + + The one-shot mode is mostly useful in automated test setups because it + can prevent floods of warnings and increase the chances of the machine + surviving in case something is really wrong. In other cases, the one- + shot mode could actually be counter-productive because it would turn + itself off at the very first error -- in the case of a false positive + too -- and this would come in the way of debugging the specific + problem you were interested in. + + If you would like to use your kernel as normal, but with a chance to + enable kmemcheck in case of some problem, it might be a good idea to + choose "disabled" here. When kmemcheck is disabled, most of the run- + time overhead is not incurred, and the kernel will be almost as fast + as normal. + +- ``CONFIG_KMEMCHECK_QUEUE_SIZE`` + Select the maximum number of error reports to store in an internal + (fixed-size) buffer. Since errors can occur virtually anywhere and in + any context, we need a temporary storage area which is guaranteed not + to generate any other page faults when accessed. The queue will be + emptied as soon as a tasklet may be scheduled. If the queue is full, + new error reports will be lost. + + The default value of 64 is probably fine. If some code produces more + than 64 errors within an irqs-off section, then the code is likely to + produce many, many more, too, and these additional reports seldom give + any more information (the first report is usually the most valuable + anyway). + + This number might have to be adjusted if you are not using serial + console or similar to capture the kernel log. If you are using the + "dmesg" command to save the log, then getting a lot of kmemcheck + warnings might overflow the kernel log itself, and the earlier reports + will get lost in that way instead. Try setting this to 10 or so on + such a setup. + +- ``CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT`` + Select the number of shadow bytes to save along with each entry of the + error-report queue. These bytes indicate what parts of an allocation + are initialized, uninitialized, etc. and will be displayed when an + error is detected to help the debugging of a particular problem. + + The number entered here is actually the logarithm of the number of + bytes that will be saved. So if you pick for example 5 here, kmemcheck + will save 2^5 = 32 bytes. + + The default value should be fine for debugging most problems. It also + fits nicely within 80 columns. + +- ``CONFIG_KMEMCHECK_PARTIAL_OK`` + This option (when enabled) works around certain GCC optimizations that + produce 32-bit reads from 16-bit variables where the upper 16 bits are + thrown away afterwards. + + The default value (enabled) is recommended. This may of course hide + some real errors, but disabling it would probably produce a lot of + false positives. + +- ``CONFIG_KMEMCHECK_BITOPS_OK`` + This option silences warnings that would be generated for bit-field + accesses where not all the bits are initialized at the same time. This + may also hide some real bugs. + + This option is probably obsolete, or it should be replaced with + the kmemcheck-/bitfield-annotations for the code in question. The + default value is therefore fine. + +Now compile the kernel as usual. + + +How to use +---------- + +Booting +~~~~~~~ + +First some information about the command-line options. There is only one +option specific to kmemcheck, and this is called "kmemcheck". It can be used +to override the default mode as chosen by the ``CONFIG_KMEMCHECK_*_BY_DEFAULT`` +option. Its possible settings are: + +- ``kmemcheck=0`` (disabled) +- ``kmemcheck=1`` (enabled) +- ``kmemcheck=2`` (one-shot mode) + +If SLUB debugging has been enabled in the kernel, it may take precedence over +kmemcheck in such a way that the slab caches which are under SLUB debugging +will not be tracked by kmemcheck. In order to ensure that this doesn't happen +(even though it shouldn't by default), use SLUB's boot option ``slub_debug``, +like this: ``slub_debug=-`` + +In fact, this option may also be used for fine-grained control over SLUB vs. +kmemcheck. For example, if the command line includes +``kmemcheck=1 slub_debug=,dentry``, then SLUB debugging will be used only +for the "dentry" slab cache, and with kmemcheck tracking all the other +caches. This is advanced usage, however, and is not generally recommended. + + +Run-time enable/disable +~~~~~~~~~~~~~~~~~~~~~~~ + +When the kernel has booted, it is possible to enable or disable kmemcheck at +run-time. WARNING: This feature is still experimental and may cause false +positive warnings to appear. Therefore, try not to use this. If you find that +it doesn't work properly (e.g. you see an unreasonable amount of warnings), I +will be happy to take bug reports. + +Use the file ``/proc/sys/kernel/kmemcheck`` for this purpose, e.g.:: + + $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck + +The numbers are the same as for the ``kmemcheck=`` command-line option. + + +Debugging +~~~~~~~~~ + +A typical report will look something like this:: + + WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) + 80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + ^ + + Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A + RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 + RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 + RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 + RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 + RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 + R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e + R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 + FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 + CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 + CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 + DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 + [<ffffffff8100c7b5>] int_signal+0x12/0x17 + [<ffffffffffffffff>] 0xffffffffffffffff + +The single most valuable information in this report is the RIP (or EIP on 32- +bit) value. This will help us pinpoint exactly which instruction that caused +the warning. + +If your kernel was compiled with ``CONFIG_DEBUG_INFO=y``, then all we have to do +is give this address to the addr2line program, like this:: + + $ addr2line -e vmlinux -i ffffffff8104ede8 + arch/x86/include/asm/string_64.h:12 + include/asm-generic/siginfo.h:287 + kernel/signal.c:380 + kernel/signal.c:410 + +The "``-e vmlinux``" tells addr2line which file to look in. **IMPORTANT:** +This must be the vmlinux of the kernel that produced the warning in the +first place! If not, the line number information will almost certainly be +wrong. + +The "``-i``" tells addr2line to also print the line numbers of inlined +functions. In this case, the flag was very important, because otherwise, +it would only have printed the first line, which is just a call to +``memcpy()``, which could be called from a thousand places in the kernel, and +is therefore not very useful. These inlined functions would not show up in +the stack trace above, simply because the kernel doesn't load the extra +debugging information. This technique can of course be used with ordinary +kernel oopses as well. + +In this case, it's the caller of ``memcpy()`` that is interesting, and it can be +found in ``include/asm-generic/siginfo.h``, line 287:: + + 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) + 282 { + 283 if (from->si_code < 0) + 284 memcpy(to, from, sizeof(*to)); + 285 else + 286 /* _sigchld is currently the largest know union member */ + 287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); + 288 } + +Since this was a read (kmemcheck usually warns about reads only, though it can +warn about writes to unallocated or freed memory as well), it was probably the +"from" argument which contained some uninitialized bytes. Following the chain +of calls, we move upwards to see where "from" was allocated or initialized, +``kernel/signal.c``, line 380:: + + 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) + 360 { + ... + 367 list_for_each_entry(q, &list->list, list) { + 368 if (q->info.si_signo == sig) { + 369 if (first) + 370 goto still_pending; + 371 first = q; + ... + 377 if (first) { + 378 still_pending: + 379 list_del_init(&first->list); + 380 copy_siginfo(info, &first->info); + 381 __sigqueue_free(first); + ... + 392 } + 393 } + +Here, it is ``&first->info`` that is being passed on to ``copy_siginfo()``. The +variable ``first`` was found on a list -- passed in as the second argument to +``collect_signal()``. We continue our journey through the stack, to figure out +where the item on "list" was allocated or initialized. We move to line 410:: + + 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, + 396 siginfo_t *info) + 397 { + ... + 410 collect_signal(sig, pending, info); + ... + 414 } + +Now we need to follow the ``pending`` pointer, since that is being passed on to +``collect_signal()`` as ``list``. At this point, we've run out of lines from the +"addr2line" output. Not to worry, we just paste the next addresses from the +kmemcheck stack dump, i.e.:: + + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 + [<ffffffff8100c7b5>] int_signal+0x12/0x17 + + $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ + ffffffff8100b87d ffffffff8100c7b5 + kernel/signal.c:446 + kernel/signal.c:1806 + arch/x86/kernel/signal.c:805 + arch/x86/kernel/signal.c:871 + arch/x86/kernel/entry_64.S:694 + +Remember that since these addresses were found on the stack and not as the +RIP value, they actually point to the _next_ instruction (they are return +addresses). This becomes obvious when we look at the code for line 446:: + + 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) + 423 { + ... + 431 signr = __dequeue_signal(&tsk->signal->shared_pending, + 432 mask, info); + 433 /* + 434 * itimer signal ? + 435 * + 436 * itimers are process shared and we restart periodic + 437 * itimers in the signal delivery path to prevent DoS + 438 * attacks in the high resolution timer case. This is + 439 * compliant with the old way of self restarting + 440 * itimers, as the SIGALRM is a legacy signal and only + 441 * queued once. Changing the restart behaviour to + 442 * restart the timer in the signal dequeue path is + 443 * reducing the timer noise on heavy loaded !highres + 444 * systems too. + 445 */ + 446 if (unlikely(signr == SIGALRM)) { + ... + 489 } + +So instead of looking at 446, we should be looking at 431, which is the line +that executes just before 446. Here we see that what we are looking for is +``&tsk->signal->shared_pending``. + +Our next task is now to figure out which function that puts items on this +``shared_pending`` list. A crude, but efficient tool, is ``git grep``:: + + $ git grep -n 'shared_pending' kernel/ + ... + kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; + kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; + ... + +There were more results, but none of them were related to list operations, +and these were the only assignments. We inspect the line numbers more closely +and find that this is indeed where items are being added to the list:: + + 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, + 817 int group) + 818 { + ... + 828 pending = group ? &t->signal->shared_pending : &t->pending; + ... + 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && + 852 (is_si_special(info) || + 853 info->si_code >= 0))); + 854 if (q) { + 855 list_add_tail(&q->list, &pending->list); + ... + 890 } + +and:: + + 1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) + 1310 { + .... + 1339 pending = group ? &t->signal->shared_pending : &t->pending; + 1340 list_add_tail(&q->list, &pending->list); + .... + 1347 } + +In the first case, the list element we are looking for, ``q``, is being +returned from the function ``__sigqueue_alloc()``, which looks like an +allocation function. Let's take a look at it:: + + 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, + 188 int override_rlimit) + 189 { + 190 struct sigqueue *q = NULL; + 191 struct user_struct *user; + 192 + 193 /* + 194 * We won't get problems with the target's UID changing under us + 195 * because changing it requires RCU be used, and if t != current, the + 196 * caller must be holding the RCU readlock (by way of a spinlock) and + 197 * we use RCU protection here + 198 */ + 199 user = get_uid(__task_cred(t)->user); + 200 atomic_inc(&user->sigpending); + 201 if (override_rlimit || + 202 atomic_read(&user->sigpending) <= + 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) + 204 q = kmem_cache_alloc(sigqueue_cachep, flags); + 205 if (unlikely(q == NULL)) { + 206 atomic_dec(&user->sigpending); + 207 free_uid(user); + 208 } else { + 209 INIT_LIST_HEAD(&q->list); + 210 q->flags = 0; + 211 q->user = user; + 212 } + 213 + 214 return q; + 215 } + +We see that this function initializes ``q->list``, ``q->flags``, and +``q->user``. It seems that now is the time to look at the definition of +``struct sigqueue``, e.g.:: + + 14 struct sigqueue { + 15 struct list_head list; + 16 int flags; + 17 siginfo_t info; + 18 struct user_struct *user; + 19 }; + +And, you might remember, it was a ``memcpy()`` on ``&first->info`` that +caused the warning, so this makes perfect sense. It also seems reasonable +to assume that it is the caller of ``__sigqueue_alloc()`` that has the +responsibility of filling out (initializing) this member. + +But just which fields of the struct were uninitialized? Let's look at +kmemcheck's report again:: + + WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) + 80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + ^ + +These first two lines are the memory dump of the memory object itself, and +the shadow bytemap, respectively. The memory object itself is in this case +``&first->info``. Just beware that the start of this dump is NOT the start +of the object itself! The position of the caret (^) corresponds with the +address of the read (ffff88003e4a2024). + +The shadow bytemap dump legend is as follows: + +- i: initialized +- u: uninitialized +- a: unallocated (memory has been allocated by the slab layer, but has not + yet been handed off to anybody) +- f: freed (memory has been allocated by the slab layer, but has been freed + by the previous owner) + +In order to figure out where (relative to the start of the object) the +uninitialized memory was located, we have to look at the disassembly. For +that, we'll need the RIP address again:: + + RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 + + $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: + ffffffff8104edc8: mov %r8,0x8(%r8) + ffffffff8104edcc: test %r10d,%r10d + ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> + ffffffff8104edd5: mov %rax,%rdx + ffffffff8104edd8: mov $0xc,%ecx + ffffffff8104eddd: mov %r13,%rdi + ffffffff8104ede0: mov $0x30,%eax + ffffffff8104ede5: mov %rdx,%rsi + ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) + ffffffff8104edea: test $0x2,%al + ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> + ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) + ffffffff8104edf0: test $0x1,%al + ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> + ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) + ffffffff8104edf5: mov %r8,%rdi + ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> + +As expected, it's the "``rep movsl``" instruction from the ``memcpy()`` +that causes the warning. We know about ``REP MOVSL`` that it uses the register +``RCX`` to count the number of remaining iterations. By taking a look at the +register dump again (from the kmemcheck report), we can figure out how many +bytes were left to copy:: + + RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 + +By looking at the disassembly, we also see that ``%ecx`` is being loaded +with the value ``$0xc`` just before (ffffffff8104edd8), so we are very +lucky. Keep in mind that this is the number of iterations, not bytes. And +since this is a "long" operation, we need to multiply by 4 to get the +number of bytes. So this means that the uninitialized value was encountered +at 4 * (0xc - 0x9) = 12 bytes from the start of the object. + +We can now try to figure out which field of the "``struct siginfo``" that +was not initialized. This is the beginning of the struct:: + + 40 typedef struct siginfo { + 41 int si_signo; + 42 int si_errno; + 43 int si_code; + 44 + 45 union { + .. + 92 } _sifields; + 93 } siginfo_t; + +On 64-bit, the int is 4 bytes long, so it must the union member that has +not been initialized. We can verify this using gdb:: + + $ gdb vmlinux + ... + (gdb) p &((struct siginfo *) 0)->_sifields + $1 = (union {...} *) 0x10 + +Actually, it seems that the union member is located at offset 0x10 -- which +means that gcc has inserted 4 bytes of padding between the members ``si_code`` +and ``_sifields``. We can now get a fuller picture of the memory dump:: + + _----------------------------=> si_code + / _--------------------=> (padding) + | / _------------=> _sifields(._kill._pid) + | | / _----=> _sifields(._kill._uid) + | | | / + -------|-------|-------|-------| + 80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + +This allows us to realize another important fact: ``si_code`` contains the +value 0x80. Remember that x86 is little endian, so the first 4 bytes +"80000000" are really the number 0x00000080. With a bit of research, we +find that this is actually the constant ``SI_KERNEL`` defined in +``include/asm-generic/siginfo.h``:: + + 144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ + +This macro is used in exactly one place in the x86 kernel: In ``send_signal()`` +in ``kernel/signal.c``:: + + 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, + 817 int group) + 818 { + ... + 828 pending = group ? &t->signal->shared_pending : &t->pending; + ... + 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && + 852 (is_si_special(info) || + 853 info->si_code >= 0))); + 854 if (q) { + 855 list_add_tail(&q->list, &pending->list); + 856 switch ((unsigned long) info) { + ... + 865 case (unsigned long) SEND_SIG_PRIV: + 866 q->info.si_signo = sig; + 867 q->info.si_errno = 0; + 868 q->info.si_code = SI_KERNEL; + 869 q->info.si_pid = 0; + 870 q->info.si_uid = 0; + 871 break; + ... + 890 } + +Not only does this match with the ``.si_code`` member, it also matches the place +we found earlier when looking for where siginfo_t objects are enqueued on the +``shared_pending`` list. + +So to sum up: It seems that it is the padding introduced by the compiler +between two struct fields that is uninitialized, and this gets reported when +we do a ``memcpy()`` on the struct. This means that we have identified a false +positive warning. + +Normally, kmemcheck will not report uninitialized accesses in ``memcpy()`` calls +when both the source and destination addresses are tracked. (Instead, we copy +the shadow bytemap as well). In this case, the destination address clearly +was not tracked. We can dig a little deeper into the stack trace from above:: + + arch/x86/kernel/signal.c:805 + arch/x86/kernel/signal.c:871 + arch/x86/kernel/entry_64.S:694 + +And we clearly see that the destination siginfo object is located on the +stack:: + + 782 static void do_signal(struct pt_regs *regs) + 783 { + 784 struct k_sigaction ka; + 785 siginfo_t info; + ... + 804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); + ... + 854 } + +And this ``&info`` is what eventually gets passed to ``copy_siginfo()`` as the +destination argument. + +Now, even though we didn't find an actual error here, the example is still a +good one, because it shows how one would go about to find out what the report +was all about. + + +Annotating false positives +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are a few different ways to make annotations in the source code that +will keep kmemcheck from checking and reporting certain allocations. Here +they are: + +- ``__GFP_NOTRACK_FALSE_POSITIVE`` + This flag can be passed to ``kmalloc()`` or ``kmem_cache_alloc()`` + (therefore also to other functions that end up calling one of + these) to indicate that the allocation should not be tracked + because it would lead to a false positive report. This is a "big + hammer" way of silencing kmemcheck; after all, even if the false + positive pertains to particular field in a struct, for example, we + will now lose the ability to find (real) errors in other parts of + the same struct. + + Example:: + + /* No warnings will ever trigger on accessing any part of x */ + x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); + +- ``kmemcheck_bitfield_begin(name)``/``kmemcheck_bitfield_end(name)`` and + ``kmemcheck_annotate_bitfield(ptr, name)`` + The first two of these three macros can be used inside struct + definitions to signal, respectively, the beginning and end of a + bitfield. Additionally, this will assign the bitfield a name, which + is given as an argument to the macros. + + Having used these markers, one can later use + kmemcheck_annotate_bitfield() at the point of allocation, to indicate + which parts of the allocation is part of a bitfield. + + Example:: + + struct foo { + int x; + + kmemcheck_bitfield_begin(flags); + int flag_a:1; + int flag_b:1; + kmemcheck_bitfield_end(flags); + + int y; + }; + + struct foo *x = kmalloc(sizeof *x); + + /* No warnings will trigger on accessing the bitfield of x */ + kmemcheck_annotate_bitfield(x, flags); + + Note that ``kmemcheck_annotate_bitfield()`` can be used even before the + return value of ``kmalloc()`` is checked -- in other words, passing NULL + as the first argument is legal (and will do nothing). + + +Reporting errors +---------------- + +As we have seen, kmemcheck will produce false positive reports. Therefore, it +is not very wise to blindly post kmemcheck warnings to mailing lists and +maintainers. Instead, I encourage maintainers and developers to find errors +in their own code. If you get a warning, you can try to work around it, try +to figure out if it's a real error or not, or simply ignore it. Most +developers know their own code and will quickly and efficiently determine the +root cause of a kmemcheck report. This is therefore also the most efficient +way to work with kmemcheck. + +That said, we (the kmemcheck maintainers) will always be on the lookout for +false positives that we can annotate and silence. So whatever you find, +please drop us a note privately! Kernel configs and steps to reproduce (if +available) are of course a great help too. + +Happy hacking! + + +Technical description +--------------------- + +kmemcheck works by marking memory pages non-present. This means that whenever +somebody attempts to access the page, a page fault is generated. The page +fault handler notices that the page was in fact only hidden, and so it calls +on the kmemcheck code to make further investigations. + +When the investigations are completed, kmemcheck "shows" the page by marking +it present (as it would be under normal circumstances). This way, the +interrupted code can continue as usual. + +But after the instruction has been executed, we should hide the page again, so +that we can catch the next access too! Now kmemcheck makes use of a debugging +feature of the processor, namely single-stepping. When the processor has +finished the one instruction that generated the memory access, a debug +exception is raised. From here, we simply hide the page again and continue +execution, this time with the single-stepping feature turned off. + +kmemcheck requires some assistance from the memory allocator in order to work. +The memory allocator needs to + + 1. Tell kmemcheck about newly allocated pages and pages that are about to + be freed. This allows kmemcheck to set up and tear down the shadow memory + for the pages in question. The shadow memory stores the status of each + byte in the allocation proper, e.g. whether it is initialized or + uninitialized. + + 2. Tell kmemcheck which parts of memory should be marked uninitialized. + There are actually a few more states, such as "not yet allocated" and + "recently freed". + +If a slab cache is set up using the SLAB_NOTRACK flag, it will never return +memory that can take page faults because of kmemcheck. + +If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still +request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. +This does not prevent the page faults from occurring, however, but marks the +object in question as being initialized so that no warnings will ever be +produced for this object. + +Currently, the SLAB and SLUB allocators are supported by kmemcheck. diff --git a/Documentation/dev-tools/tools.rst b/Documentation/dev-tools/tools.rst index 3b6382a..43f7dee 100644 --- a/Documentation/dev-tools/tools.rst +++ b/Documentation/dev-tools/tools.rst @@ -21,3 +21,4 @@ whole; patches welcome! kasan ubsan kmemleak + kmemcheck diff --git a/Documentation/kmemcheck.txt b/Documentation/kmemcheck.txt deleted file mode 100644 index 80aae85..0000000 --- a/Documentation/kmemcheck.txt +++ /dev/null @@ -1,754 +0,0 @@ -GETTING STARTED WITH KMEMCHECK -============================== - -Vegard Nossum <vegardno@xxxxxxxxxx> - - -Contents -======== -0. Introduction -1. Downloading -2. Configuring and compiling -3. How to use -3.1. Booting -3.2. Run-time enable/disable -3.3. Debugging -3.4. Annotating false positives -4. Reporting errors -5. Technical description - - -0. Introduction -=============== - -kmemcheck is a debugging feature for the Linux Kernel. More specifically, it -is a dynamic checker that detects and warns about some uses of uninitialized -memory. - -Userspace programmers might be familiar with Valgrind's memcheck. The main -difference between memcheck and kmemcheck is that memcheck works for userspace -programs only, and kmemcheck works for the kernel only. The implementations -are of course vastly different. Because of this, kmemcheck is not as accurate -as memcheck, but it turns out to be good enough in practice to discover real -programmer errors that the compiler is not able to find through static -analysis. - -Enabling kmemcheck on a kernel will probably slow it down to the extent that -the machine will not be usable for normal workloads such as e.g. an -interactive desktop. kmemcheck will also cause the kernel to use about twice -as much memory as normal. For this reason, kmemcheck is strictly a debugging -feature. - - -1. Downloading -============== - -As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. - - -2. Configuring and compiling -============================ - -kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of -configuration variables must have specific settings in order for the kmemcheck -menu to even appear in "menuconfig". These are: - - o CONFIG_CC_OPTIMIZE_FOR_SIZE=n - - This option is located under "General setup" / "Optimize for size". - - Without this, gcc will use certain optimizations that usually lead to - false positive warnings from kmemcheck. An example of this is a 16-bit - field in a struct, where gcc may load 32 bits, then discard the upper - 16 bits. kmemcheck sees only the 32-bit load, and may trigger a - warning for the upper 16 bits (if they're uninitialized). - - o CONFIG_SLAB=y or CONFIG_SLUB=y - - This option is located under "General setup" / "Choose SLAB - allocator". - - o CONFIG_FUNCTION_TRACER=n - - This option is located under "Kernel hacking" / "Tracers" / "Kernel - Function Tracer" - - When function tracing is compiled in, gcc emits a call to another - function at the beginning of every function. This means that when the - page fault handler is called, the ftrace framework will be called - before kmemcheck has had a chance to handle the fault. If ftrace then - modifies memory that was tracked by kmemcheck, the result is an - endless recursive page fault. - - o CONFIG_DEBUG_PAGEALLOC=n - - This option is located under "Kernel hacking" / "Memory Debugging" - / "Debug page memory allocations". - -In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also -located under "Kernel hacking". With this, you will be able to get line number -information from the kmemcheck warnings, which is extremely valuable in -debugging a problem. This option is not mandatory, however, because it slows -down the compilation process and produces a much bigger kernel image. - -Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory -Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows -a description of the kmemcheck configuration variables: - - o CONFIG_KMEMCHECK - - This must be enabled in order to use kmemcheck at all... - - o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT - - This option controls the status of kmemcheck at boot-time. "Enabled" - will enable kmemcheck right from the start, "disabled" will boot the - kernel as normal (but with the kmemcheck code compiled in, so it can - be enabled at run-time after the kernel has booted), and "one-shot" is - a special mode which will turn kmemcheck off automatically after - detecting the first use of uninitialized memory. - - If you are using kmemcheck to actively debug a problem, then you - probably want to choose "enabled" here. - - The one-shot mode is mostly useful in automated test setups because it - can prevent floods of warnings and increase the chances of the machine - surviving in case something is really wrong. In other cases, the one- - shot mode could actually be counter-productive because it would turn - itself off at the very first error -- in the case of a false positive - too -- and this would come in the way of debugging the specific - problem you were interested in. - - If you would like to use your kernel as normal, but with a chance to - enable kmemcheck in case of some problem, it might be a good idea to - choose "disabled" here. When kmemcheck is disabled, most of the run- - time overhead is not incurred, and the kernel will be almost as fast - as normal. - - o CONFIG_KMEMCHECK_QUEUE_SIZE - - Select the maximum number of error reports to store in an internal - (fixed-size) buffer. Since errors can occur virtually anywhere and in - any context, we need a temporary storage area which is guaranteed not - to generate any other page faults when accessed. The queue will be - emptied as soon as a tasklet may be scheduled. If the queue is full, - new error reports will be lost. - - The default value of 64 is probably fine. If some code produces more - than 64 errors within an irqs-off section, then the code is likely to - produce many, many more, too, and these additional reports seldom give - any more information (the first report is usually the most valuable - anyway). - - This number might have to be adjusted if you are not using serial - console or similar to capture the kernel log. If you are using the - "dmesg" command to save the log, then getting a lot of kmemcheck - warnings might overflow the kernel log itself, and the earlier reports - will get lost in that way instead. Try setting this to 10 or so on - such a setup. - - o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT - - Select the number of shadow bytes to save along with each entry of the - error-report queue. These bytes indicate what parts of an allocation - are initialized, uninitialized, etc. and will be displayed when an - error is detected to help the debugging of a particular problem. - - The number entered here is actually the logarithm of the number of - bytes that will be saved. So if you pick for example 5 here, kmemcheck - will save 2^5 = 32 bytes. - - The default value should be fine for debugging most problems. It also - fits nicely within 80 columns. - - o CONFIG_KMEMCHECK_PARTIAL_OK - - This option (when enabled) works around certain GCC optimizations that - produce 32-bit reads from 16-bit variables where the upper 16 bits are - thrown away afterwards. - - The default value (enabled) is recommended. This may of course hide - some real errors, but disabling it would probably produce a lot of - false positives. - - o CONFIG_KMEMCHECK_BITOPS_OK - - This option silences warnings that would be generated for bit-field - accesses where not all the bits are initialized at the same time. This - may also hide some real bugs. - - This option is probably obsolete, or it should be replaced with - the kmemcheck-/bitfield-annotations for the code in question. The - default value is therefore fine. - -Now compile the kernel as usual. - - -3. How to use -============= - -3.1. Booting -============ - -First some information about the command-line options. There is only one -option specific to kmemcheck, and this is called "kmemcheck". It can be used -to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT -option. Its possible settings are: - - o kmemcheck=0 (disabled) - o kmemcheck=1 (enabled) - o kmemcheck=2 (one-shot mode) - -If SLUB debugging has been enabled in the kernel, it may take precedence over -kmemcheck in such a way that the slab caches which are under SLUB debugging -will not be tracked by kmemcheck. In order to ensure that this doesn't happen -(even though it shouldn't by default), use SLUB's boot option "slub_debug", -like this: slub_debug=- - -In fact, this option may also be used for fine-grained control over SLUB vs. -kmemcheck. For example, if the command line includes "kmemcheck=1 -slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" -slab cache, and with kmemcheck tracking all the other caches. This is advanced -usage, however, and is not generally recommended. - - -3.2. Run-time enable/disable -============================ - -When the kernel has booted, it is possible to enable or disable kmemcheck at -run-time. WARNING: This feature is still experimental and may cause false -positive warnings to appear. Therefore, try not to use this. If you find that -it doesn't work properly (e.g. you see an unreasonable amount of warnings), I -will be happy to take bug reports. - -Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: - - $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck - -The numbers are the same as for the kmemcheck= command-line option. - - -3.3. Debugging -============== - -A typical report will look something like this: - -WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) -80000000000000000000000000000000000000000088ffff0000000000000000 - i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u - ^ - -Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A -RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 -RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 -RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 -RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 -RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 -R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e -R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 -FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 -CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 -CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 -DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 -DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 - [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 - [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 - [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 - [<ffffffff8100c7b5>] int_signal+0x12/0x17 - [<ffffffffffffffff>] 0xffffffffffffffff - -The single most valuable information in this report is the RIP (or EIP on 32- -bit) value. This will help us pinpoint exactly which instruction that caused -the warning. - -If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do -is give this address to the addr2line program, like this: - - $ addr2line -e vmlinux -i ffffffff8104ede8 - arch/x86/include/asm/string_64.h:12 - include/asm-generic/siginfo.h:287 - kernel/signal.c:380 - kernel/signal.c:410 - -The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must -be the vmlinux of the kernel that produced the warning in the first place! If -not, the line number information will almost certainly be wrong. - -The "-i" tells addr2line to also print the line numbers of inlined functions. -In this case, the flag was very important, because otherwise, it would only -have printed the first line, which is just a call to memcpy(), which could be -called from a thousand places in the kernel, and is therefore not very useful. -These inlined functions would not show up in the stack trace above, simply -because the kernel doesn't load the extra debugging information. This -technique can of course be used with ordinary kernel oopses as well. - -In this case, it's the caller of memcpy() that is interesting, and it can be -found in include/asm-generic/siginfo.h, line 287: - -281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) -282 { -283 if (from->si_code < 0) -284 memcpy(to, from, sizeof(*to)); -285 else -286 /* _sigchld is currently the largest know union member */ -287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); -288 } - -Since this was a read (kmemcheck usually warns about reads only, though it can -warn about writes to unallocated or freed memory as well), it was probably the -"from" argument which contained some uninitialized bytes. Following the chain -of calls, we move upwards to see where "from" was allocated or initialized, -kernel/signal.c, line 380: - -359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) -360 { -... -367 list_for_each_entry(q, &list->list, list) { -368 if (q->info.si_signo == sig) { -369 if (first) -370 goto still_pending; -371 first = q; -... -377 if (first) { -378 still_pending: -379 list_del_init(&first->list); -380 copy_siginfo(info, &first->info); -381 __sigqueue_free(first); -... -392 } -393 } - -Here, it is &first->info that is being passed on to copy_siginfo(). The -variable "first" was found on a list -- passed in as the second argument to -collect_signal(). We continue our journey through the stack, to figure out -where the item on "list" was allocated or initialized. We move to line 410: - -395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, -396 siginfo_t *info) -397 { -... -410 collect_signal(sig, pending, info); -... -414 } - -Now we need to follow the "pending" pointer, since that is being passed on to -collect_signal() as "list". At this point, we've run out of lines from the -"addr2line" output. Not to worry, we just paste the next addresses from the -kmemcheck stack dump, i.e.: - - [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 - [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 - [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 - [<ffffffff8100c7b5>] int_signal+0x12/0x17 - - $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ - ffffffff8100b87d ffffffff8100c7b5 - kernel/signal.c:446 - kernel/signal.c:1806 - arch/x86/kernel/signal.c:805 - arch/x86/kernel/signal.c:871 - arch/x86/kernel/entry_64.S:694 - -Remember that since these addresses were found on the stack and not as the -RIP value, they actually point to the _next_ instruction (they are return -addresses). This becomes obvious when we look at the code for line 446: - -422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) -423 { -... -431 signr = __dequeue_signal(&tsk->signal->shared_pending, -432 mask, info); -433 /* -434 * itimer signal ? -435 * -436 * itimers are process shared and we restart periodic -437 * itimers in the signal delivery path to prevent DoS -438 * attacks in the high resolution timer case. This is -439 * compliant with the old way of self restarting -440 * itimers, as the SIGALRM is a legacy signal and only -441 * queued once. Changing the restart behaviour to -442 * restart the timer in the signal dequeue path is -443 * reducing the timer noise on heavy loaded !highres -444 * systems too. -445 */ -446 if (unlikely(signr == SIGALRM)) { -... -489 } - -So instead of looking at 446, we should be looking at 431, which is the line -that executes just before 446. Here we see that what we are looking for is -&tsk->signal->shared_pending. - -Our next task is now to figure out which function that puts items on this -"shared_pending" list. A crude, but efficient tool, is git grep: - - $ git grep -n 'shared_pending' kernel/ - ... - kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; - kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; - ... - -There were more results, but none of them were related to list operations, -and these were the only assignments. We inspect the line numbers more closely -and find that this is indeed where items are being added to the list: - -816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, -817 int group) -818 { -... -828 pending = group ? &t->signal->shared_pending : &t->pending; -... -851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && -852 (is_si_special(info) || -853 info->si_code >= 0))); -854 if (q) { -855 list_add_tail(&q->list, &pending->list); -... -890 } - -and: - -1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) -1310 { -.... -1339 pending = group ? &t->signal->shared_pending : &t->pending; -1340 list_add_tail(&q->list, &pending->list); -.... -1347 } - -In the first case, the list element we are looking for, "q", is being returned -from the function __sigqueue_alloc(), which looks like an allocation function. -Let's take a look at it: - -187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, -188 int override_rlimit) -189 { -190 struct sigqueue *q = NULL; -191 struct user_struct *user; -192 -193 /* -194 * We won't get problems with the target's UID changing under us -195 * because changing it requires RCU be used, and if t != current, the -196 * caller must be holding the RCU readlock (by way of a spinlock) and -197 * we use RCU protection here -198 */ -199 user = get_uid(__task_cred(t)->user); -200 atomic_inc(&user->sigpending); -201 if (override_rlimit || -202 atomic_read(&user->sigpending) <= -203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) -204 q = kmem_cache_alloc(sigqueue_cachep, flags); -205 if (unlikely(q == NULL)) { -206 atomic_dec(&user->sigpending); -207 free_uid(user); -208 } else { -209 INIT_LIST_HEAD(&q->list); -210 q->flags = 0; -211 q->user = user; -212 } -213 -214 return q; -215 } - -We see that this function initializes q->list, q->flags, and q->user. It seems -that now is the time to look at the definition of "struct sigqueue", e.g.: - -14 struct sigqueue { -15 struct list_head list; -16 int flags; -17 siginfo_t info; -18 struct user_struct *user; -19 }; - -And, you might remember, it was a memcpy() on &first->info that caused the -warning, so this makes perfect sense. It also seems reasonable to assume that -it is the caller of __sigqueue_alloc() that has the responsibility of filling -out (initializing) this member. - -But just which fields of the struct were uninitialized? Let's look at -kmemcheck's report again: - -WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) -80000000000000000000000000000000000000000088ffff0000000000000000 - i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u - ^ - -These first two lines are the memory dump of the memory object itself, and the -shadow bytemap, respectively. The memory object itself is in this case -&first->info. Just beware that the start of this dump is NOT the start of the -object itself! The position of the caret (^) corresponds with the address of -the read (ffff88003e4a2024). - -The shadow bytemap dump legend is as follows: - - i - initialized - u - uninitialized - a - unallocated (memory has been allocated by the slab layer, but has not - yet been handed off to anybody) - f - freed (memory has been allocated by the slab layer, but has been freed - by the previous owner) - -In order to figure out where (relative to the start of the object) the -uninitialized memory was located, we have to look at the disassembly. For -that, we'll need the RIP address again: - -RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 - - $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: - ffffffff8104edc8: mov %r8,0x8(%r8) - ffffffff8104edcc: test %r10d,%r10d - ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> - ffffffff8104edd5: mov %rax,%rdx - ffffffff8104edd8: mov $0xc,%ecx - ffffffff8104eddd: mov %r13,%rdi - ffffffff8104ede0: mov $0x30,%eax - ffffffff8104ede5: mov %rdx,%rsi - ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) - ffffffff8104edea: test $0x2,%al - ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> - ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) - ffffffff8104edf0: test $0x1,%al - ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> - ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) - ffffffff8104edf5: mov %r8,%rdi - ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> - -As expected, it's the "rep movsl" instruction from the memcpy() that causes -the warning. We know about REP MOVSL that it uses the register RCX to count -the number of remaining iterations. By taking a look at the register dump -again (from the kmemcheck report), we can figure out how many bytes were left -to copy: - -RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 - -By looking at the disassembly, we also see that %ecx is being loaded with the -value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind -that this is the number of iterations, not bytes. And since this is a "long" -operation, we need to multiply by 4 to get the number of bytes. So this means -that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes -from the start of the object. - -We can now try to figure out which field of the "struct siginfo" that was not -initialized. This is the beginning of the struct: - -40 typedef struct siginfo { -41 int si_signo; -42 int si_errno; -43 int si_code; -44 -45 union { -.. -92 } _sifields; -93 } siginfo_t; - -On 64-bit, the int is 4 bytes long, so it must the union member that has -not been initialized. We can verify this using gdb: - - $ gdb vmlinux - ... - (gdb) p &((struct siginfo *) 0)->_sifields - $1 = (union {...} *) 0x10 - -Actually, it seems that the union member is located at offset 0x10 -- which -means that gcc has inserted 4 bytes of padding between the members si_code -and _sifields. We can now get a fuller picture of the memory dump: - - _----------------------------=> si_code - / _--------------------=> (padding) - | / _------------=> _sifields(._kill._pid) - | | / _----=> _sifields(._kill._uid) - | | | / --------|-------|-------|-------| -80000000000000000000000000000000000000000088ffff0000000000000000 - i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u - -This allows us to realize another important fact: si_code contains the value -0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are -really the number 0x00000080. With a bit of research, we find that this is -actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h: - -144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ - -This macro is used in exactly one place in the x86 kernel: In send_signal() -in kernel/signal.c: - -816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, -817 int group) -818 { -... -828 pending = group ? &t->signal->shared_pending : &t->pending; -... -851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && -852 (is_si_special(info) || -853 info->si_code >= 0))); -854 if (q) { -855 list_add_tail(&q->list, &pending->list); -856 switch ((unsigned long) info) { -... -865 case (unsigned long) SEND_SIG_PRIV: -866 q->info.si_signo = sig; -867 q->info.si_errno = 0; -868 q->info.si_code = SI_KERNEL; -869 q->info.si_pid = 0; -870 q->info.si_uid = 0; -871 break; -... -890 } - -Not only does this match with the .si_code member, it also matches the place -we found earlier when looking for where siginfo_t objects are enqueued on the -"shared_pending" list. - -So to sum up: It seems that it is the padding introduced by the compiler -between two struct fields that is uninitialized, and this gets reported when -we do a memcpy() on the struct. This means that we have identified a false -positive warning. - -Normally, kmemcheck will not report uninitialized accesses in memcpy() calls -when both the source and destination addresses are tracked. (Instead, we copy -the shadow bytemap as well). In this case, the destination address clearly -was not tracked. We can dig a little deeper into the stack trace from above: - - arch/x86/kernel/signal.c:805 - arch/x86/kernel/signal.c:871 - arch/x86/kernel/entry_64.S:694 - -And we clearly see that the destination siginfo object is located on the -stack: - -782 static void do_signal(struct pt_regs *regs) -783 { -784 struct k_sigaction ka; -785 siginfo_t info; -... -804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); -... -854 } - -And this &info is what eventually gets passed to copy_siginfo() as the -destination argument. - -Now, even though we didn't find an actual error here, the example is still a -good one, because it shows how one would go about to find out what the report -was all about. - - -3.4. Annotating false positives -=============================== - -There are a few different ways to make annotations in the source code that -will keep kmemcheck from checking and reporting certain allocations. Here -they are: - - o __GFP_NOTRACK_FALSE_POSITIVE - - This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore - also to other functions that end up calling one of these) to indicate - that the allocation should not be tracked because it would lead to - a false positive report. This is a "big hammer" way of silencing - kmemcheck; after all, even if the false positive pertains to - particular field in a struct, for example, we will now lose the - ability to find (real) errors in other parts of the same struct. - - Example: - - /* No warnings will ever trigger on accessing any part of x */ - x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); - - o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and - kmemcheck_annotate_bitfield(ptr, name) - - The first two of these three macros can be used inside struct - definitions to signal, respectively, the beginning and end of a - bitfield. Additionally, this will assign the bitfield a name, which - is given as an argument to the macros. - - Having used these markers, one can later use - kmemcheck_annotate_bitfield() at the point of allocation, to indicate - which parts of the allocation is part of a bitfield. - - Example: - - struct foo { - int x; - - kmemcheck_bitfield_begin(flags); - int flag_a:1; - int flag_b:1; - kmemcheck_bitfield_end(flags); - - int y; - }; - - struct foo *x = kmalloc(sizeof *x); - - /* No warnings will trigger on accessing the bitfield of x */ - kmemcheck_annotate_bitfield(x, flags); - - Note that kmemcheck_annotate_bitfield() can be used even before the - return value of kmalloc() is checked -- in other words, passing NULL - as the first argument is legal (and will do nothing). - - -4. Reporting errors -=================== - -As we have seen, kmemcheck will produce false positive reports. Therefore, it -is not very wise to blindly post kmemcheck warnings to mailing lists and -maintainers. Instead, I encourage maintainers and developers to find errors -in their own code. If you get a warning, you can try to work around it, try -to figure out if it's a real error or not, or simply ignore it. Most -developers know their own code and will quickly and efficiently determine the -root cause of a kmemcheck report. This is therefore also the most efficient -way to work with kmemcheck. - -That said, we (the kmemcheck maintainers) will always be on the lookout for -false positives that we can annotate and silence. So whatever you find, -please drop us a note privately! Kernel configs and steps to reproduce (if -available) are of course a great help too. - -Happy hacking! - - -5. Technical description -======================== - -kmemcheck works by marking memory pages non-present. This means that whenever -somebody attempts to access the page, a page fault is generated. The page -fault handler notices that the page was in fact only hidden, and so it calls -on the kmemcheck code to make further investigations. - -When the investigations are completed, kmemcheck "shows" the page by marking -it present (as it would be under normal circumstances). This way, the -interrupted code can continue as usual. - -But after the instruction has been executed, we should hide the page again, so -that we can catch the next access too! Now kmemcheck makes use of a debugging -feature of the processor, namely single-stepping. When the processor has -finished the one instruction that generated the memory access, a debug -exception is raised. From here, we simply hide the page again and continue -execution, this time with the single-stepping feature turned off. - -kmemcheck requires some assistance from the memory allocator in order to work. -The memory allocator needs to - - 1. Tell kmemcheck about newly allocated pages and pages that are about to - be freed. This allows kmemcheck to set up and tear down the shadow memory - for the pages in question. The shadow memory stores the status of each - byte in the allocation proper, e.g. whether it is initialized or - uninitialized. - - 2. Tell kmemcheck which parts of memory should be marked uninitialized. - There are actually a few more states, such as "not yet allocated" and - "recently freed". - -If a slab cache is set up using the SLAB_NOTRACK flag, it will never return -memory that can take page faults because of kmemcheck. - -If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still -request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. -This does not prevent the page faults from occurring, however, but marks the -object in question as being initialized so that no warnings will ever be -produced for this object. - -Currently, the SLAB and SLUB allocators are supported by kmemcheck. diff --git a/MAINTAINERS b/MAINTAINERS index b235e0d..8107235 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6803,7 +6803,7 @@ KMEMCHECK M: Vegard Nossum <vegardno@xxxxxxxxxx> M: Pekka Enberg <penberg@xxxxxxxxxx> S: Maintained -F: Documentation/kmemcheck.txt +F: Documentation/dev-tools/kmemcheck.rst F: arch/x86/include/asm/kmemcheck.h F: arch/x86/mm/kmemcheck/ F: include/linux/kmemcheck.h -- 2.9.2 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html