On Fri, Aug 26, 2011 at 4:19 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
(I'm back!)
Thank you Andrew for the comments.
Hmm, looks like we still need some changes for this patch to get it merged into -mm; I might be able to jump into it sometime next week. :)
--Ying
On Thu, 28 Jul 2011 18:25:59 -0700
Andrew Bresticker <abrestic@xxxxxxxxxx> wrote:
> The problem with a small dmesg ring buffer (e.g. 512k) is that only a limited
> number of task traces will be logged. Sometimes we lose important information
> simply because there are too many duplicated stack traces. This problem occurs
> when dumping lots of stacks in a single operation, such as sysrq-T.
>
> This patch tries to reduce the duplication of task stack traces in the dump
> message by hashing each task's stack. The hash table is a 32k buffer,
> pre-allocated during bootup, and each task stack is hashed over at most 32
> stack entries. Whenever an identical trace is found in the hash table, we dump
> only the pid of the task whose trace has already been dumped, so it is easy to
> track back to the full stack via that pid.
>
> [ 58.469730] kworker/0:0 S 0000000000000000 0 4 2 0x00000000
> [ 58.469735] ffff88082fcfde80 0000000000000046 ffff88082e9d8000 ffff88082fcfc010
> [ 58.469739] ffff88082fce9860 0000000000011440 ffff88082fcfdfd8 ffff88082fcfdfd8
> [ 58.469743] 0000000000011440 0000000000000000 ffff88082fcee180 ffff88082fce9860
> [ 58.469747] Call Trace:
> [ 58.469751] [<ffffffff8108525a>] worker_thread+0x24b/0x250
> [ 58.469754] [<ffffffff8108500f>] ? manage_workers+0x192/0x192
> [ 58.469757] [<ffffffff810885bd>] kthread+0x82/0x8a
> [ 58.469760] [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
> [ 58.469763] [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
> [ 58.469765] [<ffffffff8141aed0>] ? gs_change+0xb/0xb
> [ 58.469768] kworker/u:0 S 0000000000000004 0 5 2 0x00000000
> [ 58.469773] ffff88082fcffe80 0000000000000046 ffff880800000000 ffff88082fcfe010
> [ 58.469777] ffff88082fcea080 0000000000011440 ffff88082fcfffd8 ffff88082fcfffd8
> [ 58.469781] 0000000000011440 0000000000000000 ffff88082fd4e9a0 ffff88082fcea080
> [ 58.469785] Call Trace:
> [ 58.469786] <Same stack as pid 4>
> [ 58.470235] kworker/0:1 S 0000000000000000 0 13 2 0x00000000
> [ 58.470255] ffff88082fd3fe80 0000000000000046 ffff880800000000 ffff88082fd3e010
> [ 58.470279] ffff88082fcee180 0000000000011440 ffff88082fd3ffd8 ffff88082fd3ffd8
> [ 58.470301] 0000000000011440 0000000000000000 ffffffff8180b020 ffff88082fcee180
> [ 58.470325] Call Trace:
> [ 58.470332] <Same stack as pid 4>
The code looks OK(ish) to me. I'm still concerned that the implementation
will miss lots of de-duplications because it is hashing random crud in
the stack frame.
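Something along these lines might do better (an untested sketch, and whether
save_stack_trace_tsk() is appropriate in this path is an open question): hash
only the return addresses the unwinder actually resolved, rather than raw
words from the stack:

	/* needs <linux/stacktrace.h> and <linux/jhash.h> */
	static unsigned long hash_task_stack(struct task_struct *tsk)
	{
		unsigned long entries[DEDUP_MAX_STACK_DEPTH];
		struct stack_trace trace = {
			.max_entries	= DEDUP_MAX_STACK_DEPTH,
			.entries	= entries,
		};

		/* collect only the resolved return addresses */
		save_stack_trace_tsk(tsk, &trace);

		/*
		 * Hash those, not the raw stack contents, so saved
		 * registers and other on-stack noise can't defeat the
		 * dedup.
		 */
		return jhash(entries, trace.nr_entries * sizeof(entries[0]), 0);
	}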
> Note: Non-x86 architectures will need to be updated since show_stack()
> now takes an additional argument.

Well, we can't break all architectures.
I can't think of a way to make the preprocessor convert show_stack(a,
b) into show_stack(a, b, N) (this can be done in the other direction).
So all I can think of is to rename x86 show_stack() to something else and do
#define show_stack_something_else(a, b, c) show_stack(a, b)
for other architectures.
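Concretely, something like this (sketch only; "show_stack_dedup" is just a
made-up name):

	/* x86 (or any arch that implements the dedup) provides the real thing: */
	void show_stack_dedup(struct task_struct *task, unsigned long *sp,
			      pid_t dup_stack_pid);
	#define show_stack_dedup show_stack_dedup

	/* everyone else falls back to the existing two-argument call: */
	#ifndef show_stack_dedup
	#define show_stack_dedup(task, sp, dup_stack_pid) show_stack(task, sp)
	#endif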
But on the other hand, why did the show_stack() interface get changed?
show_stack() dumps a single task's stack, so top-level callers have no
earthly reason to be passing the dup_stack_pid into show_stack().
dup_stack_pid is purely for many-task stackdumps.
Also, the code as-is is pretty much useless for other architectures.
The core changes in arch/x86/kernel/stacktrace.c look pretty generic -
can we design and place this code so that all architectures can use it?
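For instance (rough sketch; stack_trace_dedup() is an invented name for the
lookup/insert helper, which would move into kernel/stacktrace.c), the caller
side in the generic dump loop could look like:

	/* e.g. in the sysrq-T / show_state_filter() path */
	static void show_task_stack_dedup(struct task_struct *p)
	{
		pid_t dup = stack_trace_dedup(p);	/* 0 if this stack is new */

		if (dup)
			printk(KERN_INFO " <Same stack as pid %d>\n", dup);
		else
			show_stack(p, NULL);	/* unchanged two-argument interface */
	}

That way every architecture gets the dedup for free, and show_stack() keeps
its current prototype.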
> ...
>
>
> +/*
> + * The implementation of stack trace dedup. It tries to reduce the duplication
> + * of task stack trace in the dump by hashing the stack trace. The hashtable is
> + * 32k pre-allocated buffer. Then we hash the task stack with stack_depth
> + * DEDUP_MAX_STACK_DEPTH for each stack entry. Each time if an identical trace
> + * is found in the stack, we dump only the pid of previous task. So it is easy
> + * to back track to the full stack with the pid.
> + */
> +#define DEDUP_MAX_STACK_DEPTH 32
> +#define DEDUP_STACK_HASH 32768
> +#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack))
> +#define DEDUP_HASH_MAX_ITERATIONS 10

It wouldn't hurt to document DEDUP_HASH_MAX_ITERATIONS (at least).
But then, why does DEDUP_HASH_MAX_ITERATIONS exist? (below)

(this would be more efficient if DEDUP_STACK_ENTRIES was a power of 2)
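e.g. (entry count picked arbitrarily):

	#define DEDUP_STACK_ENTRIES	2048	/* power of two */

		index = cur_stack.hash & (DEDUP_STACK_ENTRIES - 1);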
> +struct task_stack {
> + pid_t pid;
> + int len;
> + unsigned long hash;
> +};
> +
> +static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES];
> +static struct task_stack cur_stack;
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
> +
> +/*
> + * The stack hashtable uses linear probing to resolve collisions.
> + * We consider two stacks to be the same if their hash values and lengths
> + * are equal.
> + */
> +static unsigned int stack_trace_lookup(void)
> +{
> + int j;
> + int index;
> + unsigned int ret = 0;
> + struct task_stack *stack;
> +
> + index = cur_stack.hash % DEDUP_STACK_ENTRIES;
> +
> + for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) {
> + stack = stack_hash_table + (index + j) % DEDUP_STACK_ENTRIES;
> + if (stack->hash == 0) {
> + *stack = cur_stack;
> + ret = 0;
> + break;
> + } else {
> + if (stack->hash == cur_stack.hash &&
> + stack->len == cur_stack.len) {
> + ret = stack->pid;
> + break;
> + }
> + }
> + }
> + if (j == DEDUP_HASH_MAX_ITERATIONS)
> + stack_hash_table[index] = cur_stack;
Why stop there? Why not just append to stack_hash_table[]? When we
first decide to do a multi-task stackdump, zero the index into the
array. Each time a task is processed, look to see if it is unique and
if so, add its task_stack to the end of the array.
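Roughly (untested; stack_dump_table/stack_trace_record are made-up names,
and locking is elided):

	static struct task_stack stack_dump_table[DEDUP_STACK_ENTRIES];
	static unsigned int stack_dump_count;	/* zeroed when a dump starts */

	/* returns the pid of an earlier task with the same stack, or 0 */
	static pid_t stack_trace_record(const struct task_stack *cur)
	{
		unsigned int i;

		for (i = 0; i < stack_dump_count; i++) {
			if (stack_dump_table[i].hash == cur->hash &&
			    stack_dump_table[i].len == cur->len)
				return stack_dump_table[i].pid;
		}
		if (stack_dump_count < DEDUP_STACK_ENTRIES)
			stack_dump_table[stack_dump_count++] = *cur;
		return 0;
	}

The linear scan costs O(number of dumped tasks) per task, but in an
oops/sysrq path that hardly matters, and the hash-table-full corner case
(DEDUP_HASH_MAX_ITERATIONS) goes away.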
This may require adding a stacktrace_ops.start(). This could be done
while moving stacktrace_ops (which advertises itself as a "Generic
stack tracer"!) out of x86-specific code.
Sane, but I'm not sure it's necessary.
> + memset(&cur_stack, 0, sizeof(cur_stack));
> + return ret;
> +}
> +
>
> ...
>
Making this all arch-neutral is quite a bit of work, which you may not
feel like undertaking, ho hum. Also, the lack of any documentation in
that x86 code makes it unready for prime time.