Hello Oleg, On Wed, 2010-10-06 at 17:26 +0200, Oleg Nesterov wrote: > > The patch also approaches another ugly Unix behavior regarding process > > accounting. If a parent process dies before his children, the children > > get the reaper process (init) as new parent. If we want to determine the > > CPU usage of a process tree with cumulative time, this is very > > suboptimal. To fix this I added a new process relationship tree for > > accounting. > > Well, I must admit, I can't say I like the complications this change adds ;) > In any case, imho this change needs a separate patch/discussion. Well, to be honest, I have not expected that people will love this patch. At least not immediately :-) I just wanted to show one idea how we could solve my problem with the lost time. > > Besides of that the patch adds an "acct_parent" pointer next to the parent > > pointer and a "children_acct" list next to the children list to the > > task_struct in order to remember the correct accounting task relationship. > > I am not sure I understand the "correct accounting" above. ->acct_parent > adds the "parallel" hierarchy. Correct. The patch adds a "parallel" hierarchy that IMHO is better suited for accounting purposes. For me the correct accounting hierarchy means that cumulative time is passed along the real relationship tree. That means if you have two task snapshots it is clear which tasks inherited the time of the dead children. Example: P1 is parent of P2 is parent of P3 Snapshot 1: P1 -> P2 -> P3 Snapshot 2: P1 We know that P2 and P3 died in the last interval. With the current cumulative time accounting we can't say, if P1 got the CPU time of P3 (if P3 died before P2) or if init got the time (if P2 died before P3). With my patch we know that P1 got all the CPU time of P2 and P3. Without the patch a top command had somehow to receive the "reparent to init" events or it had to receive all task exit events. The latter is probably very CPU intensive for workloads that create many short-running processes. > In the simplest case, suppose that some > process P forks the child C and exits. Then C->acct_parent == P->real_parent > (P->acct_parent in general). I am not sure this is always good. For most cases both trees are identical. Only, if reparent to init happens, the trees become different. Maybe it is possible to optimize the code that only if the trees differ the information is stored. > Anyway, > > > @@ -90,6 +156,24 @@ static void __exit_signal(struct task_st > > > > posix_cpu_timers_exit(tsk); > > if (group_dead) { > > + if (!tsk->exit_accounting_done) { > > +#ifdef __s390x__ > > + /* > > + * FIXME: On s390 we can call account_process_tick to update > > + * CPU time information. This is probably not valid on other > > + * architectures. > > + */ > > + if (current == tsk) > > + account_process_tick(current, 1); > > +#endif > > + /* > > + * FIXME: This somehow has to be moved to > > + * finish_task_switch(), because otherwise > > + * if the process accounts itself, the CPU time > > + * that is used for this code will be lost. > > + */ > > + __account_to_parent(tsk, 0); > > We hold the wrong ->siglock here. Right, we hold the siglock of the exiting task, but we should hold the lock of the parent, correct? > Also, the logic behind ->exit_accounting_done looks wrong (and unneeded) > but I am not sure... I think the logic is correct, but probably a better implementation is possible. The member "exit_accounting_done" is used for the following two cases: 1. Process becomes a zombie and parent waits for it: * wait_task_zombie() calls __account_to_parent(p, 1) that sets exit_accounting_done = 1 * Then release_task()/__exit_signal() is called that does no accounting, because exit_accounting_done is already set to 1 2. Process reaps itself: * release_task()/__exit_signal() is called, exit_accounting_done is still 0, therefore __account_to_parent(tsk, 0) is called > > @@ -772,6 +869,15 @@ static void forget_original_parent(struc > > LIST_HEAD(dead_children); > > > > write_lock_irq(&tasklist_lock); > > + list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) { > > + struct task_struct *t = p; > > + do { > > + t->acct_parent = t->acct_parent->acct_parent; > > + } while_each_thread(p, t); > > + list_move_tail(&p->sibling_acct, > > + &p->acct_parent->children_acct); > > This is certainly wrong if there are other live threads in father's > thread-group. Sorry for my ignorance. Probably I have not understood what happens, if a thread group leader dies. My assumption was that then the whole thread group dies. Also I assumed that a parent can only be a thread group leader. So what you are saying is that when a thread group leader dies and there are other live threads in his thread group, his children will NOT get init as new parent? Instead they get the new thread group leader as parent? > Also, you need to change de_thread() if it changes the leader. Something like the following? --- a/fs/exec.c +++ b/fs/exec.c @@ -850,6 +850,7 @@ static int de_thread(struct task_struct list_replace_rcu(&leader->tasks, &tsk->tasks); list_replace_init(&leader->sibling, &tsk->sibling); + list_replace_init(&leader->sibling_acct, &tsk->sibling_acct); tsk->group_leader = tsk; leader->group_leader = tsk; > > > list_for_each_entry_safe(p, n, &dead_children, sibling) { > > list_del_init(&p->sibling); > > + list_del_init(&p->sibling_acct); > > This list_del() can race with ->acct_parent if it in turn exits and > does forget_original_parent() -> list_move_tail(sibling_acct). Ok, to fix this I could move the list_del_init(&p->sibling_acct) to reparent_leader(): @@ -761,6 +868,7 @@ static void reparent_leader(struct task_ if (task_detached(p)) { p->exit_state = EXIT_DEAD; list_move_tail(&p->sibling, dead); + list_del_init(&p->sibling_acct); } } Michael -- To unsubscribe from this list: send the line "unsubscribe linux-s390" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html