On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:
> > Not entirely sure I follow, we currently only update the current cgroup
> > and its immediate parents, no? Or are you looking to only account into
> > the current cgroup and propagate into the parents on reading?
>
> Yeah, shifting the cost to the readers and being smart with
> propagation so that reading isn't O(nr_descendants) but
> O(nr_descendants_which_have_run_since_last_read).  That way, we can
> show the basic stats without taxing the hot paths with reasonable
> scalability.

Right, that would be good.

> I have a couple questions about cpuacct tho.
>
> * The stat file is sampling based and the usage files are calculated
>   from actual scheduling events.  Is this because the latter is more
>   accurate?

So I actually don't know the history of this stuff too well. But I would
think so. This all looks rather dodgy.

> * Why do we have user/sys breakdown in usage numbers?  It tries to
>   distinguish user or sys by looking at task_pt_regs().  I can't see
>   how this would work (e.g. interrupt handlers never schedule) and w/o
>   kernel preemption, the sys part is always zero.  What is this number
>   supposed to mean?

For normal scheduler stuff we account the total runtime in ns and use
the user/kernel tick samples to divide it into user/kernel time parts.
See cputime_adjust().

But looking at the cpuacct I have no clue, that looks wonky at best.

Ideally we'd reuse the normal cputime code and do the same thing
per-cgroup, but clearly that isn't happening now.

I never really looked further than that cpuacct_charge() doing _another_
cgroup iteration, even though we already account that delta to each
cgroup (modulo scheduling class crud).
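
To make the runtime split mentioned above concrete, here is a rough
userspace sketch of dividing a precisely accounted runtime (in ns) into
user and system parts in proportion to the tick samples. This is not the
kernel's cputime_adjust(), which also keeps the reported values monotonic
and handles overflow differently. The names split_runtime() and struct
cputime_split are made up for illustration, and giving all the runtime to
user time when there are no system samples is a choice made for this
sketch.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch: the two parts of the runtime. */
struct cputime_split {
	uint64_t utime_ns;
	uint64_t stime_ns;
};

/*
 * Split rtime_ns into user/system parts using the ratio of tick samples.
 * Assumes utime_ticks + stime_ticks does not overflow a uint64_t.
 */
static struct cputime_split split_runtime(uint64_t rtime_ns,
					  uint64_t utime_ticks,
					  uint64_t stime_ticks)
{
	struct cputime_split out = { 0, 0 };
	uint64_t total = utime_ticks + stime_ticks;

	/* No system samples (or no samples at all): call it all user time. */
	if (stime_ticks == 0) {
		out.utime_ns = rtime_ns;
		return out;
	}

	/* No user samples: call it all system time. */
	if (utime_ticks == 0) {
		out.stime_ns = rtime_ns;
		return out;
	}

	if (rtime_ns > UINT64_MAX / stime_ticks) {
		/* Product would overflow: divide first, losing some precision. */
		out.stime_ns = (rtime_ns / total) * stime_ticks;
	} else {
		out.stime_ns = rtime_ns * stime_ticks / total;
	}

	/* The remainder is user time, so the parts always sum to rtime_ns. */
	out.utime_ns = rtime_ns - out.stime_ns;
	return out;
}
```

With 3 user ticks and 1 system tick, a 1000 ns runtime comes out as
750 ns user and 250 ns system. Doing the same thing per-cgroup would
need both the runtime and the tick samples accumulated per cgroup, which
is the part that isn't happening in cpuacct now.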