On Thu, Jul 19, 2018 at 08:08:20AM -0700, Linus Torvalds wrote: > On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > And as said before, we can compress the state from 12 bytes, to 6 bits > > (or 1 byte), giving another 11 bytes for 59 bytes free. > > > > Leaving us just 5 bytes short of needing a single cacheline :/ > > Do you actually need 64 bits for the times? > > That's the big cost. And it seems ridiculous, if you actually care about size. > > You already have a 64-bit start time. Everything else is some > cumulative relative time. Do those really need 64-bit and nanosecond > resolution? > > Maybe a 32-bit microsecond would be ok - would you ever account more > than 35 minutes of anything without starting anew? D'oh, you're right, the per-cpu buckets don't need to be this big at all. In fact, we flush those deltas out every 2 seconds when there is activity to maintain the running averages. Since we get 4.2s worth of nanoseconds into a u32, we don't even need to divide in the hotpath. Something along the lines of this here should work: static void psi_group_change(struct psi_group *group, int cpu, u64 now, unsigned int clear, unsigned int set) { struct psi_group_cpu *groupc; unsigned int *tasks; unsigned int t; u32 delta; groupc = per_cpu_ptr(group->cpus, cpu); tasks = groupc->tasks; /* Time since last task change on this runqueue */ delta = now - groupc->last_time; groupc->last_time = now; /* Tasks waited for IO? */ if (tasks[NR_IOWAIT]) { if (!tasks[NR_RUNNING]) groupc->full_time[PSI_IO] += delta; else groupc->some_time[PSI_IO] += delta; } /* Tasks waited for memory? */ if (tasks[NR_MEMSTALL]) { if (!tasks[NR_RUNNING] || (cpu_curr(cpu)->flags & PF_MEMSTALL)) groupc->full_time[PSI_MEM] += delta; else groupc->some_time[PSI_MEM] += delta; } /* Tasks waited for the CPU? */ if (tasks[NR_RUNNING] > 1) groupc->some_time[PSI_CPU] += delta; /* Tasks were generally non-idle? To weigh the CPU in summaries */ if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL]) groupc->nonidle_time += delta; /* Update task counts according to the set/clear bitmasks */ for (t = 0; clear; clear &= ~(1 << t), t++) if (clear & (1 << t)) groupc->tasks[t]--; for (t = 0; set; set &= ~(1 << t), t++) if (set & (1 << t)) groupc->tasks[t]++; /* Kick the stats aggregation worker if it's gone to sleep */ if (!delayed_work_pending(&group->clock_work)) schedule_delayed_work(&group->clock_work, PSI_FREQ); } And then we can pack it down to one cacheline: struct psi_group_cpu { /* States of the tasks belonging to this group */ unsigned int tasks[NR_PSI_TASK_COUNTS]; // 3 /* Time sampling bucket for pressure states - no FULL for CPU */ u32 some_time[NR_PSI_RESOURCES]; u32 full_time[NR_PSI_RESOURCES - 1]; /* Time sampling bucket for non-idle state (ns) */ u32 nonidle_time; /* Time of last task change in this group (rq_clock) */ u64 last_time; }; I'm going to go test with this. Thanks -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html