Hello T.J.

A curious case.

I was staring at the code, and every way this could happen that occurred
to me would imply that css_set_lock doesn't work.

OTOH, I can bring the reproducer to rmdir()=-EBUSY on my machine
(6.4.12-1-default) [1].

I notice that there are 2*nr_cpus parallel readers of cgroup.procs, and
a single thread's testimony is enough to consider the cgroup empty.
Could it be that despite the 200ms delay, some of the threads see the
cgroup empty _yet_?
(I didn't do my own tracing, but by reducing the delay I could reduce
the time before EBUSY was hit; otherwise it took several minutes (on top
of desktop background).)

On Tue, Oct 03, 2023 at 11:01:46AM -0700, "T.J. Mercier" <tjmercier@xxxxxxxxxx> wrote:
...
> > The trace events look like this when the problem occurs. I'm guessing
> > the rmdir is attempted in that window between signal_deliver and
> > cgroup_notify_populated = 0.

But rmdir() happens after an empty cgroup.procs was spotted, right?
(That's why it is curious.)

> > However on Android we retry the rmdir for 2 seconds after cgroup.procs
> > is empty and we're still occasionally hitting the failure. On my
> > primary phone with 3 days of uptime I see a handful of cases, but the
> > problem is orders of magnitude worse on Samsung's device.

Would there also be short-lived members of the cgroups, and reading of
cgroup.procs under load?

Thanks,
Michal

[1] FTR, a hunk to run it without sudo on a modern desktop:

-static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/A";
-static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/B";
+static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/a";
+static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/b";
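
For illustration only, here is a minimal sketch of the pattern discussed
above: one reader takes a snapshot of cgroup.procs, treats an empty read
as "cgroup is empty", and then retries rmdir() while it fails with
EBUSY. This is not the reproducer's code; the helper names
procs_file_empty() and remove_when_empty(), the attempt count and the
delay are all made up for this example.

```cpp
#include <cerrno>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>
#include <unistd.h>

// Hypothetical helper: true if <cg>/cgroup.procs lists no PID at the
// moment of this read. It is a snapshot taken by one reader only.
bool procs_file_empty(const std::filesystem::path& cg) {
    std::ifstream in(cg / "cgroup.procs");
    if (!in)
        return false;  // unreadable: do not claim the cgroup is empty
    std::string pid;
    return !(in >> pid);  // no token could be read, so no PID is listed
}

// Hypothetical helper: once cgroup.procs reads empty, try rmdir() and
// keep retrying while it fails with EBUSY. Returns 0 on success, the
// last errno otherwise (ETIMEDOUT if the cgroup never looked empty).
int remove_when_empty(const std::filesystem::path& cg, int attempts,
                      std::chrono::milliseconds delay) {
    int last_error = ETIMEDOUT;
    for (int i = 0; i < attempts; ++i) {
        if (procs_file_empty(cg)) {
            if (::rmdir(cg.c_str()) == 0)
                return 0;
            last_error = errno;
            if (last_error != EBUSY)
                return last_error;  // not the case discussed here
        }
        std::this_thread::sleep_for(delay);
    }
    return last_error;
}
```

A caller might use something like
remove_when_empty(CG_A_PATH, 10, std::chrono::milliseconds(200)); an
EBUSY result would then correspond to the failure seen with the
reproducer.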