On Wed, Mar 23, 2022 at 10:17:28AM +0100, Michal Hocko wrote:

> > Neither is it "normal" that a VM is scheduled out long enough to miss a
> > 1 second deadline. That might be considered normal by cloud folks, but
> > that's absolute not normal from an OS POV. Again, that's not a OS
> > problem, that's an operator/admin problem.
>
> Thanks for this clarification. I would tend to agree. Following a
> previous example that oom victims can leave inconsistent state behind
> which can influence other processes. I am wondering what kind of
> expectations about the lock protected state can we make when the holder
> of the lock has been interrupted at any random place in the critical
> section.

Right, this is why the new owner gets the OWNER_DIED bit so it can see
something really dodgy happened. Getting that means it needs to validate
state consistency or just print a nice error and fully terminate things.

So robust futexes:

 - rely on userspace to maintain a linked list of held locks,

 - rely on lock acquire to check OWNER_DIED and handle state
   inconsistency.

If userspace manages to screw up either one of those, it's game over.
Nothing we can do about it. Software really has to be built to deal with
this, it doesn't magically work (IOW, in 99% of the cases it just doesn't
work right).

> [...]
> > > And just to be clear, this is clearly a bug in the oom_reaper per se.
> > > Originally I thought that relaxing the locking (using trylock and
> > > retry/bail out on failure) would help but as I've learned earlier this
> > > day this is not really possible because of #PF at least. The most self
> > > contained solution would be to skip over vmas which are backing the
> > > robust list which would allow the regular exit path to do the proper
> > > cleanup.
> >
> > That's not sufficient because you have to guarantee that the relevant
> > shared futex is accessible. See the lock chain example above.
>
> Yeah, my previous understanding was that the whole linked list lives in
> the single mapping and we can just look at their addresses.

Nope; shared futexes live in shared memory and as such the robust_list
entry must live there too. That is, the robust_list entry is embedded in
the lock itself along the lines of:

	struct robust_mutex {
		u32 futex;
		struct robust_list list;
	};

and then you register the robust_list_head with:

	.futex_offset = offsetof(struct robust_mutex, futex) -
			offsetof(struct robust_mutex, list);

or somesuch (glibc does all this). And the locks themselves are spread
all over the place.
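
[ For readers following along: a minimal userspace sketch of the layout
  described above. The struct robust_mutex and my_head names are made up
  for illustration; only set_robust_list() and the robust_list /
  robust_list_head structures from <linux/futex.h> are kernel ABI, and in
  practice glibc (pthread_mutexattr_setrobust) registers the list for you.
  On exit the kernel walks the registered list and sets FUTEX_OWNER_DIED
  in each listed futex word, wherever in (shared) memory it lives. ]

	#include <linux/futex.h>     /* struct robust_list, robust_list_head */
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <stdio.h>

	/* The list entry is embedded next to the futex word, so a lock
	 * living in shared memory carries its robust_list node with it. */
	struct robust_mutex {
		uint32_t futex;
		struct robust_list list;
	};

	static struct robust_list_head my_head = {
		/* Empty list: the head points back at itself. */
		.list		 = { .next = &my_head.list },
		/* Offset from a list entry back to its futex word; negative
		 * here because 'list' sits after 'futex' in robust_mutex. */
		.futex_offset	 = offsetof(struct robust_mutex, futex) -
				   offsetof(struct robust_mutex, list),
		.list_op_pending = NULL,
	};

	int main(void)
	{
		/* Tell the kernel where this thread's robust list lives. */
		if (syscall(SYS_set_robust_list, &my_head, sizeof(my_head))) {
			perror("set_robust_list");
			return 1;
		}
		printf("robust list registered, futex_offset=%ld\n",
		       my_head.futex_offset);
		return 0;
	}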