Re: [PATCH v4 0/4] Deterministic charging of shared memory

Mina Almasry <almasrymina@xxxxxxxxxx> · Fri, 19 Nov 2021 21:27:34 -0800

On Fri, Nov 19, 2021 at 9:01 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. There are a number of considerations
> > in this case:
> >
> > 1. It's not great to kill the allocating process since the allocating process
> >    is not running in the memcg under oom, and killing it will not free memory
> >    in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> >    somehow. If not, the process will forever loop the pagefault in the upstream
> >    kernel.
> >
> > In this case, I propose simply failing the remote charge and returning ENOSPC
> > to the caller. This will cause the process executing the remote charge to get
> > an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault path. This
> > will be documented behavior of remote charging, and this feature is opt-in.
> > Users can:
> > - Not opt into the feature if they want.
> > - Opt into the feature and accept the risk of receiving ENOSPC or SIGBUS and
> >   abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> >   operation without executing the remote charge if possible.
>
> Why is ENOSPC the right error instead of ENOMEM?

Returning ENOMEM from mem_cgroup_charge_mapping() will cause the
application to get ENOMEM from non-pagefault paths (which is perfectly
fine), and get stuck in a loop trying to resolve the pagefault in the
pagefault path (less fine). The logic is here:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1432

ENOMEM gets bubbled up here as VM_FAULT_OOM, and on remote charges the
behavior I see is that the kernel loops the pagefault forever until
memory is freed in the remote memcg, which may never happen.
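
As a rough illustration of the two outcomes, here is a small userspace
model of the errno-to-fault-code mapping. The enum names and both
helpers are hypothetical and are not the kernel's types. The split
between -ENOMEM and every other error mirrors what the kernel's
vmf_error() helper does, but whether the charge path in this series
goes through an equivalent conversion is an assumption.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's vm_fault_t codes (values are illustrative). */
enum fault_result { FAULT_OK = 0, FAULT_OOM, FAULT_SIGBUS };

/* What the faulting task ends up observing; names invented for this sketch. */
enum fault_outcome { OUTCOME_DONE, OUTCOME_RETRY_FAULT, OUTCOME_SIGBUS };

/*
 * Map a charge error (0 or a negative errno) to a fault code:
 * -ENOMEM becomes the OOM code, any other error becomes SIGBUS.
 */
static inline enum fault_result charge_err_to_fault(int err)
{
	if (err == 0)
		return FAULT_OK;
	if (err == -ENOMEM)
		return FAULT_OOM;
	return FAULT_SIGBUS;
}

/*
 * Model of what the task sees: after an OOM result the faulting
 * instruction is simply retried, while a SIGBUS result delivers a
 * signal to the task.
 */
static inline enum fault_outcome fault_outcome_for(enum fault_result res)
{
	switch (res) {
	case FAULT_OOM:
		return OUTCOME_RETRY_FAULT;
	case FAULT_SIGBUS:
		return OUTCOME_SIGBUS;
	default:
		return OUTCOME_DONE;
	}
}
```

In this model, fault_outcome_for(charge_err_to_fault(-ENOMEM)) is the
retry case, which never terminates if nothing in the remote memcg can
be reclaimed or killed, while -ENOSPC ends in a signal.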

ENOSPC gets bubbled up here as VM_FAULT_SIGBUS and sends a SIGBUS to
the allocating process. The conjecture here is that it's preferable
to send a SIGBUS to the allocating process rather than have it stuck
in a loop trying to resolve a pagefault.
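
For userspace that wants to handle the SIGBUS gracefully rather than
abort, one common pattern is to guard the memory accesses with
sigsetjmp()/siglongjmp(). The sketch below is only an illustration:
touch_pages() and demo_touch_past_eof() are hypothetical names, and
since remote charging is not available to test against, the demo
provokes SIGBUS by touching a shared mapping past the end of an empty
file instead of by hitting a memcg limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_jmp;
static volatile sig_atomic_t sigbus_armed;

static void on_sigbus(int sig)
{
	(void)sig;
	if (sigbus_armed)
		siglongjmp(sigbus_jmp, 1);
	/* Unexpected SIGBUS: restore the default action so the retried access is fatal. */
	signal(SIGBUS, SIG_DFL);
}

/* Touch one byte per page. Returns 0 on success, -1 if a SIGBUS was caught. */
static int touch_pages(volatile char *addr, size_t len, size_t page_size)
{
	struct sigaction sa, old;
	int rc = 0;

	assert(page_size > 0);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(sigbus_jmp, 1) == 0) {
		sigbus_armed = 1;
		for (size_t off = 0; off < len; off += page_size)
			addr[off] = addr[off];	/* read then write: faults the page in */
	} else {
		rc = -1;	/* arrived here via siglongjmp() from the handler */
	}
	sigbus_armed = 0;
	sigaction(SIGBUS, &old, NULL);
	return rc;
}

/*
 * Demo: map one page of an empty temporary file and touch it.
 * Returns -1 if SIGBUS was caught, 0 if the touch succeeded,
 * and 1 if the demo could not be set up.
 */
int demo_touch_past_eof(void)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *f = tmpfile();
	char *p;
	int rc;

	if (page <= 0 || f == NULL)
		return 1;
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fileno(f), 0);
	if (p == MAP_FAILED) {
		fclose(f);
		return 1;
	}
	rc = touch_pages(p, (size_t)page, (size_t)page);
	munmap(p, (size_t)page);
	fclose(f);
	return rc;
}
```

A caller that gets -1 back could then fall back to a path that avoids
the remote charge, which corresponds to the third option in the quoted
list above. Note that the jump buffer here is a single global, so this
sketch is not safe for concurrent use from multiple threads.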