On Fri, Nov 19, 2021 at 9:01 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. There are a number of considerations
> > in this case:
> >
> > 1. It's not great to kill the allocating process since the allocating process
> > is not running in the memcg under oom, and killing it will not free memory
> > in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > somehow. If not, the process will forever loop the pagefault in the upstream
> > kernel.
> >
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause the process executing the remote
> > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > path. This will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt into the feature if they want.
> > - Opt into the feature and accept the risk of receiving ENOSPC or SIGBUS and
> > abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > operation without executing the remote charge if possible.
>
> Why is ENOSPC the right error instead of ENOMEM?

Returning ENOMEM from mem_cgroup_charge_mapping() will cause the application to get ENOMEM from non-pagefault paths (which is perfectly fine), and get stuck in a loop trying to resolve the pagefault in the pagefault path (less fine).
The logic is here:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1432

ENOMEM gets bubbled up here as VM_FAULT_OOM, and on remote charges the behavior I see is that the kernel loops the pagefault forever until memory is freed in the remote memcg, which may never happen. ENOSPC gets bubbled up here as VM_FAULT_SIGBUS and sends a SIGBUS to the allocating process. The conjecture here is that it's preferable to send a SIGBUS to the allocating process rather than have it stuck in a loop trying to resolve a pagefault.
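
To make the difference concrete, here is a rough userspace model of the mapping described above. It is only a sketch, not kernel code: the model_* names and the enum values are hypothetical stand-ins, and it assumes the fault path converts the charge error in the usual way (-ENOMEM becomes an OOM fault, any other error becomes a SIGBUS fault).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's VM_FAULT_* codes. The values
 * are illustrative only and do not match the kernel's definitions. */
enum model_vm_fault {
    MODEL_VM_FAULT_OOM,
    MODEL_VM_FAULT_SIGBUS,
};

/* Assumed mapping from a negative errno returned by the charge path to
 * a fault code: -ENOMEM becomes OOM, anything else becomes SIGBUS. */
enum model_vm_fault model_vmf_error(int err)
{
    assert(err < 0); /* kernel-style errors are negative errno values */
    return err == -ENOMEM ? MODEL_VM_FAULT_OOM : MODEL_VM_FAULT_SIGBUS;
}

/* What the faulting task observes, per the behavior described above. */
const char *model_fault_outcome(enum model_vm_fault fault)
{
    switch (fault) {
    case MODEL_VM_FAULT_OOM:
        return "fault is retried until the remote memcg frees memory";
    case MODEL_VM_FAULT_SIGBUS:
        return "SIGBUS is delivered to the allocating process";
    }
    return "unknown";
}
```

In this model, model_vmf_error(-ENOMEM) yields MODEL_VM_FAULT_OOM (the retry loop), while model_vmf_error(-ENOSPC) yields MODEL_VM_FAULT_SIGBUS, which is the outcome the proposal prefers for remote charges.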