* Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> [230816 15:18]: > * Jann Horn <jannh@xxxxxxxxxx> [230816 13:13]: > > On Wed, Aug 16, 2023 at 6:18 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote: > > > * Jann Horn <jannh@xxxxxxxxxx> [230815 15:37]: > > > > commit 18b098af2890 ("vma_merge: set vma iterator to correct > > > > position.") added a vma_prev(vmi) call to vma_merge() at a point where > > > > it's still possible to bail out. My understanding is that this moves > > > > the VMA iterator back by one VMA. > > > > > > > > If you patch some extra logging into the kernel and inject a fake > > > > out-of-memory error at the vma_iter_prealloc() call in vma_split() (a > > > > real out-of-memory error there is very unlikely to happen in practice, > > > > I think - my understanding is that the kernel will basically kill > > > > every process on the system except for init before it starts failing > > > > GFP_KERNEL allocations that fit within a single slab, unless the > > > > allocation uses GFP_ACCOUNT or stuff like that, which the maple tree > > > > doesn't): > > [...] > > > > then you'll get this fun log output, showing that the same VMA > > > > (ffff88810c0b5e00) was visited by two iterations of the VMA iteration > > > > loop, and on the second iteration, prev==vma: > > > > > > > > [ 326.765586] userfaultfd_register: begin vma iteration > > > > [ 326.766985] userfaultfd_register: prev=ffff88810c0b5ef0, > > > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000) > > > > [ 326.768786] userfaultfd_register: vma_merge returned 0000000000000000 > > > > [ 326.769898] userfaultfd_register: prev=ffff88810c0b5e00, > > > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000) > > > > > > > > I don't know if this can lead to anything bad but it seems pretty > > > > clearly unintended? > > > > > > Yes, unintended. > > > > > > So we are running out of memory, but since vma_merge() doesn't > > > differentiate between failure and 'nothing to merge', we end up in a > > > situation that we will revisit the same VMA. > > > > > > I've been thinking about a way to work this into the interface and I > > > don't see a clean way because we (could) do different things before the > > > call depending on the situation. > > > > > > I think we need to undo any vma iterator changes in the failure > > > scenarios if there is a chance of the iterator continuing to be used, > > > which is probably not limited to just this case. > > > > I don't fully understand the maple tree interface - in the specific > > case of vma_merge(), could you move the vma_prev() call down below the > > point of no return, after vma_iter_prealloc()? Or does > > vma_iter_prealloc() require that the iterator is already in the insert > > position? > > Yes, but maybe it shouldn't. I detect a write going beyond the end of a > node and take corrective action, but not to the front of a node. > > If I change the internal code to figure out the preallocations without > being pointed at the insert location, I still cannot take corrective > action on failure since I don't know where I should have been within the > tree structure, that is, I have lost the original range. > > I'm still looking at this, but I'm wondering if I should change my > interface for preallocations so I can handle this internally. That > would be a bigger change. > > > > > > I will audit these areas and CC you on the result. Looking at this, I think it's best to make a label and undo the vma_prev() with a vma_next() - at least for now. I'm also reading this for the error path on dup_anon_vma() failure, and it appears to also have an issue which I'd like to point out here before I send the fix for the first issue. ----------- vma_start_write(next); remove = next; /* case 1 */ vma_end = next->vm_end; err = dup_anon_vma(prev, next); if (curr) { /* case 6 */ vma_start_write(curr); remove = curr; remove2 = next; if (!next->anon_vma) err = dup_anon_vma(prev, curr); ----------- Since dup_anon_vma() can fail, I think here in case 6 we could overwrite the failure. That is, we will fail to clone the anon vma and mask the failure if we are running case 6 with an anon in next. Once the first dup_anon_vma() returns error, the next call to clone curr vma may return 0 if there is no anon vma (this, I think _must_ be the case). Then we are in a situation where we will be removing next and expanding prev over curr and next, but have not dup'ed the anon vma from next. Thanks, Liam