On Tue, Jan 3, 2023 at 11:06 AM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > Hi Andy, > > Thanks for your constructive suggestions. > > On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote: > > > > c) If there's not enough memory to service a page fault, it's not fatal, > > > > and no signal is sent. Instead, writes are simply lost. > > > > This just seems massively overcomplicated to me. If there isn't > > enough memory to fault in a page of code, we don't have some magic > > instruction emulator in the kernel. We either OOM or we wait for > > memory to show up. > > Before addressing the other parts of your email, I thought I'd touch on > this. Quoting from the email I just wrote Ingo: > > | *However* - if your big objection to this patch is that the instruction > | skipping is problematic, we could actually punt that part. The result > | will be that userspace just retries the memory write and the fault > | happens again, and eventually it succeeds. From a perspective of > | vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll > | immediately fallback to the syscall under memory pressure -- but you > | could argue that nobody really cares about performance at that point > | anyway, and so just retrying the fault until it succeeds is a less > | complex behavior that would be just fine. > | > | Let me know if you think that'd be an acceptable compromise, and I'll > | roll it into v15. As a preview, it pretty much amounts to dropping 3/7 > | and editing the commit message in this 2/7 patch. > > IOW, I think the main ideas of the patch work just fine without "point > c" with the instruction skipping. Instead, waiting/retrying could > potentially work. So, okay, it seems like the two of you both hate the > instruction decoder stuff, so I'll plan on working that part in, in one > way or another, for v15. > > > On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote: > > > > The vDSO getrandom() implementation works with a buffer allocated with a > > > > new system call that has certain requirements: > > > > > > > > - It shouldn't be written to core dumps. > > > > * Easy: VM_DONTDUMP. > > > > - It should be zeroed on fork. > > > > * Easy: VM_WIPEONFORK. > > > > I have a rather different suggestion: make a special mapping. Jason, > > you're trying to shoehorn all kinds of bizarre behavior into the core > > mm, and none of that seems to me to belong to the core mm. Instead, > > have an actual special mapping with callbacks that does the right > > thing. No fancy VM flags. > > Oooo! I like this. Avoiding adding VM_* flags would indeed be nice. > I had seen things that I thought looked in this direction with the shmem > API, but when I got into the details, it looked like this was meant for > something else and couldn't address most of what I wanted here. > > If you say this is possible, I'll look again to see if I can figure it > out. Though, if you have some API name at the top of your head, you > might save me some code squinting time. Look for _install_special_mapping(). --Andy > > Want to mlock it? No, don't do that -- that's absurd. Just arrange > > so that, if it gets evicted, it's not written out anywhere. And when > > it gets faulted back in it does the right thing -- see above. > > Presumably mlock calls are redirected to some function pointer so I can > just return EINTR? Or just don't worry about it. If someone mlocks() it, that's their problem. The point is that no one needs to. > > > Zero on fork? I'm sure that's manageable with a special mapping. If > > not, you can add a new vm operation or similar to make it work. (Kind > > of like how we extended special mappings to get mremap right a couple > > years go.) But maybe you don't want to *zero* it on fork and you want > > to do something more intelligent. Fine -- you control ->fault! > > Doing something more intelligent would be an interesting development, I > guess... But, before I think about that, all mapping have flags; > couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the > API you have in mind not work that way? (Side note: I also want > VM_DONTDUMP to work.) You really want unmap (the pages, not the vma) on fork, not wipe on fork. It'll be VM_SHARED, and I'm not sure what VM_WIPEONFORK | VM_SHARED does.