On Tue, Jun 16, 2020 at 4:04 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Mon, Jun 15, 2020 at 09:41:38AM -0700, Stanislav Fomichev wrote: > > On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > [ .. ] > > > > > It's probably ok, but makes me uneasy about verifier consequences. > > > > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > > > > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > > > > > I don't think we have such tests. I guess bpf prog won't be able to read > > > > > anything and nothing will crash, but having PTR_TO_PACKET that is > > > > > actually NULL would be an odd special case to keep in mind for everyone > > > > > who will work on the verifier from now on. > > > > > > > > > > Also consider bpf prog that simply reads something small like 4 bytes. > > > > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > > > > > those 4 bytes, so it's natural for the prog to assume that it can read it. > > > > > It will have > > > > > p = ctx->optval; > > > > > if (p + 4 > ctx->optval_end) > > > > > /* goto out and don't bother logging, since that never happens */ > > > > > *(u32*)p; > > > > > > > > > > but 'clever' user space would pass long optlen and prog suddenly > > > > > 'not seeing' the sockopt. It didn't crash, but debugging would be > > > > > surprising. > > > > > > > > > > I feel it's better to copy the first 4k and let the program see it. > > > > Agreed with the IP_FREEBIND example wrt observability, however it's > > > > not clear what to do with the cropped buffer if the bpf program > > > > modifies it. > > > > > > > > Consider that huge iptables setsockopts where the usespace passes > > > > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. > > > > Now, if the bpf program modifies the buffer (say, flips some byte), we > > > > are back to square one. We either have to silently discard that buffer > > > > or reallocate/merge. My reasoning with data == NULL, is that at least > > > > there is a clear signal that the program can't access the data (and > > > > can look at optlen to see if the original buffer is indeed non-zero > > > > and maybe deny such requests?). > > > > At this point I'm really starting to think that maybe we should just > > > > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an > > > > upper bound :-/ > > > > I'll try to think about this a bit more over the weekend. > > > > > > Yeah. Tough choices. > > > We can also detect in the verifier whether program accessed ctx->optval > > > and skip alloc/copy if program didn't touch it, but I suspect in most > > > case the program would want to read it. > > > I think vmallocing what optlen said is DoS-able. It's better to > > > stick with single page. > > > Let's keep brainstorming. > > Btw, can we use sleepable bpf for that? As in, do whatever I suggested > > in these patches (don't expose optval>PAGE_SIZE via context), but add > > a new helper where you can say 'copy x bytes from y offset of the > > original optval' (the helper will do sleepable copy_form_user). > > That way we have a clean signal to the BPF that the value is too big > > (optval==optval_end==NULL) and the user can fallback to the helper to > > inspect the value. We can also provide another helper to export new > > value for this case. > > sleepable will be read-only and with toctou. > I guess this patch is the least evil then ? > But I'm confused with the test in patch 2. > Why does it do 'if (optval > optval_end)' ? > How is that possible when patch 1 makes them equal. Right, it should really be 'optval < optval_end', good point! I want to make sure in the BPF program that we can't access the buffer (essentially return EPERM to the userpace so the test fails). Will fix in a follow up. > may be another idea: > allocate 4k, copy first 4k into it, but keep ctx.optlen as original. > if bpf prog reads optval and finds it ok in setsockopt, > it can set ctx.optlen = 0 > which would mean run the rest of setsockopt handling with original > '__user *optval' and ignore the buffer that was passed to bpf prog. > In case of ctx.optlen < 4k the behavior won't change from bpf prog > and from kernel pov. > When ctx.optlen > 4k and prog didn't adjust it > __cgroup_bpf_run_filter_setsockopt will do ret = -EFAULT; > and reject sockopt. > So bpf prog would be force to do something with large optvals. > yet it will be able to examine first 4k bytes of it. > It's a bit of change in behavior, but I don't think bpf prog is > doing ctx.optlen = 0, since that's more or less the same as > doing ctx.optlen = -2. > or may be use some other indicator ? > And do something similar with getsockopt. Good suggestion! That sounds doable in theory, let me try and see if I hit something unexpected. I suppose trimming provided optval to zero (optlen=0) is not something that sensible programs would do? In this case please ignore the v4 I posted recently!