On 03/01/2019 06:46 AM, Andrii Nakryiko wrote: > On Thu, Feb 28, 2019 at 3:31 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: >> >> This generic extension to BPF maps allows for directly loading an >> address residing inside a BPF map value as a single BPF ldimm64 >> instruction. > > This is great! I'm going to review code more thoroughly tomorrow, but > I also have few questions/suggestions I'd like to discuss, if you > don't mind. Awesome, thanks! >> The idea is similar to what BPF_PSEUDO_MAP_FD does today, which >> is a special src_reg flag for ldimm64 instruction that indicates >> that inside the first part of the double insns's imm field is a >> file descriptor which the verifier then replaces as a full 64bit >> address of the map into both imm parts. >> >> For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea >> is similar: the first part of the double insns's imm field is >> again a file descriptor corresponding to the map, and the second >> part of the imm field is an offset. The verifier will then replace >> both imm parts with an address that points into the BPF map value >> for maps that support this operation. BPF_PSEUDO_MAP_VALUE is a >> distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not >> differ offset 0 between load of map pointer versus load of map's >> value at offset 0. > > Is having both BPF_PSEUDO_MAP_FD and BPF_PSEUDO_MAP_VALUE a desirable > thing? I'm asking because it's seems like it would be really simple to > stick to using just BPF_PSEUDO_MAP_FD and then interpret imm > differently depending on whether it's 0 or not. E.g., we can say that > imm=0 is old BPF_PSEUDO_MAP_FD behavior (loading map addr), but any > other imm value X is really just (X-1) offset into map's value? Or, > given that valid offset is limited to 1<<29, we can set highest-order > bit to 1 and lower bits would be offset? In other words, if we just > need too carve out zero as a special case, then it's easy to do and we > can avoid adding new BPF_PSEUDO_MAP_VALUE. Was thinking about reusing BPF_PSEUDO_MAP_FD initially as mentioned in here, but went for BPF_PSEUDO_MAP_VALUE eventually to have a straight forward mapping. Your suggestion could be done, but it feels more complex than necessary, imho, meaning it might confuse users trying to make sense of a insn dump or verifier output wondering whether the off by one is a bug or not, which won't happen if the offset is exactly the same value as LLVM emits. There is also one more unfortunate reason which I noticed while implementing: in replace_map_fd_with_map_ptr() we never enforced that for BPF_PSEUDO_MAP_FD insns the second imm part must be 0, meaning it could also have a garbage value which would then break loaders in the wild; with the code today this is ignored and then overridden by the map address. We could try to push a patch to stable to reject anything non-zero in the second imm for BPF_PSEUDO_MAP_FD and see if anyone actually notices, and then use some higher-order bit as a selector, but that still would need some extra handling to make the offset clear for users wrt dumps; I can give it a try though to check how much more complex it gets. Worst case if something should really break somewhere, we might need to revert the imm==0 rejection though. Overall, BPF_PSEUDO_MAP_VALUE felt slightly more suitable to me. >> This allows for efficiently retrieving an address to a map value >> memory area without having to issue a helper call which needs to >> prepare registers according to calling convention, etc, without >> needing the extra NULL test, and without having to add the offset >> in an additional instruction to the value base pointer. > > It seems like we allow this only for arrays of size 1 right now. We > can easily generalize this to support not just offset into map's > value, but also specifying integer key (i.e., array index) by > utilizing off fields (16-bit + 16-bit). This would allow to eliminate > any bpf_map_update_elem calls to array maps altogether by allowing to > provide both array index and offset into value in one BPF instruction. > Do you think it's a good addition? Yeah, I've been thinking about this as well that for array-like maps it's easy to support and at the same time lifts the restriction, too. I think it would be useful and straight forward to implement, I can include it into v3. >> The verifier then treats the destination register as PTR_TO_MAP_VALUE >> with constant reg->off from the user passed offset from the second >> imm field, and guarantees that this is within bounds of the map >> value. Any subsequent operations are normally treated as typical >> map value handling without anything else needed for verification. >> >> The two map operations for direct value access have been added to >> array map for now. In future other types could be supported as >> well depending on the use case. The main use case for this commit >> is to allow for BPF loader support for global variables that >> reside in .data/.rodata/.bss sections such that we can directly >> load the address of them with minimal additional infrastructure >> required. Loader support has been added in subsequent commits for >> libbpf library. > > I was considering adding a new kind of map representing contiguous > block of memory (e.g., how about BPF_MAP_TYPE_HEAP or > BPF_MAP_TYPE_BLOB?). It's keys would be offset into that memory > region. Value size is size of the memory region, but it would allow > reading smaller chunks of memory as values. This would provide > convenient interface for poking at global variables from userland, > given offset. > > Libbpf itself would provide higher-level API as well, if there is > corresponding BTF type information describing layout of > .data/.bss/.rodata, so that applications can fetch variables by name > and/or offset, whichever is more convenient. Together with > bpf_spinlock this would allow easy way to customize subsets of global > variables in atomic fashion. > > Do you think that would work? Using array is a bit limiting, because > it doesn't allow to do partial reads/updates, while BPF_MAP_TYPE_HEAP > would be single big value that allows partial reading/updating. If I understand it correctly, the main difference this would have is to be able to use spin_locks in a more fine-grained fashion, right? Meaning, partial reads/updates of the memory area under spin_lock as opposed to having to lock over the full area? Yeah, sounds like a reasonable extension to me that could be done on top of the series, presumably most of the array map logic could also be reused for this which is nice. Thanks a lot, Daniel