Re: [PATCH bpf-next v2 1/7] bpf: implement lookup-free direct value access

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Fri, 1 Mar 2019 10:49:46 +0100

On 03/01/2019 06:46 AM, Andrii Nakryiko wrote:
> On Thu, Feb 28, 2019 at 3:31 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>
>> This generic extension to BPF maps allows for directly loading an
>> address residing inside a BPF map value as a single BPF ldimm64
>> instruction.
> 
> This is great! I'm going to review code more thoroughly tomorrow, but
> I also have few questions/suggestions I'd like to discuss, if you
> don't mind.

Awesome, thanks!

>> The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
>> is a special src_reg flag for ldimm64 instruction that indicates
>> that inside the first part of the double insns's imm field is a
>> file descriptor which the verifier then replaces as a full 64bit
>> address of the map into both imm parts.
>>
>> For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea
>> is similar: the first part of the double insns's imm field is
>> again a file descriptor corresponding to the map, and the second
>> part of the imm field is an offset. The verifier will then replace
>> both imm parts with an address that points into the BPF map value
>> for maps that support this operation. BPF_PSEUDO_MAP_VALUE is a
>> distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not
>> differ offset 0 between load of map pointer versus load of map's
>> value at offset 0.
> 
> Is having both BPF_PSEUDO_MAP_FD and BPF_PSEUDO_MAP_VALUE a desirable
> thing? I'm asking because it's seems like it would be really simple to
> stick to using just BPF_PSEUDO_MAP_FD and then interpret imm
> differently depending on whether it's 0 or not. E.g., we can say that
> imm=0 is old BPF_PSEUDO_MAP_FD behavior (loading map addr), but any
> other imm value X is really just (X-1) offset into map's value? Or,
> given that valid offset is limited to 1<<29, we can set highest-order
> bit to 1 and lower bits would be offset? In other words, if we just
> need too carve out zero as a special case, then it's easy to do and we
> can avoid adding new BPF_PSEUDO_MAP_VALUE.

Was thinking about reusing BPF_PSEUDO_MAP_FD initially as mentioned in
here, but went for BPF_PSEUDO_MAP_VALUE eventually to have a straight
forward mapping. Your suggestion could be done, but it feels more
complex than necessary, imho, meaning it might confuse users trying to
make sense of a insn dump or verifier output wondering whether the off
by one is a bug or not, which won't happen if the offset is exactly the
same value as LLVM emits. There is also one more unfortunate reason
which I noticed while implementing: in replace_map_fd_with_map_ptr()
we never enforced that for BPF_PSEUDO_MAP_FD insns the second imm part
must be 0, meaning it could also have a garbage value which would then
break loaders in the wild; with the code today this is ignored and then
overridden by the map address. We could try to push a patch to stable
to reject anything non-zero in the second imm for BPF_PSEUDO_MAP_FD
and see if anyone actually notices, and then use some higher-order bit
as a selector, but that still would need some extra handling to make
the offset clear for users wrt dumps; I can give it a try though to
check how much more complex it gets. Worst case if something should
really break somewhere, we might need to revert the imm==0 rejection
though. Overall, BPF_PSEUDO_MAP_VALUE felt slightly more suitable to me.

>> This allows for efficiently retrieving an address to a map value
>> memory area without having to issue a helper call which needs to
>> prepare registers according to calling convention, etc, without
>> needing the extra NULL test, and without having to add the offset
>> in an additional instruction to the value base pointer.
> 
> It seems like we allow this only for arrays of size 1 right now. We
> can easily generalize this to support not just offset into map's
> value, but also specifying integer key (i.e., array index) by
> utilizing off fields (16-bit + 16-bit). This would allow to eliminate
> any bpf_map_update_elem calls to array maps altogether by allowing to
> provide both array index and offset into value in one BPF instruction.
> Do you think it's a good addition?

Yeah, I've been thinking about this as well that for array-like maps
it's easy to support and at the same time lifts the restriction, too.
I think it would be useful and straight forward to implement, I can
include it into v3.

>> The verifier then treats the destination register as PTR_TO_MAP_VALUE
>> with constant reg->off from the user passed offset from the second
>> imm field, and guarantees that this is within bounds of the map
>> value. Any subsequent operations are normally treated as typical
>> map value handling without anything else needed for verification.
>>
>> The two map operations for direct value access have been added to
>> array map for now. In future other types could be supported as
>> well depending on the use case. The main use case for this commit
>> is to allow for BPF loader support for global variables that
>> reside in .data/.rodata/.bss sections such that we can directly
>> load the address of them with minimal additional infrastructure
>> required. Loader support has been added in subsequent commits for
>> libbpf library.
> 
> I was considering adding a new kind of map representing contiguous
> block of memory (e.g., how about BPF_MAP_TYPE_HEAP or
> BPF_MAP_TYPE_BLOB?). It's keys would be offset into that memory
> region. Value size is size of the memory region, but it would allow
> reading smaller chunks of memory as values. This would provide
> convenient interface for poking at global variables from userland,
> given offset.
> 
> Libbpf itself would provide higher-level API as well, if there is
> corresponding BTF type information describing layout of
> .data/.bss/.rodata, so that applications can fetch variables by name
> and/or offset, whichever is more convenient. Together with
> bpf_spinlock this would allow easy way to customize subsets of global
> variables in atomic fashion.
> 
> Do you think that would work? Using array is a bit limiting, because
> it doesn't allow to do partial reads/updates, while BPF_MAP_TYPE_HEAP
> would be single big value that allows partial reading/updating.

If I understand it correctly, the main difference this would have is
to be able to use spin_locks in a more fine-grained fashion, right?
Meaning, partial reads/updates of the memory area under spin_lock as
opposed to having to lock over the full area? Yeah, sounds like a
reasonable extension to me that could be done on top of the series,
presumably most of the array map logic could also be reused for this
which is nice.

Thanks a lot,
Daniel