Re: [PATCH bpf-next v2 1/7] bpf: implement lookup-free direct value access


 




On 3/1/19 11:51 AM, Daniel Borkmann wrote:
> On 03/01/2019 06:18 PM, Yonghong Song wrote:
>> On 2/28/19 3:18 PM, Daniel Borkmann wrote:
>>> This generic extension to BPF maps allows for directly loading an
>>> address residing inside a BPF map value as a single BPF ldimm64
>>> instruction.
>>>
>>> The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
>>> is a special src_reg flag for ldimm64 instruction that indicates
>>> that inside the first part of the double insns's imm field is a
>>> file descriptor which the verifier then replaces as a full 64bit
>>> address of the map into both imm parts.
>>>
>>> For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea
>>> is similar: the first part of the double insns's imm field is
>>> again a file descriptor corresponding to the map, and the second
>>> part of the imm field is an offset. The verifier will then replace
>>> both imm parts with an address that points into the BPF map value
>>> for maps that support this operation. BPF_PSEUDO_MAP_VALUE is a
>>> distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not
>>> differ offset 0 between load of map pointer versus load of map's
>>> value at offset 0.
>>>
>>> This allows for efficiently retrieving an address to a map value
>>> memory area without having to issue a helper call which needs to
>>> prepare registers according to calling convention, etc, without
>>> needing the extra NULL test, and without having to add the offset
>>> in an additional instruction to the value base pointer.
>>>
>>> The verifier then treats the destination register as PTR_TO_MAP_VALUE
>>> with constant reg->off from the user passed offset from the second
>>> imm field, and guarantees that this is within bounds of the map
>>> value. Any subsequent operations are normally treated as typical
>>> map value handling without anything else needed for verification.
>>>
>>> The two map operations for direct value access have been added to
>>> array map for now. In future other types could be supported as
>>> well depending on the use case. The main use case for this commit
>>> is to allow for BPF loader support for global variables that
>>> reside in .data/.rodata/.bss sections such that we can directly
>>> load the address of them with minimal additional infrastructure
>>> required. Loader support has been added in subsequent commits for
>>> libbpf library.
>>
>> The patch version #1 provides a way to replace the load with
>> immediate (presumably read-only data). This will be good for
>> the use case like below:
>>
>>      if (static_variable_kernel_version == V1) {
>>          /* code here will work for kernel V1 */
>>          ... access helpers available for V1 ...
>>      } else if (static_variable_kernel_version == V2) {
>>          /* code here will work for kernel V2 */
>>          ... access helpers available for V2 ...
>>      }
>>
>> The approach here did not replace the map value access with values from
>> e.g., readonly section for which libbpf could provide an interface to
>> fill in data from user.
>>
>> This may require a little more analysis, e.g.,
>>      ptr = ld_imm64 from a readonly section
>>      ...
>>      *(u32 *)ptr;
>>      *(u64 *)(ptr + 8);
>>      ...
>>
>> Do you think we could do this in kernel verifier or we should
>> push the whole readonly stuff into user space?
> 
> And in your case the static_variable_kernel_version would be determined
> at runtime, for example, where you then would want to eliminate all the
> other branches, right? Meaning, you'd need a way to turn this into an imm
> load such that verifier will detect these dead branches and patch them

Yes, the program will be compiled once and deployed to many hosts,
which may run different kernel versions. The
static_variable_kernel_version is determined

> out, which it should already be able to do. How would you mark these
> special vars like static_variable_kernel_version such that they have
> special treatment from the rest, some sort of builtin? Potentially one

A libbpf API is needed to assign a particular value to a readonly 
section value. For example, a bpf program may look like:

-bash-4.4$ cat g1.c
static volatile const unsigned __kernel_version;
int prog() {
   unsigned kernel_ver = __kernel_version;

   if (kernel_ver == 411)
     return 0;
   else if (kernel_ver == 416)
     return 1;
   return 2;
}
-bash-4.4$ clang -target bpf -O2 -c g1.c 

-bash-4.4$ llvm-readelf -r g1.o

Relocation section '.rel.text' at offset 0x178 contains 1 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name
0000000000000000  0000000500000001 R_BPF_64_64            0000000000000000 .rodata
-bash-4.4$ llvm-objdump -d g1.o 


g1.o:   file format ELF64-BPF

Disassembly of section .text:
0000000000000000 prog:
        0:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r1 = 0 ll
        2:       61 11 00 00 00 00 00 00         r1 = *(u32 *)(r1 + 0)
        3:       b7 02 00 00 01 00 00 00         r2 = 1
        4:       15 01 01 00 a0 01 00 00         if r1 == 416 goto +1 <LBB0_2>
        5:       b7 02 00 00 02 00 00 00         r2 = 2

0000000000000030 LBB0_2:
        6:       b7 00 00 00 00 00 00 00         r0 = 0
        7:       15 01 01 00 9b 01 00 00         if r1 == 411 goto +1 <LBB0_4>
        8:       bf 20 00 00 00 00 00 00         r0 = r2

0000000000000048 LBB0_4:
        9:       95 00 00 00 00 00 00 00         exit
-bash-4.4$ llvm-readelf -S g1.o
There are 9 section headers, starting at offset 0x1f8:

Section Headers:
   [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
   [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
   [ 1] .strtab           STRTAB          0000000000000000 000189 000068 00      0   0  1
   [ 2] .text             PROGBITS        0000000000000000 000040 000050 00  AX  0   0  8
   [ 3] .rel.text         REL             0000000000000000 000178 000010 10      8   2  8
   [ 4] .rodata           PROGBITS        0000000000000000 000090 000004 00   A  0   0  4
   [ 5] .BTF              PROGBITS        0000000000000000 000094 000019 00      0   0  1
   [ 6] .BTF.ext          PROGBITS        0000000000000000 0000ad 000020 00      0   0  1
   [ 7] .llvm_addrsig     LLVM_ADDRSIG    0000000000000000 000188 000001 00   E  8   0  1
   [ 8] .symtab           SYMTAB          0000000000000000 0000d0 0000a8 18      1   6  8
Key to Flags:
   W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
   I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
   O (extra OS processing required) o (OS specific), p (processor specific)
-bash-4.4$ llvm-readelf -s g1.o 


Symbol table '.symtab' contains 7 entries:
    Num:    Value          Size Type    Bind   Vis      Ndx Name
      0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
      1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS g1.c
      2: 0000000000000030     0 NOTYPE  LOCAL  DEFAULT    2 LBB0_2
      3: 0000000000000048     0 NOTYPE  LOCAL  DEFAULT    2 LBB0_4
      4: 0000000000000000     4 OBJECT  LOCAL  DEFAULT    4 __kernel_version
      5: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 .rodata
      6: 0000000000000000    80 FUNC    GLOBAL DEFAULT    2 prog
-bash-4.4$

The relocation is for the first insn.
The address is the start of .rodata section, which happens to
match variable __kernel_version (size 4).
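
To make the Info column concrete, here is a small sketch that splits
it the same way the ELF64_R_SYM and ELF64_R_TYPE macros from <elf.h>
do. The two helper names are made up for illustration only.

```c
#include <stdint.h>

/* ELF64 keeps the symbol table index in the upper 32 bits of r_info
 * and the relocation type in the lower 32 bits. The helper names are
 * illustrative, not part of any library. */
static uint32_t rel_sym_index(uint64_t r_info)
{
	return (uint32_t)(r_info >> 32);
}

static uint32_t rel_type(uint64_t r_info)
{
	return (uint32_t)(r_info & 0xffffffffu);
}
```

For the Info value 0000000500000001 above, this yields symbol index 5,
which is the .rodata SECTION entry in the symbol table, and type 1,
which readelf prints as R_BPF_64_64.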

The libbpf API can provide a way for the user to assign a value to a
read-only section. In this particular case, e.g., on HostA,
__kernel_version is assigned 416, which means the first 4 bytes of
.rodata are modified to hold the value 416. Considering this is a
generic interface, the API may look like
   bpf_object__change_readonly_value(const char *var_name, void *val_buf,
     unsigned val_buf_size);
libbpf will change the value if there is a "var_name" in the rodata
section and val_buf_size matches the size in the symbol table.
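
A rough sketch of what such a helper might do internally, assuming
libbpf keeps a private copy of the .rodata bytes plus a list of the
OBJECT symbols that live in it. struct ro_sym and the function below
are hypothetical and do not exist in libbpf; a real API would take
the bpf_object as well.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one OBJECT symbol in .rodata. */
struct ro_sym {
	const char *name;	/* e.g. "__kernel_version" */
	size_t offset;		/* st_value: offset inside .rodata */
	size_t size;		/* st_size */
};

/* Overwrite the bytes backing var_name in a copy of .rodata.
 * Returns 0 on success, -ENOENT if no such symbol exists, and
 * -EINVAL on a size mismatch or if the symbol does not fit. */
static int change_readonly_value(unsigned char *rodata, size_t rodata_size,
				 const struct ro_sym *syms, size_t nr_syms,
				 const char *var_name, const void *val_buf,
				 size_t val_buf_size)
{
	for (size_t i = 0; i < nr_syms; i++) {
		if (strcmp(syms[i].name, var_name) != 0)
			continue;
		if (syms[i].size != val_buf_size)
			return -EINVAL;
		if (syms[i].offset > rodata_size ||
		    val_buf_size > rodata_size - syms[i].offset)
			return -EINVAL;
		memcpy(rodata + syms[i].offset, val_buf, val_buf_size);
		return 0;
	}
	return -ENOENT;
}
```

On HostA the loader would call this with "__kernel_version" and a
4-byte buffer holding 416 before creating the map that backs .rodata.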

> could get away with doing this from loader side if it's simple enough,
> though one thing that would be good to avoid is to duplicate all the
> complex branch fixup logic etc that we have in kernel already. Are you

I totally agree that the kernel is already able to prune dead code while
maintaining correct func/line info. We should do that part in the kernel.

Let us look at the bytecode:

0000000000000000 prog:
        0:       r1 = 0 ll
        2:       r1 = *(u32 *)(r1 + 0)
        3:       r2 = 1
        4:       if r1 == 416 goto +1 <LBB0_2>
        5:       r2 = 2

0000000000000030 LBB0_2:
        6:       r0 = 0
        7:       if r1 == 411 goto +1 <LBB0_4>
        8:       r0 = r2

0000000000000048 LBB0_4:
        9:       exit

Here, the goal is to let r1 at insn #2 get the constant.
Do you think we can get it from the kernel? In this particular case,
insn #0 gets a romap_ptr with the address of the rodata section at
offset 0, and insn #2 loads a u32 from romap offset 0, where the value
is already populated, e.g., 416.

The verifier is path sensitive, so extra care will be needed to
perform such a transformation in case it is invalid in different paths.
Maybe a slight extension of the verifier is able to do this?
Initially we do not need to handle complicated cases. Most global/static
variable accesses look like
    r1 = #num ll
    r1 = *(type *)(r1 + offset)
If there is a branch into the middle of the above pair of insns
and r1 is romap_ptr, it is totally safe to replace the second insn
with r1 = constant, which can enable later dead code elimination.
If all read-only region accesses are converted to constants, the
"r1 = #num ll" ld_imm64 insns can be removed as well.
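
As a minimal sketch of the simple case only (this is not verifier
code), the rewrite of the second insn could look like the following.
struct insn is a simplified stand-in for the BPF instruction, the
opcode values match the 18/61 bytes in the objdump output above, and
fold_ro_load is a made-up name. It assumes the ld_imm64 imm already
holds the offset into the rodata copy, and it leaves all control-flow
checks to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OP_LD_IMM64	0x18	/* BPF_LD  | BPF_IMM | BPF_DW */
#define OP_LDX_MEM_W	0x61	/* BPF_LDX | BPF_MEM | BPF_W  */
#define OP_MOV32_IMM	0xb4	/* BPF_ALU | BPF_MOV | BPF_K  */

/* Simplified instruction: same fields as a BPF insn, not the same
 * bit layout. */
struct insn {
	uint8_t code;
	uint8_t dst;
	uint8_t src;
	int16_t off;
	int32_t imm;
};

/* If insns[i] is a ld_imm64 into rX (slots i and i+1) and insns[i+2]
 * is "rY = *(u32 *)(rX + off)", turn the load into "rY = <u32 read
 * from rodata>". Returns true if the insn was rewritten. */
static bool fold_ro_load(struct insn *insns, size_t cnt, size_t i,
			 const unsigned char *rodata, size_t rodata_size)
{
	if (cnt < 3 || i > cnt - 3)
		return false;

	struct insn *ld = &insns[i], *ldx = &insns[i + 2];

	if (ld->code != OP_LD_IMM64 || ldx->code != OP_LDX_MEM_W)
		return false;
	if (ldx->src != ld->dst)
		return false;

	int64_t pos = (int64_t)ld->imm + ldx->off;

	if (pos < 0 || (uint64_t)pos + sizeof(uint32_t) > rodata_size)
		return false;

	uint32_t val;

	memcpy(&val, rodata + pos, sizeof(val));
	/* The 32-bit mov zero-extends, like the u32 load it replaces. */
	ldx->code = OP_MOV32_IMM;
	ldx->src = 0;
	ldx->off = 0;
	ldx->imm = (int32_t)val;
	return true;
}
```

With .rodata holding 416, insn #2 of the example becomes "r1 = 416",
after which the existing dead code elimination can resolve both
branches.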

> thinking to mark these via BTF in some way such that loader does inline
> replacement?

I have not thought about BTF. BTF could provide information about insn
#2 referring to a particular read-only section location. But it looks
like the verifier is able to track it as well in the above?

Let us first study whether doing this without BTF is okay. If needed,
we can go the BTF path with compiler assistance.

> 
> Thanks,
> Daniel
> 



