> On Jan 24, 2022, at 4:29 AM, Ilya Leoshkevich <iii@xxxxxxxxxxxxx> wrote:
> 
> 
> 
> On 1/23/22 02:03, Song Liu wrote:
>>> On Jan 21, 2022, at 6:12 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>> 
>>> On Fri, Jan 21, 2022 at 5:30 PM Song Liu <songliubraving@xxxxxx> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jan 21, 2022, at 5:12 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>> 
>>>>> On Fri, Jan 21, 2022 at 5:01 PM Song Liu <songliubraving@xxxxxx> wrote:
>>>>>> 
>>>>>> In this way, we need to allocate rw_image here, and free it in
>>>>>> bpf_jit_comp.c. This feels a little weird to me, but I guess that
>>>>>> is still the cleanest solution for now.
>>>>> 
>>>>> You mean inside bpf_jit_binary_alloc?
>>>>> That won't be arch independent.
>>>>> It needs to be split into a generic piece that stays in core.c
>>>>> and callbacks like bpf_jit_fill_hole_t,
>>>>> or into multiple helpers with prep in-between.
>>>>> Don't worry if all archs need to be touched.
>>>> 
>>>> How about we introduce a callback bpf_jit_set_header_size_t? Then we
>>>> can split x86's jit_fill_hole() into two functions, one to fill the
>>>> hole and the other to set the size. The rest of the logic is gonna
>>>> stay the same.
>>>> 
>>>> Archs that do not use bpf_prog_pack won't need bpf_jit_set_header_size_t.
>>> 
>>> That's not any better.
>>> 
>>> Currently the choice of bpf_jit_binary_alloc_pack vs bpf_jit_binary_alloc
>>> leaks into arch bits, and bpf_prog_pack_max_size() doesn't
>>> really make it generic.
>>> 
>>> Ideally all archs continue to use bpf_jit_binary_alloc()
>>> and the magic happens in generic code.
>>> If not, then please remove bpf_prog_pack_max_size(),
>>> since it doesn't provide much value, and pick a
>>> bpf_jit_binary_alloc_pack() signature that fits the x86 jit better.
>>> It wouldn't need the bpf_jit_fill_hole_t callback at all.
>>> Please think it through so we don't need to redesign it
>>> when another arch decides to use huge pages for bpf progs.
>>> 
>>> cc-ing Ilya for ideas on how that would fit s390.
>> 
>> I guess we have a few different questions here:
>> 
>> 1. Can we use bpf_jit_binary_alloc() for both regular pages and shared
>>    huge pages?
>> 
>>    I think the answer is no, as bpf_jit_binary_alloc() allocates a rw
>>    buffer, and the arch calls bpf_jit_binary_lock_ro() after JITing.
>>    The new allocator will return a slice of a shared huge page, which
>>    is locked RO before JITing.
>> 
>> 2. The problem with the bpf_prog_pack_max_size() limitation.
>> 
>>    I think this is the worst part of the current version of
>>    bpf_prog_pack, but it shouldn't be too hard to fix. I will remove
>>    this limitation in the next version.
>> 
>> 3. How to set a proper header->size?
>> 
>>    I guess we can introduce something similar to bpf_arch_text_poke()
>>    for this?
>> 
>> My proposal for the next version is:
>> 
>> 1. No changes to archs that do not use huge pages; they just keep
>>    using bpf_jit_binary_alloc().
>> 
>> 2. For x86_64 (and other archs that would support bpf programs on
>>    huge pages):
>> 
>>    2.1 arch/bpf_jit_comp calls bpf_jit_binary_alloc_pack() to allocate
>>        an RO bpf_binary_header;
>>    2.2 the arch allocates a temporary buffer for the JIT. Once JITing
>>        is done, it uses text_poke_copy() to copy the code into the RO
>>        bpf_binary_header.
> 
> Are arches expected to allocate rw buffers in different ways? If not,
> I would consider putting this into the common code as well. Then
> arch-specific code would do something like:
> 
>   header = bpf_jit_binary_alloc_pack(size, &prg_buf, &prg_addr, ...);
>   ...
>   /*
>    * Generate code into prg_buf; the code should assume that its first
>    * byte is located at prg_addr.
>    */
>   ...
>   bpf_jit_binary_finalize_pack(header, prg_buf);
> 
> where bpf_jit_binary_finalize_pack() would copy prg_buf to header and
> free it.

I think this should work. We will need an API like bpf_arch_text_copy(),
which uses text_poke_copy() for x86_64 and s390_kernel_write() for s390.
We will use bpf_arch_text_copy() to 1) write header->size, and 2) do the
final copy in bpf_jit_binary_finalize_pack(). The signature of
bpf_arch_text_copy() is quite different from the existing
bpf_arch_text_poke(), so I guess a new API is better.
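
Something like the following rough sketch, just to make the idea
concrete (the names, signatures, and error handling here are all
tentative, not final code):

/* Weak default in kernel/bpf/core.c; archs that support bpf_prog_pack
 * override it. Returns dst on success, or an ERR_PTR() on failure.
 */
void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
{
	return ERR_PTR(-ENOTSUPP);
}

/* x86_64 override in arch/x86/net/bpf_jit_comp.c, on top of
 * text_poke_copy(), which returns NULL on failure.
 */
void *bpf_arch_text_copy(void *dst, void *src, size_t len)
{
	void *ret;

	ret = text_poke_copy(dst, src, len);

	return ret ? : ERR_PTR(-EINVAL);
}

/* Generic code in kernel/bpf/core.c: copy the JITed image into its
 * final RO location, then free the temporary rw buffer. size is
 * passed explicitly here because the RO header's size field has not
 * been written yet at this point.
 */
int bpf_jit_binary_finalize_pack(struct bpf_binary_header *header,
				 void *prg_buf, unsigned int size)
{
	void *ret;

	ret = bpf_arch_text_copy(header, prg_buf, size);
	kfree(prg_buf);

	return PTR_ERR_OR_ZERO(ret);
}

Writing header->size would go through the same helper, since the header
itself also lives in the read-only area.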

> If this won't work, I also don't see any big problems in the scheme
> that you propose (especially if the bpf_prog_pack_max_size()
> limitation is gone).
> 
> [...]
> 
> Btw, are there any existing benchmarks that I can use to check whether
> this is worth enabling on s390?

Unfortunately, we don't have a benchmark to share. Most of our
benchmarks are shadow tests that cannot run outside of our production
environment.

We have issues with iTLB misses for most of our big services. A typical
system may see hundreds of iTLB misses per million instructions, and
some sched_cls programs are often among the top triggers of these iTLB
misses.

Thanks,
Song