Hou Tao <houtao@xxxxxxxxxxxxxxx> writes: > Hi, > > On 8/27/2023 10:53 PM, Yonghong Song wrote: >> >> >> On 8/27/23 1:37 AM, Björn Töpel wrote: >>> Björn Töpel <bjorn@xxxxxxxxxx> writes: >>> >>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes: >>>> >>>>> Hi, >>>>> >>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote: >>>>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote: >>>>>>>> >>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote: >>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf >>>>>>>>> selftests on bpf-next 9e3b47abeb8f. >>>>>>>>> >>>>>>>>> I'm able to reproduce the hang by multiple runs of: >>>>>>>>> | ./test_progs -a link_api -a linked_list >>>>>>>>> I'm currently investigating that. >>>>>>>>> >>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit: >>>>>>>>> | ------------[ cut here ]------------ >>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 >>>>>>>>> bpf_mem_refill+0x1fc/0x206 >>>>>>>>> | Modules linked in: bpf_testmod(OE) >>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE >>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2 >>>>>>>>> | Hardware name: riscv-virtio,qemu (DT) >>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206 >>>>>>>>> | ra : irq_work_single+0x68/0x70 >>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : >>>>>>>>> ff2000000001be20 >>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : >>>>>>>>> 0000000000046600 >>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : >>>>>>>>> ff2000000001be70 >>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : >>>>>>>>> ff600003fef4b000 >>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : >>>>>>>>> 0000000000000060 >>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : >>>>>>>>> 0000000000735049 >>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : >>>>>>>>> 0000000000001000 >>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : >>>>>>>>> ffffffff82d6bd30 >>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: >>>>>>>>> 000000000000ffff >>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : >>>>>>>>> 0000000000000000 >>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000 >>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause: >>>>>>>>> 0000000000000003 >>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206 >>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70 >>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36 >>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66 >>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4 >>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8 >>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 >>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa >>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88 >>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 >>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e >>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86 >>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98 >>>>>>>>> | ---[ end trace 0000000000000000 ]--- >>>>>>>>> >>>>>>>>> Code: >>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c) >>>>>>>>> | { >>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt; >>>>>>>>> | struct llist_node *llnode, *t; >>>>>>>>> | unsigned long flags; >>>>>>>>> | int cnt; >>>>>>>>> | >>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size); >>>>>>>>> | ... >>>>>>>>> >>>>>>>>> I'm not well versed in the memory allocator; Before I dive into >>>>>>>>> it -- >>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit? >>>>>>>> Maybe take a look at the patch >>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects. >>>>>>>> >>>>>>>> In the above patch, we have >>>>>>>> >>>>>>>> + /* >>>>>>>> + * Remember bpf_mem_cache that allocated this object. >>>>>>>> + * The hint is not accurate. >>>>>>>> + */ >>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode; >>>>>>>> >>>>>>>> I suspect that the warning may be related to the above. >>>>>>>> I tried the above ./test_progs command line (running multiple >>>>>>>> at the same time) and didn't trigger the issue. >>>>>>> The extra 8-bytes before the freed pointer is used to save the >>>>>>> pointer >>>>>>> of the original bpf memory allocator where the freed pointer came >>>>>>> from, >>>>>>> so unit_free() could free the pointer back to the original >>>>>>> allocator to >>>>>>> prevent alloc-and-free unbalance. >>>>>>> >>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do >>>>>>> not >>>>>>> find anything suspicious after checking linked_list. Another >>>>>>> possibility >>>>>>> is that there is write-after-free problem which corrupts the extra >>>>>>> 8-bytes before the freed pointer. Could you please apply the >>>>>>> following >>>>>>> debug patch to check whether or not the extra 8-bytes are >>>>>>> corrupted ? >>>>>> Thanks for getting back! >>>>>> >>>>>> I took your patch for a run, and there's a hit: >>>>>> | bad cache ff5ffffffffe8570: got size 96 work >>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work >>>>>> ffffffff801b19c8 >>>>> >>>>> The extra 8-bytes are not corrupted. Both of these two >>>>> bpf_mem_cache are >>>>> valid and there are in the cache array defined in bpf_mem_caches. BPF >>>>> memory allocator allocated the pointer from 96-bytes sized-cache, >>>>> but it >>>>> tried to free the pointer through 128-bytes sized-cache. >>>>> >>>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr - >>>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the >>>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please >>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and >>>>> using the >>>>> following command to check whether there is 96-bytes slab in your >>>>> system: >>>> >>>> KMALLOC_MIN_SIZE is 64. >>>> >>>>> $ cat /proc/slabinfo |grep kmalloc-96 >>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 >>>>> 0 : slabdata 0 0 0 >>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0 >>>>> 0 : slabdata 54 54 0 >>>>> >>>>> In my system, slab has 96-bytes cached, so grep outputs something, >>>>> but I >>>>> think there will no output in your system. >>>> >>>> You're right! No kmalloc-96. >>> >>> To get rid of the warning, limit available sizes from >>> bpf_mem_alloc_init()? > > It is not enough. We need to adjust size_index accordingly during > initialization. Could you please try the attached patch below ? It is > not a formal patch and I am considering to disable prefilling for these > redirected bpf_mem_caches. Sorry for the slow response; I'll take it for a spin today. Björn