On Mon, Aug 28, 2023 at 6:57 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: > > Hi, > > On 8/27/2023 10:53 PM, Yonghong Song wrote: > > > > > > On 8/27/23 1:37 AM, Björn Töpel wrote: > >> Björn Töpel <bjorn@xxxxxxxxxx> writes: > >> > >>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes: > >>> > >>>> Hi, > >>>> > >>>> On 8/26/2023 5:23 PM, Björn Töpel wrote: > >>>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes: > >>>>> > >>>>>> Hi, > >>>>>> > >>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote: > >>>>>>> > >>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote: > >>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf > >>>>>>>> selftests on bpf-next 9e3b47abeb8f. > >>>>>>>> > >>>>>>>> I'm able to reproduce the hang by multiple runs of: > >>>>>>>> | ./test_progs -a link_api -a linked_list > >>>>>>>> I'm currently investigating that. > >>>>>>>> > >>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit: > >>>>>>>> | ------------[ cut here ]------------ > >>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 > >>>>>>>> bpf_mem_refill+0x1fc/0x206 > >>>>>>>> | Modules linked in: bpf_testmod(OE) > >>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE > >>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2 > >>>>>>>> | Hardware name: riscv-virtio,qemu (DT) > >>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206 > >>>>>>>> | ra : irq_work_single+0x68/0x70 > >>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : > >>>>>>>> ff2000000001be20 > >>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : > >>>>>>>> 0000000000046600 > >>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : > >>>>>>>> ff2000000001be70 > >>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : > >>>>>>>> ff600003fef4b000 > >>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : > >>>>>>>> 0000000000000060 > >>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : > >>>>>>>> 0000000000735049 > >>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : > >>>>>>>> 0000000000001000 > >>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : > >>>>>>>> ffffffff82d6bd30 > >>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: > >>>>>>>> 000000000000ffff > >>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : > >>>>>>>> 0000000000000000 > >>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000 > >>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause: > >>>>>>>> 0000000000000003 > >>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206 > >>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70 > >>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36 > >>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66 > >>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4 > >>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8 > >>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 > >>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa > >>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88 > >>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 > >>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e > >>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86 > >>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98 > >>>>>>>> | ---[ end trace 0000000000000000 ]--- > >>>>>>>> > >>>>>>>> Code: > >>>>>>>> | static void free_bulk(struct bpf_mem_cache *c) > >>>>>>>> | { > >>>>>>>> | struct bpf_mem_cache *tgt = c->tgt; > >>>>>>>> | struct llist_node *llnode, *t; > >>>>>>>> | unsigned long flags; > >>>>>>>> | int cnt; > >>>>>>>> | > >>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size); > >>>>>>>> | ... > >>>>>>>> > >>>>>>>> I'm not well versed in the memory allocator; Before I dive into > >>>>>>>> it -- > >>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit? > >>>>>>> Maybe take a look at the patch > >>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects. > >>>>>>> > >>>>>>> In the above patch, we have > >>>>>>> > >>>>>>> + /* > >>>>>>> + * Remember bpf_mem_cache that allocated this object. > >>>>>>> + * The hint is not accurate. > >>>>>>> + */ > >>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode; > >>>>>>> > >>>>>>> I suspect that the warning may be related to the above. > >>>>>>> I tried the above ./test_progs command line (running multiple > >>>>>>> at the same time) and didn't trigger the issue. > >>>>>> The extra 8-bytes before the freed pointer is used to save the > >>>>>> pointer > >>>>>> of the original bpf memory allocator where the freed pointer came > >>>>>> from, > >>>>>> so unit_free() could free the pointer back to the original > >>>>>> allocator to > >>>>>> prevent alloc-and-free unbalance. > >>>>>> > >>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do > >>>>>> not > >>>>>> find anything suspicious after checking linked_list. Another > >>>>>> possibility > >>>>>> is that there is write-after-free problem which corrupts the extra > >>>>>> 8-bytes before the freed pointer. Could you please apply the > >>>>>> following > >>>>>> debug patch to check whether or not the extra 8-bytes are > >>>>>> corrupted ? > >>>>> Thanks for getting back! > >>>>> > >>>>> I took your patch for a run, and there's a hit: > >>>>> | bad cache ff5ffffffffe8570: got size 96 work > >>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work > >>>>> ffffffff801b19c8 > >>>> > >>>> The extra 8-bytes are not corrupted. Both of these two > >>>> bpf_mem_cache are > >>>> valid and there are in the cache array defined in bpf_mem_caches. BPF > >>>> memory allocator allocated the pointer from 96-bytes sized-cache, > >>>> but it > >>>> tried to free the pointer through 128-bytes sized-cache. > >>>> > >>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr - > >>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the > >>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please > >>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and > >>>> using the > >>>> following command to check whether there is 96-bytes slab in your > >>>> system: > >>> > >>> KMALLOC_MIN_SIZE is 64. > >>> > >>>> $ cat /proc/slabinfo |grep kmalloc-96 > >>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 > >>>> 0 : slabdata 0 0 0 > >>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0 > >>>> 0 : slabdata 54 54 0 > >>>> > >>>> In my system, slab has 96-bytes cached, so grep outputs something, > >>>> but I > >>>> think there will no output in your system. > >>> > >>> You're right! No kmalloc-96. > >> > >> To get rid of the warning, limit available sizes from > >> bpf_mem_alloc_init()? > > It is not enough. We need to adjust size_index accordingly during > initialization. Could you please try the attached patch below ? It is > not a formal patch and I am considering to disable prefilling for these > redirected bpf_mem_caches. > > > > Do you know why your system does not have kmalloc-96? > > According to the implementation of setup_kmalloc_cache_index_table() and > create_kmalloc_caches(), when KMALLOC_MIN_SIZE is greater than 64, > kmalloc-96 will be omitted. If KMALLOC_MIN_SIZE is greater than 128, > kmalloc-192 will be omitted as well. Great catch. The fix looks good. Please submit it officially and add an error check to bpf_mem_alloc_init() that verifies that ksize() matches the expectations. The alternative is to use kmalloc_size_roundup() during alloc for checking instead of ksize(). Technically we can use kmalloc_size_roundup in unit_alloc() and avoid setup_kmalloc_cache_index_table()-like copy paste, but performance overhead might be too high. So your patch + error check at bpf_mem_alloc_init() is preferred.