On 12/17/20 3:07 AM, John Garry wrote: > References to old IO sched requests are currently cleared from the > tagset when freeing those requests; switching elevator or changing > request queue depth is such a scenario in which this occurs. > > However, this does not stop the potentially racy behaviour of freeing > and clearing a request reference between a tagset iterator getting a > reference to a request and actually dereferencing that request. > > Such a use-after-free can be triggered, as follows: > > ================================================================== > BUG: KASAN: use-after-free in bt_iter+0xa0/0x120 > Read of size 8 at addr ffff00108d589300 by task fio/3052 > > CPU: 32 PID: 3052 Comm: fio Tainted: GW > 5.10.0-rc4-64839-g2dcf1ee5054f #693 > Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon > D05 IT21 Nemo 2.0 RC0 04/18/2018 > Call trace: > dump_backtrace+0x0/0x2d0 > show_stack+0x18/0x68 > dump_stack+0x100/0x16c > print_address_description.constprop.12+0x6c/0x4e8 > kasan_report+0x130/0x200 > __asan_load8+0x9c/0xd8 > bt_iter+0xa0/0x120 > blk_mq_queue_tag_busy_iter+0x2d8/0x540 > blk_mq_in_flight+0x80/0xb8 > part_stat_show+0xd8/0x238 > dev_attr_show+0x44/0x90 > sysfs_kf_seq_show+0x128/0x1c8 > kernfs_seq_show+0xa0/0xb8 > seq_read_iter+0x1ec/0x6a0 > seq_read+0x1d0/0x250 > kernfs_fop_read+0x70/0x330 > vfs_read+0xe4/0x250 > ksys_read+0xc8/0x178 > __arm64_sys_read+0x44/0x58 > el0_svc_common.constprop.2+0xc4/0x1e8 > do_el0_svc+0x90/0xa0 > el0_sync_handler+0x128/0x178 > el0_sync+0x158/0x180 > > This is found experimentally by running fio on 2x SCSI disks - 1x disk > holds the root partition. Userspace is constantly triggering the tagset > iter from reading the root (gen)disk partition info. And so if the IO > sched is constantly changed on the other disk, eventually the UAF occurs, > as described above. Hi John, Something is not clear to me. The above call stack includes blk_mq_queue_tag_busy_iter(). That function starts with percpu_ref_tryget(&q->q_usage_counter) and ends with calling percpu_ref_put(&q->q_usage_counter). So it will only iterate over a tag set if q->q_usage_counter is live. However, both blk_mq_update_nr_requests() and elevator_switch() start with freezing the request queue. blk_mq_freeze_queue() starts with killing q->q_usage_counter and waits until that counter has dropped to zero. In other words, blk_mq_queue_tag_busy_iter() should not iterate over a tag set while a tag set is being freed or reallocated. Does this mean that we do not yet have a full explanation about why the above call stack can be triggered? Thanks, Bart.