Hi Paolo,
Did you get a chance to take a look on this issue ?
Regards,
Pradeep
On 2019-09-06 21:09, ppvk@xxxxxxxxxxxxxx wrote:
Hi Paolo,
Good Evening!!
We started using BFQ IO Sched. from last 3 Months. it was awesome !!
Its been running great, but we just hit a "free-after-use" crash.
We are running stability test (includes heavy IO, video playbacks etc)
for 48~72hrs. We encountered below error. This crash is seen 2 to 3
times during our stability testing.
120.444910: <6> binder: undelivered transaction 163882, process
died.
120.470254: <6> binder: undelivered transaction 163953, process
died.
120.474841: <6> Unable to handle kernel paging request at virtual
address 006b6b6b6b6b7773
120.484651: <6> Mem abort info:
120.487534: <6> ESR = 0x96000004
120.490680: <6> Exception class = DABT (current EL), IL = 32
bits
120.496760: <6> SET = 0, FnV = 0
120.499906: <6> EA = 0, S1PTW = 0
120.503145: <6> Data abort info:
120.506111: <6> ISV = 0, ISS = 0x00000004
120.510056: <6> CM = 0, WnR = 0
120.513106: <6> [006b6b6b6b6b7773] address between user and
kernel address ranges
120.520437: <6> Internal error: Oops: 96000004 [#1] PREEMPT SMP
120.522998: <6> binder_alloc: 12672: binder_alloc_buf, no vma
120.526162: <6> Modules linked in: wlan(O) machine_dlkm(O)
wcd938x_slave_dlkm(O) wcd938x_dlkm(O) wcd9xxx_dlkm(O) mbhc_dlkm(O)
tx_macro_dlkm(O) rx_macro_dlkm(O) va_macro_dlkm(O) wsa_macro_dlkm(O)
swr_ctrl_dlkm(O) bolero_cdc_dlkm(O) wsa881x_dlkm(O) wcd_core_dlkm(O)
stub_dlkm(O) hdmi_dlkm(O) swr_dlkm(O) pinctrl_lpi_dlkm(O) usf_dlkm(O)
native_dlkm(O) platform_dlkm(O) q6_dlkm(O) adsp_loader_dlkm(O)
apr_dlkm(O) snd_event_dlkm(O) q6_notifier_dlkm(O) q6_pdr_dlkm(O)
120.537723: <6> binder: 1280:9939 transaction failed 29189/-3,
size 80-0 line 3277
120.572951: <6> Process kworker/4:1H (pid: 310, stack limit =
0xffffff8010670000)
120.572958: <6> CPU: 4 PID: 310 Comm: kworker/4:1H Tainted: G S
W O 4.19.66+ #1
120.572961: <6> Hardware name: Qualcomm Technologies, Inc. Lito
MTP (DT)
120.572973: <6> Workqueue: kblockd blk_mq_run_work_fn
120.572980: <2> pstate: a0c00085 (NzCv daIf +PAN +UAO)
120.572988: <2> pc : bfqg_stats_update_idle_time+0x14/0x50
120.572992: <2> lr : bfq_dispatch_request+0x398/0x948
120.572995: <2> sp : ffffff8010673bb0
120.572998: <2> x29: ffffff8010673bc0 x28: 0000000000000001
120.573004: <2> x27: 0000001c0c5e468f x26: fffffff6aa182a80
120.573011: <2> x25: 0000000000000000 x24: fffffff6aa197a80
120.573018: <2> x23: fffffff63feb4008 x22: 0000000000000001
120.573024: <2> x21: fffffff6ab1eee08 x20: fffffff63feb4008
120.573030: <2> x19: 6b6b6b6b6b6b6ad3 x18: 0000000000000004
120.573037: <2> x17: 000000000000775f x16: 0000000000004e20
120.573043: <2> x15: fffffff5dcd83d40 x14: 2faa9e9b959219ea
120.573049: <2> x13: 0000000000000004 x12: 0000000089c9fa08
120.573056: <2> x11: 2a768cc68edc0700 x10: 0000000000000000
120.573062: <2> x9 : 0000000000000001 x8 : 6b6b6b6b6b6b6b6b
120.573069: <2> x7 : ffffff943cfb1678 x6 : 0000000000000000
120.573075: <2> x5 : 0000000000000080 x4 : 0000000000000001
120.573081: <2> x3 : 0000000000000000 x2 : 0000000000000000
120.573087: <2> x1 : 0000000000000000 x0 : 6b6b6b6b6b6b6ad3
120.573098: <6>
:
:
121.185250: <2> Call trace:
121.187772: <2> bfqg_stats_update_idle_time+0x14/0x50
121.192701: <2> bfq_dispatch_request+0x398/0x948
121.197188: <2> blk_mq_do_dispatch_sched+0x84/0x118
121.198271: <6> CPU7: update max cpu_capacity 1024
121.206504: <2> blk_mq_sched_dispatch_requests+0x130/0x190
121.211874: <2> __blk_mq_run_hw_queue+0xcc/0x148
121.216359: <2> blk_mq_run_work_fn+0x24/0x30
121.220490: <2> process_one_work+0x328/0x6b0
121.224620: <2> worker_thread+0x330/0x4d0
121.228476: <2> kthread+0x128/0x138
121.231806: <2> ret_from_fork+0x10/0x1c
121.235484: <6> Code: a9017bfd 910043fd aa0003f3 d503201f
(39728268)
121.241751: <6> ---[ end trace 83177e232bbbf1f0 ]---
121.246514: <6> Kernel panic - not syncing: Fatal exception
121.251896: <6> SMP: stopping secondary CPUs
121.255938: <6> CPU1: stopping
After doing code walk on bfq-iosched, i could see that in fn.
bfq_dispatch_request()
in_serv_queue is assigned with bfqd->in_service_queue.
and after this, we are getting the request using
__bfq_dispatch_request() fn.
In __bfq_dispatch_request() fn., we see that bfqd->in_service_queue is
being updated via. __bfq_dispatch_request --> bfq_select_queue -->
bfq_set_in_service_queue --> __bfq_set_in_service_queue.
I presumed that the in service queue time/budget got expired and it
selected a new queue from a different service group.
(If there are no more queues to be served, it could pick the new queue
from it's next group of service tree.) during this updation, we are
not updating the variable "in_serv_queue", we are passing same it to
bfq_update_dispatch_stats() fn. and resulting in above crash while
updating bfqg idle time stats.
I also seen the comments written just above
bfgq_stats_update_idle_time() fn. as
" Since the idle timer has been disabled, in_serv_queue contained some
request when __bfq_dispatch_request
was invoked above, which implies that rq was picked exactly from
in_serv_queue. Thus in_serv_queue == bfqq,
and is therefore guaranteed to exist because of the above arguments."
if it is so, can we suppose to use "bfqq" instead of "in_serv_queue"
as bfqq is extracted from request "rq" and the "rq" is updated
properly by __bfq_dispatch_request() fn. ?
Please correct me if my understanding was wrong.
Many thanks in advance !!
Best Regards,
Pradeep