> Il giorno 24 mag 2017, alle ore 15:50, Tejun Heo <tj@xxxxxxxxxx> ha scritto: > > Hello, Paolo. > > On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote: >> Exact, but even after all blkgs, as well as the cfq_group and pd, are >> gone, the children cfq_queues of the gone cfq_group continue to point >> to unexisting objects, until new cfq_set_requests are executed for >> those cfq_queues. To try to make this statement clearer, here is the >> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group, >> say cfqg: >> >> 1 cfq_set_request for a request rq of cfqq >> 2 removal of (the process associated with cfqq) from bfqg >> 3 destruction of the blkg that bfqg is associated with >> 4 destruction of the blkcg the above blkg belongs to >> 5 destruction of the pd pointed to by cfqg, and of cfqg itself >> !!!-> from now on cfqq->cfqg is a dangling reference <-!!! >> 6 execution of cfq functions, different from cfq_set_request, on cfqq >> . cfq_insert, cfq_dispatch, cfq_completed_rq, ... >> 7 execution of a new cfq_set_request for cfqq >> -> now cfqq->cfqg is again a sane pointer <- >> >> Every function executed at step 6 sees a dangling reference for >> cfqq->cfqg. >> >> My fix for caching data doesn't solve this more serious problem. >> >> Where have I been mistaken? > > Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on > the blkg, which should pin everything down till the request is done, Yes, I missed that step, sorry. Still ... > so none of the above objects can be destroyed before the request is > done. > ... the issue seems just to move to a more subtle position: cfq is ok, because it protects itself with rq lock, but blk-mq schedulers don't. So, the race that leads to the (real) crashes reported by people may actually be: 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a copy of the content of the blkg, but the rcu mechanism doesn't prevent destruction from going on 2 blkg_get gets executed on the copy of the original blkg 3 subsequent scheduler operations involving that stale blkg lead to the dangling-pointer accesses we have already discussed Could you patiently tell me whether I'm still wrong? Thanks, Paolo > Thanks. > > -- > tejun