Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

Paolo Valente <paolo.valente@xxxxxxxxxx> · Thu, 25 May 2017 09:10:59 +0200

> Il giorno 24 mag 2017, alle ore 17:47, Tejun Heo <tj@xxxxxxxxxx> ha scritto:
> 
> Hello,
> 
> On Wed, May 24, 2017 at 05:43:18PM +0100, Paolo Valente wrote:
>>> so none of the above objects can be destroyed before the request is
>>> done.
>> 
>> ... the issue seems just to move to a more subtle position: cfq is ok,
>> because it protects itself with rq lock, but blk-mq schedulers don't.
>> So, the race that leads to the (real) crashes reported by people may
>> actually be:
> 
> Oh, I was just thinking about !mq paths the whole time.
> 
>> 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a
>> copy of the content of the blkg, but the rcu mechanism doesn't prevent
>> destruction from going on
>> 2 blkg_get gets executed on the copy of the original blkg
> 
> So, we can't do that.  We should look up and bump the ref and use the
> original copy.  We probably should switch blkgs to use percpu-refs.
> 

Ok.  So, just to better understand: as of now, i.e., before you make
the changes you are proposing, the address returned by blkg_lookup can
be used safely only if one both invokes blkg_lookup and gets a
reference, while holding the rq lock all the time.  But then, before
the changes you propose, what is the remaining role of rcu protection
here?  Are there places where the value returned by blkg_lookup is
actually used safely without getting a reference the returned blkg?

Anyway, I'm willing to help with your proposal, if you think I can be
of any help at some point.  In this respect, consider that I'm not an
expert of percpu-refs either.

Finally, I guess that the general fix you have in mind may not be
ready shortly.  So, I'll proceed with my temporary fix for the moment.
In particular, I will
1) fix the typo reported by Jens;
2) add a note stating that this is a temporary fix;
3) if needed, modify commit log and comments in the diffs, to better
    describe the general problem, and the actual critical race.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun