Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()

On 2017/1/26 8:04 AM, NeilBrown wrote:
> On Wed, Jan 25 2017, Shaohua Li wrote:
> 
>> On Wed, Jan 25, 2017 at 07:15:43PM +0800, colyli@xxxxxxx wrote:
>>> Recently I received a report that on a Linux v3.0 based kernel, hot
>>> adding a disk to an md linear device causes a kernel crash in
>>> linear_congested(). From the crash image analysis, I found that in
>>> linear_congested(), mddev->raid_disks contains the value N, but
>>> conf->disks[] only has N-1 pointers available. Dereferencing the
>>> resulting NULL pointer then crashes the kernel.
>>> 
>>> There is a race between linear_add() and linear_congested(), and
>>> the RCU code used in these two functions cannot avoid it.
>>> Since Linux v4.0 the RCU code has been replaced by
>>> mddev_suspend().  After checking the upstream code, it seems
>>> linear_congested() is not called in the generic_make_request() code
>>> path, so mddev_suspend() cannot prevent it from being called.
>>> The possible race still exists.
>>> 
>>> Here I explain how the race still exists in the current code.  On
>>> a machine with many CPUs, linear_add() may be called on one CPU to
>>> add a hard disk to an md linear device while, at the same time on
>>> another CPU, linear_congested() is called to detect whether this
>>> md linear device is congested before issuing an I/O request
>>> onto it.
>>> 
>>> Here is a possible execution sequence that demonstrates how
>>> the race happens:
>>> 
>>> seq    linear_add()                linear_congested()
>>> 0                                  conf=mddev->private
>>> 1      oldconf=mddev->private
>>> 2      mddev->raid_disks++
>>> 3                                  for (i=0; i<mddev->raid_disks; i++)
>>> 4                                      bdev_get_queue(conf->disks[i].rdev->bdev)
>>> 5      mddev->private=newconf
>> 
>> Good catch, this makes a lot of sense. However, this looks like
>> an incomplete fix. Step 0 gets the old conf, and after step 5
>> linear_add() frees the old conf, so linear_congested() may end up
>> using the freed old conf. I think this is even more likely to
>> happen. The easiest fix may be to use rcu_read_lock() in
>> linear_congested() and free the old conf in an RCU callback.
> 
> We used to use kfree_rcu() but removed it in
> 
> Commit: 3be260cc18f8 ("md/linear: remove rcu protections in favour
> of suspend/resume")
> 
> when we changed to suspend/resume the device.  That stops all IO,
> but doesn't stop the ->congested call.
> 
> So we probably should re-introduce kfree_rcu() to free oldconf. It
> might also be good to store a copy of raid_disks in linear_conf,
> like we do with r5conf, to ensure we never use inconsistent
> ->raid_disks and ->disks[]

Hi Neil,

I just sent out a v2 patch which adds the RCU code back. I tested it on my
small server and it survived.

One thing I want to confirm here is the memory barrier in linear_add():

219         mddev_suspend(mddev);
220         oldconf = rcu_dereference(mddev->private);
221         rcu_assign_pointer(mddev->private, newconf);
222         smp_mb();
223         mddev->raid_disks++;
224         md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
225         set_capacity(mddev->gendisk, mddev->array_sectors);
226         mddev_resume(mddev);
227         revalidate_disk(mddev->gendisk);
228         call_rcu(&oldconf->rcu, free_conf);

At LINE 222 I added an smp_mb(). From Documentation/memory-barriers.txt,
my understanding is that I need an smp_wmb() or smp_mb() here. I see that
other places all use smp_mb(), so I chose the stronger one, smp_mb().

But Documentation/whatisRCU.txt says about rcu_assign_pointer():
"This function returns the new value, and also executes any
memory-barrier instructions required for a given CPU architecture."
So it seems the smp_mb() at LINE 222 is unnecessary.

In the v2 patch I kept the smp_mb(), although I think it is unnecessary. I
will remove it if you or Shaohua can confirm that it is unnecessary.


Another question: I tried to look at the r5conf code, but I still have
no idea how to store a copy of raid_disks in linear_conf the way r5conf
does. Could you please give me more hints?

Thanks.

Coly