Ok, thanks.

Xiao Ni <xni@xxxxxxxxxx> wrote on Thu, Jan 31, 2019 at 14:25:
>
> Yes, failfast is used to fix the problem you described. Without
> failfast, the active disk can't be removed until all pending I/O
> finishes. If there is no pending I/O, it can be removed immediately.
>
> Thanks
> Xiao
>
> On 01/30/2019 10:14 PM, 李春 wrote:
> > I have read the description of the failfast feature. Judging from the
> > behavior, this may not be a failfast problem: when there is no I/O
> > pressure, the disk is automatically removed from the md array after
> > the disk export is stopped on the storage node. However, under
> > continuous I/O pressure the disk is not removed automatically; it is
> > removed immediately once the I/O pressure stops.
> >
> > Xiao Ni <xni@xxxxxxxxxx> wrote on Wed, Jan 30, 2019 at 17:15:
> >>
> >> On 01/30/2019 03:25 PM, Jack Wang wrote:
> >>> 李春 <pickup112@xxxxxxxxx> wrote on Wed, Jan 30, 2019 at 07:08:
> >>>> # Description of problem:
> >>>> We exported a disk from a storage node over two network paths via
> >>>> iSCSI, merged the two paths into one device with multipath, and
> >>>> built a RAID1 from that device and a local disk with mdadm.
> >>>> However, when the storage node serving the iSCSI disk reboots,
> >>>> the RAID1 array does not automatically eject the failed disk
> >>>> while there is I/O pressure.
> >>>>
> >>>> # Version-Release number of selected component (if applicable):
> >>>> vermagic: 2.6.32-573.el6.x86_64 SMP mod_unload modversions
> >>>> srcversion: 39AAB97325332236F2FFCA9
> >>>>
> >>>> # How reproducible:
> >>>> always
> >>>>
> >>>> # Steps to Reproduce:
> >>>> 1. export a disk from the storage node
> >>>> 2. log in to the disk on another node and merge the paths with multipath
> >>>> 3. assemble a RAID1 from a local disk and the multipath device with mdadm
> >>>> 4. reboot the storage node
> >>>>
> >>>> # Actual results:
> >>>> * the multipath disk is not ejected from the RAID1 array under fio pressure
> >>>> * the multipath disk is ejected immediately once the fio pressure stops
> >>>>
> >>>> # Expected results:
> >>>> * the multipath disk is ejected from the RAID1 array immediately, even under fio pressure
> >>>>
> >>>> # Additional info:
> >>>> We have run the following tests:
> >>>> * On RHEL 6.7 with kernel 2.6.32-573.el6.x86_64, mdadm's RAID1
> >>>> ejects the failed disk after 5 seconds when there is no I/O pressure.
> >>>> * On RHEL 6.7 with kernel 2.6.32-573.el6.x86_64, under I/O pressure,
> >>>> mdadm's RAID1 does not eject the failed disk; the disk is removed
> >>>> only after the I/O pressure stops.
> >>>> * On RHEL 7.4 with kernel 3.10.0-693.el7.x86_64, mdadm's RAID1
> >>>> ejects the failed disk after 5 seconds when there is no I/O pressure.
> >>>> * On RHEL 7.4 with kernel 3.10.0-693.el7.x86_64, mdadm's RAID1
> >>>> ejects the failed disk after 5 seconds even under I/O pressure.
> >>>>
> >>>> Thanks for your help.
> >>> Sounds like you want the failfast feature from upstream; I'm not
> >>> sure whether RH has backported it into their kernel.
> >> Thanks for the report and analysis.
> >> RHEL 6 is in a phase where only bug fixes are recommended, so some
> >> features are not backported. I'll try to backport this to RHEL 6.
> >>
> >> Regards
> >> Xiao

--
李春 Pickup Li
Chief Architect, Product R&D
www.woqutech.com
Hangzhou WOQU Technology Co., Ltd.
Room 1004, Building A, D-innovation Center, No. 1190, Bin'an Road, Hangzhou 310052
T: (0571) 87770835  M: (86) 18989451982  F: (0571) 86805750
E: pickup.li@xxxxxxxxxxxx
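
For reference, the reproduction steps in the quoted report can be sketched
roughly as follows on the initiator node. This is a minimal sketch, not the
reporter's exact setup: the target address, IQN, second path address, and the
device names (/dev/sdb, /dev/mapper/mpatha, /dev/md0) are placeholders, and
the exact invocations will vary with the distribution.

    # 1. discover and log in to the iSCSI target exported by the storage node
    iscsiadm -m discovery -t sendtargets -p 192.0.2.10
    iscsiadm -m node -T iqn.2019-01.example:storage.disk1 -p 192.0.2.10 --login
    # repeat the login over the second network path, e.g. -p 192.0.2.11

    # 2. let multipathd merge the two paths into one device
    multipath -ll    # should show one map, e.g. /dev/mapper/mpatha, with two paths

    # 3. build a RAID1 from a local disk and the multipath device
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/mapper/mpatha

    # 4. generate I/O pressure, then reboot the storage node and watch the array
    fio --name=press --filename=/dev/md0 --rw=randwrite --bs=4k --iodepth=16 \
        --ioengine=libaio --direct=1 --runtime=600 --time_based
    watch cat /proc/mdstat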
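
For anyone hitting the same behavior on a newer kernel: the upstream failfast
feature Jack and Xiao mention (merged upstream around Linux 4.10 for RAID1 and
RAID10) is set per member device, so I/O to a failing leg is not retried
indefinitely and md can kick the device even while I/O is in flight. A minimal
sketch, assuming a sufficiently recent mdadm and the placeholder device names
from the sketch above:

    # set failfast at creation time; the flag applies to the devices listed after it
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sdb --failfast /dev/mapper/mpatha

    # or set it when (re-)adding a device to an existing array
    mdadm /dev/md0 --add --failfast /dev/mapper/mpatha

    # the flag can also be toggled at runtime through sysfs
    # (the dev-dm-0 name here is illustrative; use the entry for your member device)
    echo failfast > /sys/block/md0/md/dev-dm-0/state

Note that failfast is only appropriate on devices like this multipath leg,
where a redundant copy exists and a transient path failure should fail the
member quickly rather than stall array I/O.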