Re: Cephfs IO halt on Node failure

Hello,

> Sorry for the late reply.
> I have pasted the crush map at the URL below: https://pastebin.com/ASPpY2VB
> This is my osd tree output, and the issue only occurs when I use it with a
> file layout.

Could you send the output of "ceph osd pool ls detail", please?
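
For reference, a minimal sketch of the commands that produce the requested
output (nothing cluster-specific assumed):

    # list every pool with its size, min_size, crush_rule and pg_num
    ceph osd pool ls detail

    # the same information as JSON, easier to filter for a single pool
    ceph osd pool ls detail -f json-pretty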

Yoann

> ID CLASS WEIGHT    TYPE NAME          STATUS REWEIGHT PRI-AFF
> -1       327.48047 root default
> -3       109.16016     host strgsrv01
>  0   hdd   5.45799         osd.0          up  1.00000 1.00000
>  2   hdd   5.45799         osd.2          up  1.00000 1.00000
>  3   hdd   5.45799         osd.3          up  1.00000 1.00000
>  4   hdd   5.45799         osd.4          up  1.00000 1.00000
>  5   hdd   5.45799         osd.5          up  1.00000 1.00000
>  6   hdd   5.45799         osd.6          up  1.00000 1.00000
>  7   hdd   5.45799         osd.7          up  1.00000 1.00000
> 19   hdd   5.45799         osd.19         up  1.00000 1.00000
> 20   hdd   5.45799         osd.20         up  1.00000 1.00000
> 21   hdd   5.45799         osd.21         up  1.00000 1.00000
> 22   hdd   5.45799         osd.22         up  1.00000 1.00000
> 23   hdd   5.45799         osd.23         up  1.00000 1.00000
> -5       109.16016     host strgsrv02
>  1   hdd   5.45799         osd.1          up  1.00000 1.00000
>  8   hdd   5.45799         osd.8          up  1.00000 1.00000
>  9   hdd   5.45799         osd.9          up  1.00000 1.00000
> 10   hdd   5.45799         osd.10         up  1.00000 1.00000
> 11   hdd   5.45799         osd.11         up  1.00000 1.00000
> 12   hdd   5.45799         osd.12         up  1.00000 1.00000
> 24   hdd   5.45799         osd.24         up  1.00000 1.00000
> 25   hdd   5.45799         osd.25         up  1.00000 1.00000
> 26   hdd   5.45799         osd.26         up  1.00000 1.00000
> 27   hdd   5.45799         osd.27         up  1.00000 1.00000
> 28   hdd   5.45799         osd.28         up  1.00000 1.00000
> 29   hdd   5.45799         osd.29         up  1.00000 1.00000
> -7       109.16016     host strgsrv03
> 13   hdd   5.45799         osd.13         up  1.00000 1.00000
> 14   hdd   5.45799         osd.14         up  1.00000 1.00000
> 15   hdd   5.45799         osd.15         up  1.00000 1.00000
> 16   hdd   5.45799         osd.16         up  1.00000 1.00000
> 17   hdd   5.45799         osd.17         up  1.00000 1.00000
> 18   hdd   5.45799         osd.18         up  1.00000 1.00000
> 30   hdd   5.45799         osd.30         up  1.00000 1.00000
> 31   hdd   5.45799         osd.31         up  1.00000 1.00000
> 32   hdd   5.45799         osd.32         up  1.00000 1.00000
> 33   hdd   5.45799         osd.33         up  1.00000 1.00000
> 34   hdd   5.45799         osd.34         up  1.00000 1.00000
> 35   hdd   5.45799         osd.35         up  1.00000 1.00000
> 
> On Tue, May 19, 2020 at 12:16 PM Eugen Block <eblock@xxxxxx> wrote:
> 
>> Was that a typo, and did you mean you changed min_size to 1? An I/O pause
>> with min_size 1 and size 2 is unexpected. Can you share more details, like
>> your crush map and your osd tree?
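
As a side note, one way to capture both pieces of information, assuming
crushtool is available on the node:

    # export and decompile the crush map into a readable text file
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # plain-text view of the osd tree
    ceph osd tree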
>>
>>
>> Zitat von Amudhan P <amudhan83@xxxxxxxxx>:
>>
>>> The behaviour is the same even after setting min_size 2.
>>>
>>> On Mon 18 May, 2020, 12:34 PM Eugen Block, <eblock@xxxxxx> wrote:
>>>
>>>> If your pool has min_size 2 and size 2 (always a bad idea), it will
>>>> pause IO in case of a failure until recovery has finished. So the
>>>> described behaviour is expected.
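
A quick sketch of how to check and adjust this per pool; the pool name
cephfs_data is only an example:

    # current replication settings of a pool
    ceph osd pool get cephfs_data size
    ceph osd pool get cephfs_data min_size

    # with size 3, min_size 2 still serves IO through a single node failure
    ceph osd pool set cephfs_data min_size 2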
>>>>
>>>>
>>>> Zitat von Amudhan P <amudhan83@xxxxxxxxx>:
>>>>
>>>>> Hi,
>>>>>
>>>>> The crush rule is "replicated" and min_size is 2, actually. I am trying
>>>>> to test multiple volume configs in a single filesystem using file
>>>>> layouts.
>>>>>
>>>>> I have created a metadata pool with rep 3 (min_size 2, replicated crush
>>>>> rule) and a data pool with rep 3 (min_size 2, replicated crush rule). I
>>>>> have also created multiple additional pools (replica 2, EC 2+1 and
>>>>> EC 4+2) and added them to the filesystem.
>>>>>
>>>>> Using file layouts I have assigned a different data pool to each folder,
>>>>> so I can test different configs in the same filesystem. All data pools
>>>>> have min_size set to tolerate a single node failure.
>>>>>
>>>>> A single node failure is handled properly when there is only the
>>>>> metadata pool and one data pool (rep 3).
>>>>>
>>>>> After adding the additional data pools to the fs, the single node
>>>>> failure scenario no longer works.
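
For context, a setup like the one described above is usually built with
commands along these lines; the filesystem, pool and directory names here are
only illustrative:

    # attach an additional data pool to the filesystem
    ceph fs add_data_pool cephfs cephfs_data_ec21

    # an EC pool used as a CephFS data pool needs overwrites enabled
    ceph osd pool set cephfs_data_ec21 allow_ec_overwrites true

    # point a directory at that pool via the file layout xattr
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ec21 /mnt/cephfs/ec21

    # verify the layout
    getfattr -n ceph.dir.layout /mnt/cephfs/ec21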
>>>>>
>>>>> regards
>>>>> Amudhan P
>>>>>
>>>>> On Sun, May 17, 2020 at 1:29 AM Eugen Block <eblock@xxxxxx> wrote:
>>>>>
>>>>>> What’s your pool configuration wrt min_size and crush rules?
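
A minimal way to answer the crush-rule part of that question, assuming a pool
called cephfs_data (adjust the name as needed):

    # which crush rule a given pool uses
    ceph osd pool get cephfs_data crush_rule

    # full definition of every crush rule in the cluster
    ceph osd crush rule dump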
>>>>>>
>>>>>>
>>>>>> Zitat von Amudhan P <amudhan83@xxxxxxxxx>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using a Ceph Nautilus cluster with the configuration below.
>>>>>>>
>>>>>>> 3 nodes (Ubuntu 18.04), each with 12 OSDs; mds, mon and mgr are
>>>>>>> running in shared mode.
>>>>>>>
>>>>>>> The client is mounted through the ceph kernel client.
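
For completeness, a typical kernel-client mount looks roughly like this; the
monitor address and secret file are placeholders:

    # kernel CephFS mount (Nautilus-era syntax)
    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret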
>>>>>>>
>>>>>>> I was trying to emulate a node failure while a write and a read were
>>>>>>> going on against a (replica 2) pool.
>>>>>>>
>>>>>>> I was expecting reads and writes to continue after a small pause due
>>>>>>> to the node failure, but IO halts and never resumes until the failed
>>>>>>> node is back up.
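
A simple way to reproduce and observe such a stall, written as a rough sketch
(the mount point and file name are arbitrary):

    # continuous direct writes; the timestamps make any stall obvious
    while true; do
        dd if=/dev/zero of=/mnt/cephfs/failtest.bin bs=4M count=256 oflag=direct
        date
    done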
>>>>>>>
>>>>>>> I remember testing the same scenario before on Ceph Mimic, where IO
>>>>>>> continued after a small pause.
>>>>>>>
>>>>>>> regards
>>>>>>> Amudhan P


-- 
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx