Yes, that one has 2 more OSDs on it.

root default {
        id -1           # do not change unnecessarily
        # weight 116.480
        alg straw
        hash 0  # rjenkins1
        item OSD-1 weight 36.400
        item OSD-2 weight 36.400
        item OSD-3 weight 43.680
}

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

On Tue, Sep 13, 2016 at 1:51 PM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
> Hi,
>
> The host that was taken down has 12 disks in it?
>
> Have a look at the down PGs ('18 pgs down') - I suspect this is what is
> causing the I/O freeze.
>
> Is your CRUSH map set up correctly to split data over different hosts?
>
> Thanks
>
> On Tue, Sep 13, 2016 at 11:45 AM, Daznis <daznis@xxxxxxxxx> wrote:
>>
>> No, no errors about that. I had set noout before it happened, but it
>> still started recovery. I added
>> nobackfill,norebalance,norecover,noscrub,nodeep-scrub once I noticed
>> it started doing crazy stuff. So the recovery I/O stopped, but the
>> cluster can't read any data - only writes to the cache layer get through.
>>
>>     cluster cdca2074-4c91-4047-a607-faebcbc1ee17
>>      health HEALTH_WARN
>>             2225 pgs degraded
>>             18 pgs down
>>             18 pgs peering
>>             89 pgs stale
>>             2225 pgs stuck degraded
>>             18 pgs stuck inactive
>>             89 pgs stuck stale
>>             2257 pgs stuck unclean
>>             2225 pgs stuck undersized
>>             2225 pgs undersized
>>             recovery 4180820/11837906 objects degraded (35.317%)
>>             recovery 24016/11837906 objects misplaced (0.203%)
>>             12/39 in osds are down
>>             noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
>>      monmap e9: 7 mons at {}
>>             election epoch 170, quorum 0,1,2,3,4,5,6
>>      osdmap e40290: 40 osds: 27 up, 39 in; 14 remapped pgs
>>             flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
>>       pgmap v39326300: 4096 pgs, 4 pools, 21455 GB data, 5780 kobjects
>>             42407 GB used, 75772 GB / 115 TB avail
>>             4180820/11837906 objects degraded (35.317%)
>>             24016/11837906 objects misplaced (0.203%)
>>                 2136 active+undersized+degraded
>>                 1837 active+clean
>>                   89 stale+active+undersized+degraded
>>                   18 down+peering
>>                   14 active+remapped
>>                    2 active+clean+scrubbing+deep
>>   client io 0 B/s rd, 9509 kB/s wr, 3469 op/s
>>
>> On Tue, Sep 13, 2016 at 1:34 PM, M Ranga Swami Reddy
>> <swamireddy@xxxxxxxxx> wrote:
>> > Please check whether any OSD is reporting a nearfull error. Can you
>> > please share the ceph -s output?
>> >
>> > Thanks
>> > Swami
>> >
>> > On Tue, Sep 13, 2016 at 3:54 PM, Daznis <daznis@xxxxxxxxx> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I have encountered a strange I/O freeze while rebooting one OSD node
>> >> for maintenance. It was one of the 3 nodes in the entire cluster.
>> >> Before this, rebooting or shutting down an entire node just slowed
>> >> Ceph down, but never completely froze it.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
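
For anyone else hitting this, the two checks suggested above - whether the CRUSH rule really splits replicas across hosts, and what state the down PGs are in - can be run roughly as follows. This is a minimal sketch; the file names, rule number, replica count and <pgid> below are placeholders, not values taken from this cluster.

    # Confirm the hierarchy: each host bucket should hold only its own OSDs
    ceph osd tree

    # Decompile the CRUSH map for review
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Dry-run the rule: every mapping should pick OSDs from different hosts
    # (--rule 0 and --num-rep 2 are assumptions; use your pool's actual size)
    crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings | head

    # Inspect the PGs that are down/inactive and blocking reads
    ceph health detail | grep down
    ceph pg dump_stuck inactive
    ceph pg <pgid> query    # replace <pgid> with one of the down PGs

If a mapping in the crushtool output ever lists two OSDs from the same host, the rule is not separating replicas the way the thread assumes it should.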