Hi,
Does the host that was taken down have 12 disks in it?
Have a look at the down PGs ('18 pgs down') - I suspect this is what is causing the I/O freeze.
Is your CRUSH map set up correctly to split data over different hosts?
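If it helps, something along these lines should show which PGs are down, which OSDs they map to, and how your CRUSH rule distributes data (exact output varies a bit by release):

    ceph health detail | grep down     # lists the down+peering PGs
    ceph pg dump_stuck inactive        # shows the acting/up OSD sets for stuck PGs
    ceph osd tree                      # host/OSD layout
    ceph osd crush rule dump           # check the failure domain is "host", not "osd"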
Thanks
On Tue, Sep 13, 2016 at 11:45 AM, Daznis <daznis@xxxxxxxxx> wrote:
No, no errors about that. I had set noout before it happened, but it
still started recovery. I added
nobackfill,norebalance,norecover,noscrub,nodeep-scrub once I noticed
it started doing crazy stuff. So recovery I/O stopped, but the cluster
can't read any data. Only writes to the cache layer go through.
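(Nothing special about how the flags were set - just the usual commands, in the order described above, and I plan to unset them the same way once the node is back:)

    ceph osd set noout                    # set before the reboot
    for f in nobackfill norebalance norecover noscrub nodeep-scrub; do
        ceph osd set $f                   # added after recovery kicked in
    done
    # to revert later: ceph osd unset <flag> for each one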
    cluster cdca2074-4c91-4047-a607-faebcbc1ee17
     health HEALTH_WARN
            2225 pgs degraded
            18 pgs down
            18 pgs peering
            89 pgs stale
            2225 pgs stuck degraded
            18 pgs stuck inactive
            89 pgs stuck stale
            2257 pgs stuck unclean
            2225 pgs stuck undersized
            2225 pgs undersized
            recovery 4180820/11837906 objects degraded (35.317%)
            recovery 24016/11837906 objects misplaced (0.203%)
            12/39 in osds are down
            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
     monmap e9: 7 mons at {}
            election epoch 170, quorum 0,1,2,3,4,5,6
     osdmap e40290: 40 osds: 27 up, 39 in; 14 remapped pgs
            flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
      pgmap v39326300: 4096 pgs, 4 pools, 21455 GB data, 5780 kobjects
            42407 GB used, 75772 GB / 115 TB avail
            4180820/11837906 objects degraded (35.317%)
            24016/11837906 objects misplaced (0.203%)
                2136 active+undersized+degraded
                1837 active+clean
                  89 stale+active+undersized+degraded
                  18 down+peering
                  14 active+remapped
                   2 active+clean+scrubbing+deep
  client io 0 B/s rd, 9509 kB/s wr, 3469 op/s
On Tue, Sep 13, 2016 at 1:34 PM, M Ranga Swami Reddy
<swamireddy@xxxxxxxxx> wrote:
> Please check whether any OSD is reporting a nearfull/full error. Can you please
> share the ceph -s output?
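> e.g. (assuming Hammer or later, where "ceph osd df" is available):
>
>     ceph osd df                          # per-OSD utilisation; nearfull OSDs stand out
>     ceph health detail | grep -i full    # lists any nearfull/full warnings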
>
> Thanks
> Swami
>
> On Tue, Sep 13, 2016 at 3:54 PM, Daznis <daznis@xxxxxxxxx> wrote:
>>
>> Hello,
>>
>>
>> I have encountered a strange I/O freeze while rebooting one OSD node
>> for maintenance purposes. It was one of the 3 nodes in the entire
>> cluster. Before this, rebooting or shutting down an entire node just
>> slowed Ceph down, but did not completely freeze it.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com