On Fri, 10 Jun 2016 16:51:07 +0300 George Shuklin wrote:

> Hello.
>
> I'm doing a small experimental setup.

That's likely your problem. Aside from my response below, really small clusters can wind up in spots where CRUSH (or at least certain versions of it) can't place things correctly.

> I have two hosts with a few OSDs each. One OSD has been put down
> intentionally, but even though the second (alive) OSD is on a different
> host, I see that all IO (rbd, and even rados get) has hung for a long
> time (more than 30 minutes already).

Downing that OSD effectively killed one host, looking at your tree below. A two-node cluster doesn't work with the default replication size of 3 to begin with, so I assume you changed things to allow a replication size of 2. Now, what min_size do your pools have? If it's not 1 (the default is 2), then this is expected.

Christian

> My configuration:
>
>  -9 2.00000 root ssd
> -11 1.00000     host ssd-pp7
>   9 1.00000         osd.9  down       0 1.00000
> -12 1.00000     host ssd-pp11
>   1 0.25000         osd.1  up   1.00000 1.00000
>   2 0.25000         osd.2  up   1.00000 1.00000
>   3 0.25000         osd.3  up   1.00000 1.00000
>  11 0.25000         osd.11 up   1.00000 1.00000
>
> The pg map shows that the acting OSD was moved from '9' to the others.
>
> ceph health detail
> HEALTH_ERR 5 pgs are stuck inactive for more than 300 seconds; 5 pgs degraded; 5 pgs stuck inactive; 8 pgs stuck unclean; 5 pgs undersized; 53 requests are blocked > 32 sec; 2 osds have slow requests; recovery 2538/8200 objects degraded (30.951%); recovery 1562/8200 objects misplaced (19.049%); too few PGs per OSD (1 < min 30)
>
> pg 26.0 is stuck inactive for 1429.756078, current state undersized+degraded+peered, last acting [1]
> pg 26.7 is stuck inactive for 1429.751221, current state undersized+degraded+peered, last acting [2]
> pg 26.2 is stuck inactive for 1429.749713, current state undersized+degraded+peered, last acting [1]
> pg 26.6 is stuck inactive for 1429.763065, current state undersized+degraded+peered, last acting [2]
> pg 26.5 is stuck inactive for 1429.754325, current state undersized+degraded+peered, last acting [1]
> pg 26.0 is stuck unclean for 1429.756101, current state undersized+degraded+peered, last acting [1]
> pg 26.1 is stuck unclean for 1429.778469, current state active+remapped, last acting [11,3]
> pg 26.2 is stuck unclean for 1429.749733, current state undersized+degraded+peered, last acting [1]
> pg 26.3 is stuck unclean for 1429.796471, current state active+remapped, last acting [1,2]
> pg 26.4 is stuck unclean for 1429.762425, current state active+remapped, last acting [1,3]
> pg 26.5 is stuck unclean for 1429.754349, current state undersized+degraded+peered, last acting [1]
> pg 26.6 is stuck unclean for 1429.763094, current state undersized+degraded+peered, last acting [2]
> pg 26.7 is stuck unclean for 1429.751259, current state undersized+degraded+peered, last acting [2]
>
> root@pp11:~# ceph osd pool stats ssd
> pool ssd id 26
>   nothing is going on
>
> The mons are in quorum (all up).
>
> osd dump:
>
> osd.9 down out weight 0 up_from 1055 up_thru 1085 down_at 1089 last_clean_interval [1017,1052) 78.140.137.210:6800/29731 78.140.137.210:6801/29731 78.140.137.210:6802/29731 78.140.137.210:6803/29731 autoout,exists 2fc49cd5-e48c-4189-a67b-229d09378d1c
>
> What should normally happen in this situation, and why is it not happening?
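
To answer that last question directly: with only one of the two required replicas available, those PGs sit in undersized+degraded+peered and block IO until a second copy exists again, which is exactly what min_size 2 is meant to enforce. A quick way to check, and (if you accept running on a single copy for a while) to unblock IO, would be something along these lines, using the ssd pool from your output:

  ceph osd pool get ssd size       # replication factor of the pool
  ceph osd pool get ssd min_size   # minimum replicas needed to serve IO

  # If min_size is 2, lowering it to 1 lets the peered PGs go active
  # again on the single surviving copy (risky until osd.9 is back):
  ceph osd pool set ssd min_size 1

Once osd.9 (or a replacement on the second host) is back in and the pool has recovered, set min_size back to 2.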
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/