On Fri, Jul 19, 2013 at 3:44 PM, Pawel Veselov <pawel.veselov@xxxxxxxxx> wrote:
> Hi.
>
> I'm trying to understand the reason behind some of my unclean pages, after
> moving some OSDs around. Any help would be greatly appreciated. I'm sure we
> are missing something, but can't quite figure out what.
>
> [root@ip-10-16-43-12 ec2-user]# ceph health detail
> HEALTH_WARN 29 pgs degraded; 68 pgs stuck unclean; recovery 4071/217370
> degraded (1.873%)
> pg 0.50 is stuck unclean since forever, current state active+degraded, last
> acting [2]
> ...
> pg 2.4b is stuck unclean for 836.989336, current state active+remapped, last
> acting [3,2]
> ...
> pg 0.6 is active+degraded, acting [3]
>
> These are distinct examples of problems. There are a total of 676 page groups.
> Query shows pretty much the same on them: .

Nit: PG = "placement group". :)

Anyway, the problem appears to be that you've got two OSDs total, buried
under a bit of a hierarchy (rack and host, each), and the pseudo-random
nature of CRUSH is just having trouble reaching both of them when mapping
all the PGs. If you aren't using the kernel client (or are using a very new
one, >= 3.9), you can run "ceph crush set tunables optimal" (see
http://ceph.com/docs/master/rados/operations/crush-map/#tunables) and this
should all get better thanks to some improved settings we worked out last
year.

On Fri, Jul 19, 2013 at 4:20 PM, Mike Lowe <j.michael.lowe@xxxxxxxxx> wrote:
> I'm by no means an expert, but from what I understand you do need to stick
> to numbering from zero if you want things to work out in the long term.

This is good general advice, but it wouldn't cause the kinds of issues seen
here; non-contiguous numbering is really only a problem if the list of OSD
numbers is very sparse.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
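
For reference, a rough sketch of the tunables workflow described above,
using the profile-based command documented at the link (exact syntax may
differ on older releases, so check the docs for your version):

    # Export and decompile the current CRUSH map to inspect any tunable lines
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    grep tunable crushmap.txt

    # Switch to the pre-defined "optimal" tunables profile; make sure all
    # clients (especially kernel clients) are new enough to understand it first
    ceph osd crush tunables optimal

Applying new tunables recalculates the PG mappings, so expect some data
movement while the cluster rebalances.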