Hi All,
Is there a known procedure to debug the PG state in case of problems like this?
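For context, these are the standard per-PG introspection commands I know of (just a sketch of what I have to work with, using PG 3.c80 from the dump below as an example, in case the procedure builds on them):

ceph health detail           # which PGs are unhealthy and why
ceph pg dump_stuck unclean   # PGs stuck in a non-clean state
ceph pg 3.c80 query          # detailed peering state and history for one PG
ceph pg map 3.c80            # the OSDs CRUSH currently maps this PG to
ceph osd tree                # up/down status and weight of every OSD, by host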
Best regards,
Yuri.
2017-08-28 14:05 GMT+03:00 Yuri Gorshkov <ygorshkov@xxxxxxxxxxxx>:
Hi.

When trying to take down a host for maintenance I encountered an I/O stall, along with some PGs unexpectedly marked 'peered'.

Cluster stats: 96/96 OSDs, healthy prior to the incident, 5120 PGs, 4 hosts of 24 OSDs each. Ceph version 11.2.0, using standard filestore (with LVM journals on SSD) and the default CRUSH map. All pools are size 3, min_size 2.

Steps to reproduce the problem:

0. Cluster is healthy, HEALTH_OK.
1. Set the noout flag to prepare for host removal.
2. Begin taking the OSDs on one of the hosts down: systemctl stop ceph-osd@$osd (the full sequence is sketched below, after the dump).
3. Notice that I/O has stalled unexpectedly and about 100 PGs total are in degraded+undersized+peered state while the host is down.

AFAIK the 'peered' state means the PG has fewer than min_size replicas available and so cannot serve I/O, so there is something strange going on. Since we have 4 hosts and are using the default CRUSH map, how is it possible that after taking one host (or even just some OSDs on that host) down, some PGs in the cluster are left with fewer than 2 copies?

Here's the snippet of 'ceph pg dump' from when this happened. Sadly I don't have any more information yet...

# ceph pg dump | grep peered
dumped all in format plain
3.c80 173 0 346 692 0 715341824 10041 10041 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.319222 12124'104727 12409:62777 [62,76,44] 62 [2] 2 1642'32485 2017-07-18 22:57:06.263727 1008'135 2017-07-09 22:34:40.893182
3.204 184 0 368 649 0 769544192 10065 10065 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.334905 12124'13665 12409:37345 [75,52,1] 75 [2] 2 1375'4316 2017-07-18 00:10:27.601548 1371'2740 2017-07-12 07:48:34.953831
11.19 25525 0 51050 78652 0 14829768529 10059 10059 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.311612 12124'156267 12409:137128 [56,26,14] 56 [18] 18 1375'28148 2017-07-17 20:27:04.916079 0'0 2017-07-10 16:12:49.270606
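For completeness, here is the sequence from the steps above as shell commands (a minimal sketch; $OSD_IDS is a placeholder for the IDs of the 24 OSDs on the host being drained):

# 1. keep stopped OSDs from being marked out, so no rebalancing starts
ceph osd set noout

# 2. on the host being drained, stop its OSD daemons
#    ($OSD_IDS is a placeholder; the actual IDs depend on the host)
for osd in $OSD_IDS; do
    systemctl stop ceph-osd@$osd
done

# 3. watch cluster events; this is where the peered PGs showed up
ceph -w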
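And in case the answer turns out to be a configuration issue, these are the commands I know of to double-check the replication settings and the CRUSH rule (a sketch; it iterates over whatever 'ceph osd pool ls' returns):

# confirm size/min_size on every pool
for pool in $(ceph osd pool ls); do
    echo "== $pool =="
    ceph osd pool get "$pool" size
    ceph osd pool get "$pool" min_size
done

# inspect the CRUSH rule(s) in use; replicas should be separated across hosts
ceph osd crush rule dump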
Sincerely,
Yuri Gorshkov
Systems Engineer