PGs in peered state?

Hi.

While taking a host down for maintenance I hit an I/O stall, and some PGs unexpectedly went into the 'peered' state.

Cluster stats: 96/96 OSDs up and in, healthy prior to the incident; 5120 PGs; 4 hosts with 24 OSDs each. Ceph version 11.2.0, standard filestore (with LVM journals on SSD) and the default CRUSH map. All pools are size 3, min_size 2.
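For context, the per-OSD placement load works out to about 160 PG copies per OSD (a rough sketch, assuming all 5120 PGs are in size-3 pools, which may not be exactly right for every pool here):

```python
# Rough PG-per-OSD arithmetic for the cluster described above.
# Assumption: all 5120 PGs belong to size-3 pools.
pgs = 5120
size = 3
osds = 96

pg_copies = pgs * size              # total PG replicas across the cluster
copies_per_osd = pg_copies / osds   # average PG copies per OSD

print(copies_per_osd)  # 160.0
```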

Steps to reproduce the problem:
0. Cluster is healthy, HEALTH_OK
1. Set noout flag to prepare for host removal.
2. Begin taking OSDs on one of the hosts down: systemctl stop ceph-osd@$osd.
3. Notice that I/O stalls unexpectedly and, once the host is down, about 100 PGs total are in degraded+undersized+peered state.

AFAIK the 'peered' state means the PG's acting set has fewer than min_size replicas, so client I/O to it is blocked — so there is something strange going on. Since we have 4 hosts and are using the default CRUSH map (hosts as the failure domain), how is it possible that after taking one host (or even just some OSDs on that host) down, some PGs in the cluster are left with fewer than 2 copies?
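To spell out my understanding (a simplified sketch of the availability rule, not Ceph's actual code): a PG serves I/O only while its acting set has at least min_size members, so with size=3/min_size=2 an acting set of two OSDs should stay active, and only a single-OSD acting set should go 'peered':

```python
# Simplified sketch of how I understand PG availability; not Ceph's real logic.
def pg_status(acting_set, size=3, min_size=2):
    """Return a rough status string for a PG given its acting set."""
    n = len(acting_set)
    if n >= size:
        return "active+clean"
    if n >= min_size:
        return "active+undersized+degraded"   # still serving I/O
    return "undersized+degraded+peered"       # below min_size: I/O blocked

# With hosts as failure domains, losing 1 of 4 hosts should leave 2 replicas:
print(pg_status([62, 76]))   # active+undersized+degraded
# Yet the dump shows acting sets with a single OSD, e.g. [2]:
print(pg_status([2]))        # undersized+degraded+peered
```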

Here's a snippet of 'ceph pg dump' output (filtered to the peered PGs) from when this happened. Sadly I don't have any more information yet...

# ceph pg dump|grep peered
dumped all in format plain
3.c80       173                  0      346       692       0   715341824 10041    10041 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.319222  12124'104727   12409:62777 [62,76,44]         62        [2]              2    1642'32485 2017-07-18 22:57:06.263727        1008'135 2017-07-09 22:34:40.893182 
3.204       184                  0      368       649       0   769544192 10065    10065 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.334905   12124'13665   12409:37345  [75,52,1]         75        [2]              2     1375'4316 2017-07-18 00:10:27.601548       1371'2740 2017-07-12 07:48:34.953831 
11.19     25525                  0    51050     78652       0 14829768529 10059    10059 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.311612  12124'156267  12409:137128 [56,26,14]         56       [18]             18    1375'28148 2017-07-17 20:27:04.916079             0'0 2017-07-10 16:12:49.270606
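Pulling the up/acting sets out of the dump lines above (values copied by hand; the check below is my own, not a Ceph tool): each stuck PG has an up set of three OSDs but an acting set of just one, which is below min_size=2 — hence 'peered' and blocked, while remapped+backfill_wait suggests backfill to the new up set hadn't run yet:

```python
# Up/acting sets copied by hand from the 'ceph pg dump' lines above.
stuck_pgs = {
    "3.c80": {"up": [62, 76, 44], "acting": [2]},
    "3.204": {"up": [75, 52, 1],  "acting": [2]},
    "11.19": {"up": [56, 26, 14], "acting": [18]},
}

min_size = 2
for pgid, sets in stuck_pgs.items():
    blocked = len(sets["acting"]) < min_size
    state = "peered (I/O blocked)" if blocked else "active"
    print(pgid, "acting", sets["acting"], "->", state)
```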

--
Sincerely,
Yuri Gorshkov
Systems Engineer

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
