Hi,

I'm in the middle of expanding a Ceph cluster, and while watching 'ceph -s' I suddenly saw a bunch of Placement Groups go undersized. My first thought was that one or more OSDs had failed, but none did. So I checked and found these Placement Groups undersized (columns: PG, state, up set, up primary, acting set, acting primary):

11.3b54 active+undersized+degraded+remapped+backfill_wait [1795,639,1422] 1795 [1795,639] 1795
11.362f active+undersized+degraded+remapped+backfill_wait [1431,1134,2217] 1431 [1134,1468] 1134
11.3e31 active+undersized+degraded+remapped+backfill_wait [1451,1391,1906] 1451 [1906,2053] 1906
11.50c  active+undersized+degraded+remapped+backfill_wait [1867,1455,1348] 1867 [1867,2036] 1867
11.421e active+undersized+degraded+remapped+backfilling [280,117,1421] 280 [280,117] 280
11.700  active+undersized+degraded+remapped+backfill_wait [2212,1422,2087] 2212 [2055,2087] 2055
11.735  active+undersized+degraded+remapped+backfilling [772,1832,1433] 772 [772,1832] 772
11.d5a  active+undersized+degraded+remapped+backfill_wait [423,1709,1441] 423 [423,1709] 423
11.a95  active+undersized+degraded+remapped+backfill_wait [1433,1180,978] 1433 [978,1180] 978
11.a67  active+undersized+degraded+remapped+backfill_wait [1154,1463,2151] 1154 [1154,2151] 1154
11.10ca active+undersized+degraded+remapped+backfill_wait [2012,486,1457] 2012 [2012,486] 2012
11.2439 active+undersized+degraded+remapped+backfill_wait [910,1457,1193] 910 [910,1193] 910
11.2f7e active+undersized+degraded+remapped+backfill_wait [1423,1356,2098] 1423 [1356,2098] 1356

After searching, I found that OSDs 1421, 1422, 1423, 1431, 1433, 1441, 1451, 1455, 1457 and 1463 are all running on the same, newly added host. I checked:

- The host did not reboot
- The OSDs did not restart

The OSDs have been up_thru since map 646724, which is from 11:05 this morning (4.5 hours ago), about the same time they were added. So these PGs are currently running on *2* replicas while they should be running on *3*.
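For reference, the up-vs-acting comparison above can be done mechanically on a pasted pgs_brief-style listing; a small awk sketch (pure text processing, no ceph CLI needed, with two of the lines above standing in for the full dump):

```shell
# Each line: PG STATE UP UP_PRIMARY ACTING ACTING_PRIMARY.
# Print PGs whose acting set holds fewer OSDs than the up set,
# i.e. PGs currently serving fewer replicas than CRUSH wants.
awk '
{
    n_up  = split($3, a, ",")   # $3 is the up set, e.g. [1795,639,1422]
    n_act = split($5, b, ",")   # $5 is the acting set, e.g. [1795,639]
    if (n_act < n_up)
        printf "%s up=%d acting=%d\n", $1, n_up, n_act
}' <<'EOF'
11.3b54 active+undersized+degraded+remapped+backfill_wait [1795,639,1422] 1795 [1795,639] 1795
11.421e active+undersized+degraded+remapped+backfilling [280,117,1421] 280 [280,117] 280
EOF
```

In practice you would feed it the live listing instead of the heredoc (e.g. from `ceph pg dump pgs_brief`).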
We just added 8 nodes with 24 disks each to the cluster, but none of the existing OSDs were touched.

When looking at PG 11.3b54 I see that 1422 is a backfill target:

$ ceph pg 11.3b54 query | jq '.recovery_state'

The 'enter time' for this is about 30 minutes ago, which is about when this happened. 'might_have_unfound' points to OSD 1982, which is in the same rack as 1422 (CRUSH replicates over racks), but that OSD is also online. Its up_thru is 647122, which is from about 30 minutes ago. That ceph-osd process, however, has been running since September and seems to be functioning fine.

This confuses me, as during such an expansion I'd normally expect a PG to map to size+1 replicas until the backfill finishes.

The cluster is running Luminous 12.2.8 on CentOS 7.5.

Any ideas on what this could be?

Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
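P.S. The up_thru check described above can also be scripted across all OSDs at once. A hedged sketch follows; the `ceph osd dump` line layout here is approximated (the up_from values are illustrative), so the scan looks for the `up_thru` keyword rather than assuming a fixed field position:

```shell
# Print each OSD's up_thru epoch from `ceph osd dump`-style output,
# so OSDs that re-peered recently (a new up_thru epoch) stand out.
# Normally you would pipe the live command in:  ceph osd dump | awk ...
# Here a two-line illustrative sample stands in for it.
awk '/^osd\./ {
    for (i = 1; i <= NF; i++)
        if ($i == "up_thru")           # scan: field position can vary
            print $1, "up_thru", $(i + 1)
}' <<'EOF'
osd.1422 up in weight 1 up_from 646720 up_thru 646724 down_at 0
osd.1982 up in weight 1 up_from 601000 up_thru 647122 down_at 0
EOF
```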