On 11/15/18 4:37 AM, Gregory Farnum wrote:
> This is weird. Can you capture the pg query for one of them and narrow
> down in which epoch it “lost” the previous replica and see if there’s
> any evidence of why?

So I checked further, dug deeper into the logs and found this on osd.1982:

2018-11-14 15:03:04.261689 7fde7b525700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1982 down, but it is still running
2018-11-14 15:03:04.261713 7fde7b525700  0 log_channel(cluster) log [DBG] : map e647120 wrongly marked me down at e647120

After searching further (Zabbix graphs) it seems that this machine had a
spike in CPU load around that time, which probably caused the OSD to be
marked down.

As osd.1982 was involved with these PGs, they are now in the
undersized+degraded state. Recovery didn't start; instead Ceph chose to
wait for backfill, as the PGs needed to be vacated from this OSD anyway.

The side effect is that it took 14 hours before these PGs started to
backfill. I would say that a PG which is undersized+degraded should get
the highest possible priority so it is repaired as soon as possible.
(A possible manual workaround is sketched at the end of this mail, below
the quoted report.)

Wido

> On Wed, Nov 14, 2018 at 8:09 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
> Hi,
>
> I'm in the middle of expanding a Ceph cluster and while having 'ceph -s'
> open I suddenly saw a bunch of Placement Groups go undersized.
>
> My first hint was that one or more OSDs had failed, but none did.
>
> So I checked and I saw these Placement Groups undersized:
>
> 11.3b54 active+undersized+degraded+remapped+backfill_wait [1795,639,1422]  1795 [1795,639]  1795
> 11.362f active+undersized+degraded+remapped+backfill_wait [1431,1134,2217] 1431 [1134,1468] 1134
> 11.3e31 active+undersized+degraded+remapped+backfill_wait [1451,1391,1906] 1451 [1906,2053] 1906
> 11.50c  active+undersized+degraded+remapped+backfill_wait [1867,1455,1348] 1867 [1867,2036] 1867
> 11.421e active+undersized+degraded+remapped+backfilling   [280,117,1421]   280  [280,117]   280
> 11.700  active+undersized+degraded+remapped+backfill_wait [2212,1422,2087] 2212 [2055,2087] 2055
> 11.735  active+undersized+degraded+remapped+backfilling   [772,1832,1433]  772  [772,1832]  772
> 11.d5a  active+undersized+degraded+remapped+backfill_wait [423,1709,1441]  423  [423,1709]  423
> 11.a95  active+undersized+degraded+remapped+backfill_wait [1433,1180,978]  1433 [978,1180]  978
> 11.a67  active+undersized+degraded+remapped+backfill_wait [1154,1463,2151] 1154 [1154,2151] 1154
> 11.10ca active+undersized+degraded+remapped+backfill_wait [2012,486,1457]  2012 [2012,486]  2012
> 11.2439 active+undersized+degraded+remapped+backfill_wait [910,1457,1193]  910  [910,1193]  910
> 11.2f7e active+undersized+degraded+remapped+backfill_wait [1423,1356,2098] 1423 [1356,2098] 1356
>
> After searching I found that OSDs
> 1422,1431,1451,1455,1421,1422,1433,1441,1433,1463,1457,1457 and 1423 are
> all running on the same (newly) added host.
>
> I checked:
> - The host did not reboot
> - The OSDs did not restart
>
> The OSDs are up_thru since map 646724, which is from 11:05 this morning
> (4.5 hours ago), which is about the same time these were added.
>
> So these PGs are currently running on *2* replicas while they should be
> running on *3*.
>
> We just added 8 nodes with 24 disks each to the cluster, but none of the
> existing OSDs were touched.
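As an aside, for anyone who wants to run the same checks: listing the
undersized PGs and mapping a suspect OSD back to its host can be done with
something like the commands below. This is only a rough sketch; the exact
JSON layout of 'ceph osd find' output may differ between releases, so the
jq filter may need adjusting.

$ ceph pg dump_stuck undersized
$ ceph pg 11.3b54 query | jq '.recovery_state'
$ ceph osd find 1422 | jq -r '.crush_location.host'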
>
> When looking at PG 11.3b54 I see that 1422 is a backfill target:
>
> $ ceph pg 11.3b54 query | jq '.recovery_state'
>
> The 'enter time' for this is about 30 minutes ago and that's about the
> same time this happened.
>
> 'might_have_unfound' tells me OSD 1982, which is in the same rack as 1422
> (CRUSH replicates over racks), but that OSD is also online.
>
> Its up_thru = 647122 and that's from about 30 minutes ago. That
> ceph-osd process has however been running since September and seems to be
> functioning fine.
>
> This confuses me, as during such an expansion I know that normally a PG
> would map to size+1 until the backfill finishes.
>
> The cluster is running Luminous 12.2.8 on CentOS 7.5.
>
> Any ideas on what this could be?
>
> Wido
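To follow up on the priority point above: until PG prioritization handles
this case better, it should be possible to push the affected PGs to the
front of the queue by hand. A rough sketch, assuming the force-recovery /
force-backfill commands are available on this 12.2.8 cluster (I haven't
verified how much they help in this exact undersized+degraded+backfill_wait
situation):

$ ceph pg force-recovery 11.3b54
$ ceph pg force-backfill 11.3b54

The marks can be cleared again with cancel-force-recovery /
cancel-force-backfill once the PGs are clean.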