Re: backfill_toofull seen on cluster where the most full OSD is at 1%

On Aug 16, 2019, at 5:48 PM, Vikhyat Umrao <vikhyat@xxxxxxxxxx> wrote:



On Fri, Aug 16, 2019 at 2:58 PM Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
I originally sent this to the old ceph-devel mailing list, so I apologize if you get it twice...

We've run into this issue on the first two clusters after upgrading them to Nautilus (14.2.2).

When marking a single OSD back in to the cluster, some PGs will switch to the active+remapped+backfill_wait+backfill_toofull state for a while, and then the state clears after some of the other PGs finish backfilling.  This is rather odd because all the data on the cluster could fit on a single drive, but we have over 100 of them:

# ceph -s
 cluster:
   id:     XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
   health: HEALTH_ERR
           Degraded data redundancy (low space): 1 pg backfill_toofull

 services:
   mon: 3 daemons, quorum a1cephmon002,a1cephmon003,a1cephmon004 (age 21h)
   mgr: a1cephmon002(active, since 21h), standbys: a1cephmon003, a1cephmon004
   mds: cephfs:2 {0=a1cephmon002=up:active,1=a1cephmon003=up:active} 1 up:standby
   osd: 143 osds: 142 up, 142 in; 106 remapped pgs
   rgw: 11 daemons active (radosgw.a1cephrgw008, radosgw.a1cephrgw009, radosgw.a1cephrgw010, radosgw.a1cephrgw011, radosgw.a1tcephrgw002, radosgw.a1tcephrgw003, radosgw.a1tcephrgw004, radosgw.a1tcephrgw005, radosgw.a1tcephrgw006, radosgw.a1tcephrgw007, radosgw.a1tcephrgw008)

 data:
   pools:   19 pools, 5264 pgs
   objects: 1.45M objects, 148 GiB
   usage:   658 GiB used, 436 TiB / 437 TiB avail
   pgs:     44484/4351770 objects misplaced (1.022%)
            5158 active+clean
            104  active+remapped+backfill_wait
            1    active+remapped+backfilling
            1    active+remapped+backfill_wait+backfill_toofull

 io:
   client:   19 MiB/s rd, 13 MiB/s wr, 431 op/s rd, 509 op/s wr
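
For reference, these are the standard commands for double-checking OSD fullness against the backfill ratios (the ratio defaults quoted below are an assumption and would need to be confirmed against the second command's output):

# ceph osd df
     (per-OSD utilization; the most full OSD here is around 1% used)
# ceph osd dump | grep ratio
     (the configured full_ratio / backfillfull_ratio / nearfull_ratio; the defaults are 0.95 / 0.90 / 0.85)
# ceph pg ls backfill_toofull
     (lists the PG flagged backfill_toofull along with its up and acting OSD sets)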


I searched the archives, but most of the other reports involved much fuller clusters, where this state could sometimes be valid.  This bug report seems similar, but the fix was just to make it a warning instead of an error:

https://tracker.ceph.com/issues/39555


So I've created a new tracker ticket to troubleshoot this issue:

https://tracker.ceph.com/issues/4125

Bryan - looks like the last digit is missing from the tracker URL.

Oops, here's the full URL:


Bryan
