OSD_FULL raised when osd was not full (octopus 15.2.16)

Hi,

Yesterday we hit OSD_FULL / POOL_FULL conditions for two brief moments. As all OSDs are present in all pools, all IO was stalled, which impacted a few MDS clients (they got evicted). Although the impact was limited, I *really* would like to understand how this could happen, because as far as I can tell it should not have happened. And it freaked me out. Logs:

2022-06-01T14:04:00.043+0200 7fbbc683a700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2022-06-01T14:04:06.159+0200 7fbbc683a700  0 log_channel(cluster) log [INF] : Health check cleared: OSD_FULL (was: 1 full osd(s))
2022-06-01T14:04:11.319+0200 7fbbc683a700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2022-06-01T14:04:33.027+0200 7fbbc683a700  0 log_channel(cluster) log [INF] : Health check cleared: OSD_FULL (was: 1 full osd(s))

The weird thing is, the fullest OSD at that time was 82.759% full (something we monitor very closely). The full ratio was 0.90, the backfillfull ratio 0.90, and the nearfull ratio 0.85.
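For anyone who wants to cross-check this on their own cluster, the configured ratios and the per-OSD utilization can be queried with something like the following (standard Ceph CLI; the %USE column index in `ceph osd df` may differ between releases, adjust the sort key if needed):

```shell
# Show the cluster-wide full / backfillfull / nearfull ratios from the OSDMap
ceph osd dump | grep -i ratio

# Per-OSD utilization; sort numerically on the %USE column so the
# fullest OSDs end up at the bottom of the output
ceph osd df | sort -nk 8
```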

OSD_NEARFULL was never logged, so the cluster somehow "jumped" straight to the full state, a few times.

Observation: it seems that the OSD IDs of the full OSD(s) are not logged anywhere, while OSD_NEARFULL OSDs *do* get logged. I was not able to run a `ceph health detail` fast enough. I have not found the code responsible for logging the nearfull OSD IDs, but I guess it is missing for full OSDs. I can create a tracker issue for that.
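In the meantime, a crude way to catch the OSD IDs the next time this happens is to poll `ceph health detail` in a loop (a rough sketch, not a proper monitoring solution; the file path is just an example):

```shell
# Append a timestamped "ceph health detail" to a file every few seconds,
# so transient OSD_FULL states are captured even if they clear quickly
while true; do
    echo "=== $(date -Is) ===" >> /tmp/ceph-health-capture.log
    ceph health detail >> /tmp/ceph-health-capture.log
    sleep 5
done
```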

At the time this flag was raised there were a lot of remapped PGs (~1200). There were ~21 backfill_toofull PGs and ~8 backfillfull OSDs. The norebalance flag was set. No degraded data, only misplaced. We were performing a "reverse balance" with the upmap-remap.py script (so the Ceph balancer could slowly move the PGs back later on). A couple of minutes earlier we had set 10 OSDs of one host "out" (hence the remaps). We have performed this operation many times in the past month without issues.

Was this a glitch? Or is there a valid reason for Ceph to raise an OSD_FULL when there are (potentially) many backfillfull/nearfull OSDs?

How can I find out which OSD was "full"? I.e., what keywords should I grep for in the OSD logs, if it is logged there at all?
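My first guess would be to grep the mons' cluster log rather than the individual OSD logs, since that is where the health check transitions are recorded; the exact keywords below are guesses on my part, and the log paths assume a default (non-containerized) deployment:

```shell
# The cluster log on the mons records the OSD_FULL health transitions
grep -i 'OSD_FULL' /var/log/ceph/ceph.log

# The OSDs themselves only seem to log fullness when the failsafe full
# check engages (at a higher ratio, 0.97 by default), so this may well
# come up empty in our case
grep -i 'full' /var/log/ceph/ceph-osd.*.log
```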

Thanks,

Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


