May 10 Upstream Lab Outage

Hi all,

I wanted to provide an RCA for the outage that may have affected you yesterday.  Among the services that went down:

- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)

All of our critical infra runs in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS).  Before you go, "wait.. Gluster?"  Yes, this cluster was set up before Ceph was supported as a storage backend for RHV/RHEV.

The root cause of the outage is that the Gluster volumes hit 100% full.  Once no writes were possible, RHV paused all the VMs.
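
For reference, this is roughly the kind of check that would have shown the problem on the Gluster side.  The volume name is a placeholder (ours may differ), and this is just a sketch, not something we were running:

# Per-brick capacity as Gluster sees it (Total/Free Disk Space per brick)
gluster volume status <volname> detail | grep -E 'Brick|Disk Space'

# Or straight from the filesystem backing the bricks
df -h /gluster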

Why didn't monitoring catch this?  I honestly don't know.

# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now

Yet RHV knew we were running out of space.  I don't have e-mail notifications set up in RHV, however.

# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
 Low disk space. hosted_storage domain has 24 GB of free space.
 Low disk space. hosted_storage domain has 24 GB of free space.
 Low disk space. hosted_storage domain has 23 GB of free space.
 Low disk space. hosted_storage domain has 23 GB of free space.
 Low disk space. hosted_storage domain has 23 GB of free space.
 Low disk space. hosted_storage domain has 23 GB of free space.
 Low disk space. hosted_storage domain has 23 GB of free space.
 Low disk space. hosted_storage domain has 21 GB of free space.
 Low disk space. hosted_storage domain has 20 GB of free space.
 Low disk space. hosted_storage domain has 11 GB of free space.
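
That's something to fix on our end.  From memory (so treat the exact paths and keys as assumptions and check the RHV docs), wiring up e-mail notifications looks roughly like this on the engine host:

# Point the ovirt-engine-notifier at an SMTP server and enable it
cat > /etc/ovirt-engine/notifier/notifier.conf.d/99-mail.conf <<'EOF'
MAIL_SERVER=smtp.example.com
MAIL_FROM=rhv-alerts@example.com
EOF
systemctl enable --now ovirt-engine-notifier
# then subscribe a user to the storage events in the Administration Portal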

Our Nagios instances run this to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libexec/diskusage.pl
You can ignore the comment about it only working for EXT2.

[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now

I ran this manually on one of the storage hosts and intentionally set the second threshold to a number lower than the current usage percentage (77%) to confirm the check can go non-OK:

[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem      Size  Used Avail Use% Mounted on
/dev/md124      8.8T  6.7T  2.1T  77% /gluster

[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%    
[root@ssdstore01 ~]# echo $?
2
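
Exit code 2 is CRITICAL in the Nagios plugin convention, so the check logic itself behaves as expected when run by hand.  For illustration only, here's a minimal shell sketch of the same kind of warn/crit check; it is not the actual Perl script, and the warn-then-crit argument order is my assumption:

#!/bin/bash
# Usage: diskcheck.sh WARN CRIT MOUNTPOINT   e.g. diskcheck.sh 90 95 /gluster
warn=$1; crit=$2; mnt=$3
# current Use% for the mountpoint, stripped down to a bare integer
use=$(df --output=pcent "$mnt" | tail -1 | tr -d ' %')
if [ "$use" -ge "$crit" ]; then
    echo "$mnt is at ${use}%"; exit 2     # CRITICAL
elif [ "$use" -ge "$warn" ]; then
    echo "$mnt is at ${use}%"; exit 1     # WARNING
else
    echo "Disks are OK now"; exit 0       # OK
fi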

When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%, so Nagios should have known.

How'd it get fixed?  I happened to have some large-capacity drives lying around that fit the storage nodes (they're due to be installed in a different project soon, but I was able to put them to use here in the meantime).  I added these drives, added "bricks" to the Gluster storage, then rebalanced the data.  Once that was done, I was able to restart all the VMs and delete old VMs and snapshots I no longer needed.
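
Roughly, the Gluster side of that looks like this (volume name, hostnames, and brick paths are placeholders, and a replicated volume needs bricks added in multiples of its replica count):

# after partitioning/formatting the new drives and mounting them on each node
gluster volume add-brick <volname> ssdstore01:/gluster2/brick ssdstore02:/gluster2/brick
gluster volume rebalance <volname> start
gluster volume rebalance <volname> status    # wait for "completed" on all nodes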

How do we keep this from happening again?  Well, as you may have been able to deduce from the RHV log above, we were losing free space at a rate of roughly 1-10 GB/day.  As you can see from the df output above, the Gluster volume now has 2.1 TB left, so even if we grew by 10 GB/day again, we'd be okay for 200ish days (2.1 TB / 10 GB per day is roughly 210 days).
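
To keep an eye on that runway, a one-liner along these lines works; the 10 GB/day growth rate is just a guess based on the RHV log:

# days of headroom left on /gluster at an assumed 10 GB/day growth
df -BG --output=avail /gluster | tail -1 | tr -d ' G' | awk '{print $1 / 10, "days"}'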

I aim to have some (if not all) of these services moved off this platform and into an OpenShift cluster backed by Ceph this year.  Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.

-- 
David Galloway
Senior Systems Administrator
Ceph Engineering
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


