SSD journals killed by VMs generating 500 IOPs (4kB) non-stop for a month, seemingly because of a syslog-ng bug

I just had 2 of the 3 SSD journals in my small 3-node cluster fail within 24 hours of each other (not fun, although thanks to a replication factor of 3x, at least I didn't lose any data). The journals were 128 GB Samsung 850 Pros. However, I have since determined that it wasn't really their fault...

This is a small Ceph cluster running just a handful of relatively idle Qemu VMs using librbd for storage, and based on my low expected volume of write IO I had originally estimated that the Samsung 850 Pro journals would last at least 5 years (which would have been plenty). I still think that estimate was correct, but the reason they died prematurely (in reality they lasted 15 months) seems to have been that a number of my VMs had been hammering their disks continuously for almost a month, and I only noticed after the journals had died. I tracked it back to some sort of bug in syslog-ng: the affected VMs took an update to syslog-ng on October 24th, and from the next daily logrotate run early on the 25th, the syslog daemons were together generating about 500 IOPs of 4kB writes continuously for the next 4 weeks, until the journals failed.
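
In case it helps anyone else spot this kind of thing sooner, the rough sketch below shows one way to find a runaway writer from inside a VM using Python and psutil. This is not something I was running at the time, and the 30-second sample interval and 1 MB/s threshold are arbitrary choices:

    #!/usr/bin/env python3
    # Rough sketch: sample per-process write rates inside a VM to find a
    # runaway writer (e.g. a syslog daemon stuck in a write loop).
    # Uses psutil's io_counters(); interval and threshold are arbitrary.

    import time
    import psutil

    INTERVAL = 30          # seconds between samples (arbitrary)
    THRESHOLD = 1 << 20    # flag anything writing > 1 MB/s (arbitrary)

    def snapshot():
        """Return {pid: (name, bytes_written_so_far)} for visible processes."""
        out = {}
        for proc in psutil.process_iter(['pid', 'name']):
            try:
                io = proc.io_counters()
                out[proc.info['pid']] = (proc.info['name'], io.write_bytes)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        return out

    before = snapshot()
    time.sleep(INTERVAL)
    after = snapshot()

    for pid, (name, written) in sorted(after.items()):
        if pid in before:
            rate = (written - before[pid][1]) / INTERVAL
            if rate > THRESHOLD:
                print(f"pid {pid} ({name}) writing ~{rate / 1e6:.1f} MB/s")

Run it as root so io_counters() can see every process; in my case something like this would presumably have pointed straight at syslog-ng.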

As a result, I reckon that, taking write amplification into account, the SSDs must each have written just over 1PB over that period - way more than they are rated to handle - so I can't really blame the SSDs.
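
For anyone wanting to sanity-check that figure, here is the back-of-envelope arithmetic. The 150 TBW endurance rating for the 128 GB 850 Pro and the implied amplification factor are my assumptions/estimates, not measured values:

    # Back-of-envelope arithmetic for the figures above. The endurance rating
    # and the amplification factor are assumptions/estimates, not measurements.

    iops = 500                 # sustained 4 kB write IOPs (from the graphs)
    io_size = 4 * 1024         # bytes per write
    days = 28                  # roughly 4 weeks

    client_bytes = iops * io_size * 86400 * days
    print(f"raw client writes over the period: {client_bytes / 1e12:.1f} TB")
    # -> about 5 TB of raw 4 kB client writes

    rated_endurance_tb = 150   # assumed TBW rating for a 128 GB 850 Pro
    print(f"rated endurance: {rated_endurance_tb} TBW vs >1000 TB actually written")

    # Getting from ~5 TB of tiny synchronous client writes to >1 PB written to
    # the NAND implies a very large end-to-end amplification factor (journal
    # double-writes, filesystem/leveldb overhead, and the SSD's own internal
    # write amplification on small sync writes all stack up).
    implied_amplification = 1e15 / client_bytes
    print(f"implied end-to-end write amplification: ~{implied_amplification:.0f}x")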

I do have graphs tracking various metrics for the Ceph cluster, including IOPs, latency, and read/write throughput - which is how I worked out what had happened afterwards - but unfortunately I didn't have any alerting set up to warn me about anomalies in the graphs, and I wasn't looking at the graphs regularly either.
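
If anyone wants a starting point for that kind of alert, something along the lines of the sketch below would have caught this: it just polls 'ceph status --format json' and complains when write IOPs stay high for too long. I'm assuming the JSON exposes a write_op_per_sec field under pgmap (check the field names on your release), and the threshold and alert delivery are placeholders:

    #!/usr/bin/env python3
    # Minimal sketch of a write-IOPs sanity alert: poll cluster-wide write
    # IOPs from `ceph status --format json` and complain if the rate stays
    # above a threshold for too long. The pgmap/write_op_per_sec field names
    # and the numbers below are assumptions - adjust for your release/workload.

    import json
    import subprocess
    import time

    THRESHOLD_IOPS = 200       # anything sustained above this is suspicious here
    SUSTAINED_SECS = 3600      # ...for an hour
    POLL_INTERVAL = 60

    def write_iops():
        out = subprocess.check_output(['ceph', 'status', '--format', 'json'])
        status = json.loads(out)
        # Some releases omit this field when the cluster is idle.
        return status.get('pgmap', {}).get('write_op_per_sec', 0)

    over_since = None
    while True:
        iops = write_iops()
        if iops > THRESHOLD_IOPS:
            over_since = over_since or time.time()
            if time.time() - over_since >= SUSTAINED_SECS:
                # Replace with email/pager/whatever you actually use.
                print(f"ALERT: write IOPs {iops} sustained above "
                      f"{THRESHOLD_IOPS} for over {SUSTAINED_SECS}s")
                over_since = time.time()   # rearm rather than spamming
        else:
            over_since = None
        time.sleep(POLL_INTERVAL)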

So I think there is a lesson to be learned here... even if you have correctly spec'd your SSD journals for endurance against the anticipated level of write activity in the cluster, it's still important to check that the actual write activity matches those expectations, because it's quite easy for a misbehaving VM to severely shorten the life of your SSDs by generating 4k write IOs as fast as it can for weeks on end!

I have now replaced all 3 journals with 240 GB Samsung SM863 SSDs, which were only about twice the cost of the smaller 850 Pros, and I'm already noticing a massive performance improvement (lower write latency and higher IOPs). So I'm not too upset about having unnecessarily killed the 850 Pros. But I thought it was worth sharing the experience...

FWIW the OSDs themselves are on 1TB Samsung 840 Evos, which I have been happy with so far (they've been going for about 18 months at this stage).

Alex

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


