I just had 2 of the 3 SSD journals in my small 3-node cluster fail
within 24 hours of each other (not fun, although thanks to a replication
factor of 3x, at least I didn't lose any data). The journals were 128 GB
Samsung 850 Pros. However, I have determined that it wasn't really their
fault...

This is a small Ceph cluster running just a handful of relatively idle
QEMU VMs using librbd for storage. Based on my low expected volume of
write IO, I had originally estimated that the Samsung 850 Pro journals
would last at least 5 years (which would have been plenty). I still
think that estimate was correct. The reason they died prematurely
(in reality they lasted 15 months) seems to be that a number of my VMs
had been hammering their disks continuously for almost a month, which I
only noticed in retrospect, after the journals had died. I tracked it
back to some sort of bug in syslog-ng: the affected VMs took an update
to syslog-ng on October 24th, and from the following daily logrotate
early on the 25th, the syslog daemons together generated about 500 IOPS
of 4 KB writes continuously for the next 4 weeks, until the journals
failed.

As a result, I reckon that, taking write amplification into account, the
SSDs must each have written just over 1 PB over that period - way more
than they are supposed to be able to handle - so I can't blame the SSDs.
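
For anyone who wants to redo this kind of arithmetic for their own
cluster, here is a rough sketch of the client-side part of it. It only
covers what the VMs themselves wrote (500 IOPS of 4 KB for 4 weeks). The
write amplification factor is not something this post gives a number
for, so it is left as a parameter, and the 4096-byte IO size and the
function names are assumptions of the sketch.

```python
# Back-of-envelope estimate of the client write volume described above.
# Assumptions: "4 KB" is taken as 4096 bytes, and "4 weeks" as 28 days.

SECONDS_PER_DAY = 24 * 3600


def client_bytes_written(iops, io_size_bytes, days):
    """Total bytes written by the clients at a constant rate over `days`."""
    return iops * io_size_bytes * days * SECONDS_PER_DAY


def amplified_bytes_written(client_bytes, amplification):
    """Scale client writes by a combined write amplification factor.

    `amplification` is a placeholder for everything between the VM and
    the flash (journal overhead, SSD-internal amplification); it has to
    be measured or estimated for the hardware in question.
    """
    return client_bytes * amplification


if __name__ == "__main__":
    total = client_bytes_written(iops=500, io_size_bytes=4096, days=28)
    print(f"client writes over 4 weeks: {total / 1e12:.2f} TB")
```

With a replication factor of 3 on 3 nodes, each journal would see roughly
one copy of every client write, so the figure printed is a lower bound
per journal before any amplification is applied.
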

I do have graphs tracking various metrics for the Ceph cluster,
including IOPS, latency, and read/write throughput - which is how I
worked out afterwards what had happened - but unfortunately I didn't
have any alerting set up to warn me about anomalies in the graphs, and I
wasn't proactively looking at the graphs on a regular basis.
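
The kind of alerting that was missing could be as simple as the sketch
below. It is only an illustration: the function name, the factor of 3
over baseline, and the 12-sample window are all hypothetical choices,
and the samples are assumed to come from whatever already feeds the
graphs (for example one write-IOPS reading every 5 minutes).

```python
from typing import Sequence


def sustained_write_anomaly(
    samples: Sequence[float],
    baseline_iops: float,
    factor: float = 3.0,
    min_consecutive: int = 12,
) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed
    `factor` times the expected baseline.

    Requiring a run of consecutive high samples avoids alerting on short
    bursts, while still catching a VM that writes flat out for hours.
    """
    if len(samples) < min_consecutive:
        return False  # not enough history to judge yet
    threshold = baseline_iops * factor
    return all(s > threshold for s in samples[-min_consecutive:])


if __name__ == "__main__":
    quiet = [40.0] * 30
    runaway = [40.0] * 18 + [500.0] * 12
    print(sustained_write_anomaly(quiet, baseline_iops=50))
    print(sustained_write_anomaly(runaway, baseline_iops=50))
```

With 5-minute samples, the default window corresponds to one hour of
continuously elevated writes before the check fires.
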

So I think there is a lesson to be learned here... even if you have
correctly spec'd your SSD journals in terms of endurance for the
anticipated level of write activity in a cluster, it's still important
to check that the actual write activity matches expectations, as it's
quite easy for a misbehaving VM to severely shorten the life of SSDs by
generating 4 KB write IOs as quickly as it can for a long period of
time!
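
One way to make "matches expectations" concrete is to turn the endurance
spec into a daily write budget and compare the observed rate against it.
A minimal sketch follows. The rated endurance figure has to come from
the drive's datasheet, so the 150 TBW used in the example is only a
placeholder, and the function names are made up for illustration.

```python
def daily_write_budget_gb(rated_tbw: float, target_years: float) -> float:
    """Writes per day (in GB) that would use up `rated_tbw` terabytes of
    rated endurance in exactly `target_years` years."""
    if rated_tbw <= 0 or target_years <= 0:
        raise ValueError("rated_tbw and target_years must be positive")
    return rated_tbw * 1000.0 / (target_years * 365.0)


def budget_ratio(observed_gb_per_day: float, rated_tbw: float,
                 target_years: float) -> float:
    """Observed daily writes as a multiple of the daily budget.

    A value above 1.0 means the drive is wearing faster than planned.
    """
    return observed_gb_per_day / daily_write_budget_gb(rated_tbw, target_years)


if __name__ == "__main__":
    # 500 IOPS of 4096-byte writes, before any write amplification.
    observed = 500 * 4096 * 86400 / 1e9  # GB per day
    # Placeholder endurance rating; check the datasheet for the real value.
    ratio = budget_ratio(observed, rated_tbw=150, target_years=5)
    print(f"observed {observed:.1f} GB/day, {ratio:.2f}x the daily budget")
```

The observed figure should ideally be measured at the journal device
itself, so that it already includes whatever overhead sits between the
VMs and the SSD.
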

I have now replaced all 3 journals with 240 GB Samsung SM863 SSDs, which
were only about twice the cost of the smaller 850 Pros. And I'm already
noticing a massive performance improvement (reduction in write latency,
and higher IOPS). So I'm not too upset about having unnecessarily killed
the 850 Pros. But I thought it was worth sharing the experience...

FWIW, the OSDs themselves are on 1 TB Samsung 840 Evos, which I have been
happy with so far (they've been going for about 18 months at this stage).

Alex