SSD journals killed by VMs generating 500 IOPs (4kB) non-stop for a month, seemingly because of a syslog-ng bug

I just had 2 of the 3 SSD journals in my small 3-node cluster fail within 24 hours of each other (not fun, although thanks to a replication factor of 3x, at least I didn't lose any data). The journals were 128 GB Samsung 850 Pros. However, I have since determined that it wasn't really their fault...

This is a small Ceph cluster running just a handful of relatively idle Qemu VMs using librbd for storage, and based on my low expected volume of write IO I had originally estimated that the Samsung 850 Pro journals would last at least 5 years (which would have been plenty). I still think that estimate was correct, but the reason they died prematurely (in reality they lasted 15 months) seems to have been that a number of my VMs had been hammering their disks continuously for almost a month, and I only noticed after the journals had died. I tracked it back to some sort of bug in syslog-ng: the affected VMs took an update to syslog-ng on October 24th, and from the next daily logrotate run early on the 25th, the syslog daemons were together generating about 500 IOPs of 4kB writes continuously for the next 4 weeks, until the journals failed.
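
In case it helps anyone else spot this kind of thing sooner, the rough sketch below shows one way to find a runaway writer from inside a VM using Python and psutil. This is not something I was running at the time, and the 30-second sample interval and 1 MB/s threshold are arbitrary choices:

    #!/usr/bin/env python3
    # Rough sketch: sample per-process write rates inside a VM to find a
    # runaway writer (e.g. a syslog daemon stuck in a write loop).
    # Uses psutil's io_counters(); interval and threshold are arbitrary.

    import time
    import psutil

    INTERVAL = 30          # seconds between samples (arbitrary)
    THRESHOLD = 1 << 20    # flag anything writing > 1 MB/s (arbitrary)

    def snapshot():
        """Return {pid: (name, bytes_written_so_far)} for visible processes."""
        out = {}
        for proc in psutil.process_iter(['pid', 'name']):
            try:
                io = proc.io_counters()
                out[proc.info['pid']] = (proc.info['name'], io.write_bytes)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        return out

    before = snapshot()
    time.sleep(INTERVAL)
    after = snapshot()

    for pid, (name, written) in sorted(after.items()):
        if pid in before:
            rate = (written - before[pid][1]) / INTERVAL
            if rate > THRESHOLD:
                print(f"pid {pid} ({name}) writing ~{rate / 1e6:.1f} MB/s")

Run it as root so io_counters() can see every process; in my case something like this would presumably have pointed straight at syslog-ng.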

As a result, I reckon that, taking write amplification into account, the SSDs must each have written just over 1PB over that period - way more than they are rated to handle - so I can't really blame the SSDs.
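
For anyone wanting to sanity-check that figure, here is the back-of-envelope arithmetic. The 150 TBW endurance rating for the 128 GB 850 Pro and the implied amplification factor are my assumptions/estimates, not measured values:

    # Back-of-envelope arithmetic for the figures above. The endurance rating
    # and the amplification factor are assumptions/estimates, not measurements.

    iops = 500                 # sustained 4 kB write IOPs (from the graphs)
    io_size = 4 * 1024         # bytes per write
    days = 28                  # roughly 4 weeks

    client_bytes = iops * io_size * 86400 * days
    print(f"raw client writes over the period: {client_bytes / 1e12:.1f} TB")
    # -> about 5 TB of raw 4 kB client writes

    rated_endurance_tb = 150   # assumed TBW rating for a 128 GB 850 Pro
    print(f"rated endurance: {rated_endurance_tb} TBW vs >1000 TB actually written")

    # Getting from ~5 TB of tiny synchronous client writes to >1 PB written to
    # the NAND implies a very large end-to-end amplification factor (journal
    # double-writes, filesystem/leveldb overhead, and the SSD's own internal
    # write amplification on small sync writes all stack up).
    implied_amplification = 1e15 / client_bytes
    print(f"implied end-to-end write amplification: ~{implied_amplification:.0f}x")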

I do have graphs tracking various metrics for the Ceph cluster, including IOPs, latency, and read/write throughput - which is how I worked out what had happened afterwards - but unfortunately I didn't have any alerting set up to warn me about anomalies in the graphs, and I wasn't looking at the graphs regularly either.
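
If anyone wants a starting point for that kind of alert, something along the lines of the sketch below would have caught this: it just polls 'ceph status --format json' and complains when write IOPs stay high for too long. I'm assuming the JSON exposes a write_op_per_sec field under pgmap (check the field names on your release), and the threshold and alert delivery are placeholders:

    #!/usr/bin/env python3
    # Minimal sketch of a write-IOPs sanity alert: poll cluster-wide write
    # IOPs from `ceph status --format json` and complain if the rate stays
    # above a threshold for too long. The pgmap/write_op_per_sec field names
    # and the numbers below are assumptions - adjust for your release/workload.

    import json
    import subprocess
    import time

    THRESHOLD_IOPS = 200       # anything sustained above this is suspicious here
    SUSTAINED_SECS = 3600      # ...for an hour
    POLL_INTERVAL = 60

    def write_iops():
        out = subprocess.check_output(['ceph', 'status', '--format', 'json'])
        status = json.loads(out)
        # Some releases omit this field when the cluster is idle.
        return status.get('pgmap', {}).get('write_op_per_sec', 0)

    over_since = None
    while True:
        iops = write_iops()
        if iops > THRESHOLD_IOPS:
            over_since = over_since or time.time()
            if time.time() - over_since >= SUSTAINED_SECS:
                # Replace with email/pager/whatever you actually use.
                print(f"ALERT: write IOPs {iops} sustained above "
                      f"{THRESHOLD_IOPS} for over {SUSTAINED_SECS}s")
                over_since = time.time()   # rearm rather than spamming
        else:
            over_since = None
        time.sleep(POLL_INTERVAL)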

So I think there is a lesson to be learned here... even if you have correctly spec'd your SSD journals for endurance against the anticipated level of write activity in the cluster, it's still important to check that the actual write activity matches those expectations, because it's quite easy for a misbehaving VM to severely shorten the life of your SSDs by generating 4k write IOs as fast as it can for weeks on end!

I have now replaced all 3 journals with 240 GB Samsung SM863 SSDs, which were only about twice the cost of the smaller 850 Pros, and I'm already noticing a massive performance improvement (lower write latency and higher IOPs). So I'm not too upset about having unnecessarily killed the 850 Pros. But I thought it was worth sharing the experience...

FWIW the OSDs themselves are on 1TB Samsung 840 Evos, which I have been happy with so far (they've been going for about 18 months at this stage).

Alex

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


