Re: Real world benefit from SSD Journals for a more read than write cluster

Lionel Bouton <lionel+ceph@xxxxxxxxxxx> · Sun, 12 Jul 2015 14:33:03 +0200

On 07/12/15 05:55, Alex Gorbachev wrote:
> FWIW. Based on the excellent research by Mark Nelson
> (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/)
> we have dropped SSD journals altogether, and instead went for the
> battery protected controller writeback cache.

Note that this has limitations (and the research is nearly 2 years old):
- the controller writeback caches are relatively small (often less than
4GB, 2GB is common; a small portion is not usable and 10% of the rest is
often used for readahead/read cache) and this cache is shared by all of
your drives. If your workload is not oriented towards "write spikes" but
consists of nearly constant writes, this won't help: with the journal on
the same disk every write is done twice, so each OSD will be limited to
roughly half of the disk IOPS. With journals on SSDs, when you hit their
limit (~5GB of buffer for a 10GB journal, instead of <2GB divided by the
number of OSDs per controller), the limit is the raw disk IOPS.
- you *must* make sure the controller is configured to switch to
write-through when the battery/capacitor fails (otherwise a power failure
on hardware from the same generation could make you lose all of the OSDs
connected to those controllers in a single event, which means data loss).
- you should monitor the battery/capacitor status to trigger maintenance
(your cluster will slow down while the battery/capacitor is waiting for
a replacement, and you might want to take the associated OSDs down,
depending on your cluster configuration). We mostly eliminated this
problem by replacing the servers we lease with a new chassis generation
every 2 or 3 years: if you time the hardware replacement to match a
fresh chassis generation, you get fresh capacitors which shouldn't fail
you (ours are rated for 3 years).
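
The arithmetic behind the first point can be sketched roughly as below, using the figures above. This is a back-of-the-envelope model, not a measurement: the 5% unusable fraction, 12 OSDs per controller and 150 IOPS per spinning disk are assumptions chosen for illustration, the helper names are hypothetical, and the "half the IOPS" rule assumes the journal sits on the same spinning disk as the data so every client write hits it twice.

```python
# Rough comparison of per-OSD write buffering and sustained write IOPS.
# Assumed inputs (not measured): 5% of the controller cache unusable,
# 12 OSDs per controller, 150 IOPS per spinning disk.

def controller_cache_per_osd_gb(cache_gb, unusable_fraction, read_fraction, osds):
    """Write cache left for each OSD once the unusable and read-cache shares are removed."""
    usable = cache_gb * (1 - unusable_fraction)
    write_cache = usable * (1 - read_fraction)
    return write_cache / osds

def ssd_journal_buffer_gb(journal_gb):
    """Roughly half of a journal's size acts as effective write buffer."""
    return journal_gb / 2

def sustained_write_iops(disk_iops, journal_on_ssd):
    """Colocated journal: each client write hits the disk twice, halving IOPS."""
    return disk_iops if journal_on_ssd else disk_iops / 2

if __name__ == "__main__":
    per_osd = controller_cache_per_osd_gb(2, 0.05, 0.10, 12)
    print(f"controller write cache per OSD: {per_osd:.2f} GB")
    print(f"SSD journal buffer per OSD: {ssd_journal_buffer_gb(10):.1f} GB")
    print(f"sustained IOPS, colocated journal: {sustained_write_iops(150, False):.0f}")
    print(f"sustained IOPS, SSD journal: {sustained_write_iops(150, True):.0f}")
```

With these assumed inputs each OSD gets about 0.14GB of controller write cache versus about 5GB of SSD journal buffer, which is why sustained writes exhaust the controller cache so much sooner.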

We just ordered Intel S3710 SSDs even though we have battery/capacitor
backed caches on the controllers: latencies have nevertheless started to
rise during long periods of write-intensive activity.
I'm currently pondering whether we should bypass the write cache for the
SSDs. The cache is obviously less effective on them and might be more
useful overall if it is dedicated to the rotating disks. Does anyone
have test results with the cache enabled/disabled on SSD journals with HP
Smart Array P420 or P840 controllers?
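
One way to produce such numbers is to run the same synchronous small-write test against an SSD with the controller cache enabled, then again after disabling it with the controller's management tool, and compare the latency distributions. A minimal sketch, assuming a Linux host and a scratch file on a filesystem on the SSD under test (never a live OSD journal); 4KB O_DSYNC writes only approximate a journal workload (which also uses O_DIRECT), and a dedicated tool such as fio is the more usual choice. The function names are hypothetical.

```python
import os
import statistics
import sys
import tempfile
import time

def sync_write_latencies_ms(path, count=1000, block_size=4096):
    """Time `count` sequential synchronous writes of `block_size` bytes to `path`."""
    buf = os.urandom(block_size)
    has_dsync = hasattr(os, "O_DSYNC")
    # O_DSYNC makes each write return only once the data is on stable storage.
    flags = os.O_WRONLY | os.O_CREAT | (os.O_DSYNC if has_dsync else 0)
    fd = os.open(path, flags, 0o600)
    latencies = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, buf)
            if not has_dsync:
                os.fsync(fd)  # fallback on platforms without O_DSYNC
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies

def summarize(latencies):
    """Median, approximate 99th percentile and maximum latency in milliseconds."""
    ordered = sorted(latencies)
    p99 = ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))]
    return {"median_ms": statistics.median(ordered), "p99_ms": p99, "max_ms": ordered[-1]}

if __name__ == "__main__":
    # Pass a scratch file path on the SSD under test; defaults to the temp dir.
    target = sys.argv[1] if len(sys.argv) > 1 else os.path.join(tempfile.gettempdir(), "dsync-test.bin")
    print(summarize(sync_write_latencies_ms(target)))
```

Running it once per cache setting on the same SSD gives directly comparable median and tail latencies; the tail (p99/max) is where a saturated controller cache would be expected to show up.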

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com