On Mon, Feb 26, 2018 at 6:25 PM, Brian Woods <bpwoods@xxxxxxxxx> wrote:
> I have a small test cluster (just two nodes) and after rebuilding it
> several times I found my latest configuration that SHOULD be the fastest
> is by far the slowest (per thread).
>
> I have around 10 spindles that I have an erasure-coded CephFS on. When I
> installed several SSDs and recreated it with the metadata and the write
> cache on SSD, my performance plummeted from about 10-20MBps to 2-3MBps,
> but only per thread… I did a rados benchmark and the SSD metadata and
> write-cache pools can sustain anywhere from 50 to 150MBps without issue.
>
> And, if I spool up multiple copies to the FS, each copy adds to that
> throughput without much of a hit. In fact I can go up to about 8 copies
> (about 16MBps) before they start slowing down at all. Even while I have
> several threads actively writing, I still benchmark around 25MBps.

If a CephFS system is experiencing substantial latency doing metadata
operations, then you may find that the overall data throughput is much
worse with a single writer process than with several. That would be
because typical workloads like "cp" or "tar" are entirely serial, and will
wait for one metadata operation (such as creating a file) to complete
before doing any more work.

In your case, I would suspect that your metadata latency got a lot worse
when you switched from dedicating your SSDs to metadata, to sharing your
SSDs between metadata and a cache tier. This is one of many situations in
which configuring a cache tier can make your performance worse rather than
better. Cache tiers generally only make sense if you know you have a "hot"
subset of a larger dataset, and that subset fits in your cache tier.

> Any ideas why single-threaded performance would take a hit like this?
> Almost everything is running on a single node (just a few OSDs on another
> node) and I have plenty of RAM (96GB) and CPU (8 Xeon cores).

In general, performance testing you do on one or two nodes is unlikely to
translate well to what would happen on a more typically sized cluster. If
building a "mini" Ceph cluster for performance testing, I'd suggest at the
very minimum that you start with three servers for OSDs, a separate one
for the MDS, and another separate one for the client. That way, you have
network hops in all the right places, rather than having the two-node
situation where some arbitrary 50% of messages are not actually traversing
a network, and where clients are competing for CPU time with servers.

John
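
For anyone who wants to reproduce the single-writer vs. multi-writer effect
described above, here is a rough sketch (not from the original thread; the
mount point, file count, and file size are made-up assumptions) that times
creating a batch of small files serially and then with several concurrent
workers. On a mount where each file create has to round-trip to the MDS
with noticeable latency, the concurrent run should show much higher
aggregate create throughput, matching the behaviour Brian reported.

#!/usr/bin/env python3
# Rough illustration only: compare serial vs. concurrent small-file creation
# on a CephFS mount. Paths, counts, and sizes below are arbitrary assumptions.
import os
import time
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/mnt/cephfs/metadata-test"   # hypothetical CephFS mount point
NFILES = 500                          # number of small files to create
PAYLOAD = b"x" * 4096                 # 4 KiB per file; metadata cost dominates

def write_one(path):
    # Each create/write/close involves metadata operations the client must
    # wait on; with high MDS latency that wait dominates the runtime.
    with open(path, "wb") as f:
        f.write(PAYLOAD)

def run(tag, workers):
    d = os.path.join(MOUNT, tag)
    os.makedirs(d, exist_ok=True)
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Serial case (workers=1) mimics cp/tar: one metadata op at a time.
        pool.map(write_one, (os.path.join(d, f"f{i}") for i in range(NFILES)))
    elapsed = time.time() - start
    print(f"{tag}: {workers} worker(s), {NFILES} files in {elapsed:.1f}s "
          f"({NFILES / elapsed:.0f} creates/s)")

if __name__ == "__main__":
    run("serial", workers=1)
    run("parallel", workers=8)

If the two runs come out about the same, the MDS probably isn't your
bottleneck; if the parallel run is several times faster per aggregate, that
points back at metadata latency rather than raw data throughput.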