On Mon, Feb 26, 2018 at 6:25 PM, Brian Woods <bpwoods@xxxxxxxxx> wrote:
> I have a small test cluster (just two nodes) and after rebuilding it
> several times I found my latest configuration that SHOULD be the fastest
> is by far the slowest (per thread).
>
> I have around 10 spindles that I have an erasure-coded CephFS on. When I
> installed several SSDs and recreated it with the metadata and the write
> cache on SSD, my performance plummeted from about 10-20MBps to 2-3MBps,
> but only per thread… I did a rados benchmark and the SSD metadata and
> write-cache pools can sustain anywhere from 50 to 150MBps without issue.
>
> And, if I spool up multiple copies to the FS, each copy adds to that
> throughput without much of a hit. In fact I can go up to about 8 copies
> (about 16MBps) before they start slowing down at all. Even while I have
> several threads actively writing, I still benchmark around 25MBps.

If a CephFS system is experiencing substantial latency doing metadata
operations, then you may find that the overall data throughput is much
worse with a single writer process than with several. That would be
because typical workloads like "cp" or "tar" are entirely serial, and will
wait for one metadata operation (such as creating a file) to complete
before doing any more work.

In your case, I would suspect that your metadata latency got a lot worse
when you switched from dedicating your SSDs to metadata, to sharing your
SSDs between metadata and a cache tier. This is one of many situations in
which configuring a cache tier can make your performance worse rather than
better. Cache tiers generally only make sense if you know you have a "hot"
subset of a larger dataset, and that subset fits in your cache tier.

> Any ideas why single-threaded performance would take a hit like this?
> Almost everything is running on a single node (just a few OSDs on another
> node) and I have plenty of RAM (96GB) and CPU (8 Xeon cores).

In general, performance testing you do on one or two nodes is unlikely to
translate well to what would happen on a more typically sized cluster. If
building a "mini" Ceph cluster for performance testing, I'd suggest at the
very minimum that you start with three servers for OSDs, a separate one
for the MDS, and another separate one for the client. That way, you have
network hops in all the right places, rather than having the two-node
situation where some arbitrary 50% of messages are not actually traversing
a network, and where clients are competing for CPU time with servers.

John
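
For anyone who wants to reproduce the single-writer vs. multi-writer effect
described above, here is a rough sketch (not from the original thread; the
mount point, file count, and file size are made-up assumptions) that times
creating a batch of small files serially and then with several concurrent
workers. On a mount where each file create has to round-trip to the MDS
with noticeable latency, the concurrent run should show much higher
aggregate create throughput, matching the behaviour Brian reported.

#!/usr/bin/env python3
# Rough illustration only: compare serial vs. concurrent small-file creation
# on a CephFS mount. Paths, counts, and sizes below are arbitrary assumptions.
import os
import time
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/mnt/cephfs/metadata-test"   # hypothetical CephFS mount point
NFILES = 500                          # number of small files to create
PAYLOAD = b"x" * 4096                 # 4 KiB per file; metadata cost dominates

def write_one(path):
    # Each create/write/close involves metadata operations the client must
    # wait on; with high MDS latency that wait dominates the runtime.
    with open(path, "wb") as f:
        f.write(PAYLOAD)

def run(tag, workers):
    d = os.path.join(MOUNT, tag)
    os.makedirs(d, exist_ok=True)
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Serial case (workers=1) mimics cp/tar: one metadata op at a time.
        pool.map(write_one, (os.path.join(d, f"f{i}") for i in range(NFILES)))
    elapsed = time.time() - start
    print(f"{tag}: {workers} worker(s), {NFILES} files in {elapsed:.1f}s "
          f"({NFILES / elapsed:.0f} creates/s)")

if __name__ == "__main__":
    run("serial", workers=1)
    run("parallel", workers=8)

If the two runs come out about the same, the MDS probably isn't your
bottleneck; if the parallel run is several times faster per aggregate, that
points back at metadata latency rather than raw data throughput.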