Re: anyone using CephFS for HPC?


On 06/14/2015 06:53 PM, Nigel Williams wrote:
> On 12/06/2015 3:41 PM, Gregory Farnum wrote:
>> ... and the test evaluation was on repurposed Lustre
>> hardware so it was a bit odd, ...

> Agree, it was old (at least by now) DDN kit (SFA10K?) and not ideally
> suited for Ceph (really high OSD per host ratio).

FWIW, I did most of the performance work on the Ceph side for that paper; let me know if you are interested in any of the details. It was definitely not ideal, though in the end I think we did relatively well. Ultimately the lack of SSD journals hurt us: we hit the IB limit to the SFA10K long before we hit the disk limits, and we were topping out at about 6-8 GB/s for sequential reads when we should have been able to hit 12 GB/s. We have also seen some cases where filestore doesn't do large reads as quickly as you'd think (newstore seems to do better).
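
For anyone poking at a similar setup: with filestore, moving the OSD journals onto SSD/NVMe is the obvious first change, and it's only a couple of ceph.conf options per OSD. A minimal sketch (the partition labels are made up, not from our actual setup):

    [osd]
        osd journal size = 10240        # 10 GB filestore journal
    [osd.0]
        osd journal = /dev/disk/by-partlabel/ceph-journal-0   # partition on SSD/NVMe
    [osd.1]
        osd journal = /dev/disk/by-partlabel/ceph-journal-1

The idea is to keep the journal's double write off the data disks (and, in a setup like ours, off the IB link to the backend array).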

The big things that took a lot of effort to figure out during this testing were:

- General strangeness with cache mirroring on the SFA10K *really* hurting performance with Ceph (not sure why it didn't hurt Lustre as badly).
- Back around kernel 3.6 there were some nasty VM compaction issues that caused major performance problems (see the knobs sketched below).
- Somewhat strange mdtest results. Probably just issues in the MDS back then.
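
On the kernel ~3.6 VM compaction problems: the following isn't from the paper, just the sort of workaround that was common at the time (and the sysfs path differs on RHEL-derived kernels), but it gives an idea of the knobs involved:

    # stop transparent hugepage defrag from triggering compaction stalls
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # keep more free memory around so allocations don't block on compaction
    sysctl -w vm.min_free_kbytes=524288    # value is illustrative; size it to the box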


>> Sage's thesis or some of the earlier papers will be happy to tell you
>> all the ways in which Ceph > Lustre, of course, since creating a
>> successor is how the project started. ;)
>> -Greg

> Thanks Greg, yes those original documents have been well-thumbed; but I
> was hoping someone had done a more recent comparison given the
> significant improvements over the last couple of Ceph releases.

> My superficial poking about in Lustre doesn't reveal to me anything
> particularly compelling in the design or typical deployments that would
> magically yield higher performance than an equally well-tuned Ceph
> cluster. Blair Bethwaite commented that Lustre client-side write caching
> might be more effective than CephFS's at the moment.

I suspect the big things are:

- Lustre doesn't do replication in software (it relies on hardware RAID for redundancy).
- Lustre may have more of its tuning issues worked out.
- Lustre doesn't (last I checked) do full data journaling (see the back-of-envelope below).
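
To put a rough back-of-envelope on the replication and data-journaling points (numbers illustrative, not measurements): with filestore and co-located journals, every client byte is written twice (journal, then data), so raw disk write bandwidth is effectively halved; add 3x replication and aggregate client write throughput is roughly total disk bandwidth divided by 6. Lustre writing full stripes through hardware RAID6 pays only the parity overhead, which goes a long way toward explaining the sequential-write gap.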

Frankly, a well-tuned Lustre configuration is going to do pretty well for large sequential IO; that's pretty much its bread and butter. At least historically it hasn't been great at small random IO, and most Lustre setups rely on some kind of STONITH failover for node outages, which is obviously not nearly as graceful as what Ceph does.




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


