Re: grid data placement take 2

On Friday, February 15, 2013 at 9:43 AM, Dimitri Maziuk wrote:
>  
> Since nobody's offered any hints on how to get a cluster with no data
> out of "HEALTH_WARN 4 pgs incomplete" state I wiped it all out and
> rebuilt it from "mkcephfs -a -c /etc/ceph/ceph.conf".
>  
> The setup (to recap) is host-a/mon-a/mds-a/osd-0 writes a bunch of data
> to cephfs. Once done, host-x/osd-1, host-y/osd-2, and host-z/osd-3 run
> several instances of an application (1 per core) that searches through
> the data, mmap()'ing it read-only in 2GB chunks. Cephfs is mounted
> read-only on hosts x..z; the beefier one of them also runs mds-b.
>  
> In the baseline test (data in /var/tmp on x, y, & z) I had jobs complete
> at approx. 125/hour.
>  
> In the first test: ceph left to its own devices, everything at defaults,
> I got to about 6 jobs/hr before concluding it's not getting any faster &
> killing the batch.
>  
> In this test I changed the crush map and pool size & min_size for each
> pool before putting any data on the cluster: I made sure each host has
> equal weight (1) and uniform placement algorithm, and there are 4
> replicas of everything.
>  
> I had 132 jobs complete in ~15 hours -- so we're getting closer to 10
> jobs/hr, but still nowhere near usable speed.
>  
> I suppose I could believe that cephfs in kernel 3.0 is so bad that
> mmap -> cephfs -> osd -> ext4 -> hdd
> is an order of magnitude slower than
> mmap -> ext4 -> hdd
> on read-only mmap...
>  
> Is that what's going on, or am I doing something else wrong? Or is it
> the kind of performance I should expect at this stage?
>  
> Anyone?
That's a lot more of a slowdown than I'd expect to see, but there isn't much of a hint here about where the slowdown is actually happening. I don't recall precisely what's happening in the kernel client when you do a bunch of mmaps -- Sage? Does it require a network round trip when you do that, or will it cache and pre-read appropriately?
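
For concreteness, the access pattern being described is roughly the following (a minimal C sketch, assuming one large file per job and the 2 GB window size mentioned above; the path and the touch-every-page loop are placeholders, not the actual application). Compile with -D_FILE_OFFSET_BITS=64 on 32-bit hosts.

    /* Sketch of the described workload: read-only mmap of a large file
     * in 2 GB windows, faulting in every page.  Placeholder path. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define CHUNK (2UL * 1024 * 1024 * 1024)    /* 2 GB window */

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/ceph/data.bin";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        long pagesize = sysconf(_SC_PAGESIZE);
        unsigned long checksum = 0;

        for (off_t off = 0; off < st.st_size; off += CHUNK) {
            size_t len = (st.st_size - off < (off_t)CHUNK) ?
                         (size_t)(st.st_size - off) : (size_t)CHUNK;
            unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            /* touch one byte per page so every page is actually read */
            for (size_t i = 0; i < len; i += pagesize)
                checksum += p[i];
            munmap(p, len);
        }
        close(fd);
        printf("checksum %lu\n", checksum);
        return 0;
    }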

But more generally, you'll need to describe your workload pattern in a bit more detail and run some benchmarks at the lower layers of the stack to see what kind of bandwidth is available to begin with. Look at the rados bench stuff to get some data onto disk, then run a bunch of simultaneous read benchmarks to see how fast your OSDs can serve data under a fairly reasonable streaming workload; check out smalliobenchrados to generate IO that more closely mimics your application, etc.
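
As a rough illustration of the kind of lower-layer check I mean, here's a small sketch against the librados C API that seeds a few objects and then times reading them back, taking cephfs and mmap out of the picture entirely. The pool name, object count, and object size are placeholder choices -- point it at a pool that actually exists. Build with gcc -lrados (plus -lrt on older glibc for clock_gettime).

    /* Write NOBJ objects of OBJSIZE bytes into POOL, then time reading
     * them back sequentially and report the throughput. */
    #include <rados/librados.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NOBJ    16
    #define OBJSIZE (4 * 1024 * 1024)
    #define POOL    "data"              /* placeholder pool name */

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t io;
        char *buf = malloc(OBJSIZE);
        char oid[32];
        int i;
        memset(buf, 'x', OBJSIZE);

        if (rados_create(&cluster, NULL) < 0 ||
            rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
            rados_connect(cluster) < 0) {
            fprintf(stderr, "failed to connect to cluster\n");
            return 1;
        }
        if (rados_ioctx_create(cluster, POOL, &io) < 0) {
            fprintf(stderr, "failed to open pool %s\n", POOL);
            return 1;
        }

        for (i = 0; i < NOBJ; i++) {            /* seed some objects */
            snprintf(oid, sizeof(oid), "bench_obj_%d", i);
            if (rados_write(io, oid, buf, OBJSIZE, 0) < 0)
                fprintf(stderr, "write of %s failed\n", oid);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NOBJ; i++) {            /* time the reads back */
            snprintf(oid, sizeof(oid), "bench_obj_%d", i);
            if (rados_read(io, oid, buf, OBJSIZE, 0) < 0)
                fprintf(stderr, "read of %s failed\n", oid);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("read %d x %d MB in %.2fs (%.1f MB/s)\n",
               NOBJ, OBJSIZE / (1024 * 1024), secs,
               NOBJ * (OBJSIZE / (1024.0 * 1024)) / secs);

        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        free(buf);
        return 0;
    }

Run a few copies of that in parallel from hosts x..z and compare the aggregate throughput against what the batch jobs need; if the numbers already look bad there, the problem is below the filesystem layer rather than in the cephfs/mmap path.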
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


