Re: grid data placement take 2

Sam Lang <sam.lang@xxxxxxxxxxx> · Fri, 15 Feb 2013 14:12:05 -0600

On Fri, Feb 15, 2013 at 12:46 PM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
> On 02/15/2013 12:00 PM, Gregory Farnum wrote:
>
>> That's a lot more of a slowdown than I'd expect to see, but there
>> isn't much hint about where the slow-down is actually happening. I don't
>> recall precisely what's happening in the kernel client when you do a
>> bunch of mmaps -- Sage? Does it require a network round-trip when you do
>> that or will it cache and pre-read appropriately?
>
>>
>> But more generally you'll need to describe your workload pattern a
>> bit more, and do some benchmarks at lower layers of the stack to see
>> what kind of bandwidth is available to begin with. Look at the rados
>> bench stuff to get some data on disk and then do a bunch of
>> simultaneous read benchmarks to see how fast your OSDs can serve data
>> up under a fairly reasonable streaming workload; check out
>> smalliobenchrados to do some IO that more closely mimics your
>> application, etc.
>
> This *was* the benchmark. Each host has the data on local hard drive,
> that is the kind of bandwidth that's available to begin with. All other
> things are equal,
>  - baseline test mmaps from /dev/sdb1 mounted as ext4,
>  - ceph test mmaps from /dev/sdb1 mounted as cephfs -> (hopefully
> local) osd.

It's only going to be local 1/3 of the time (on average).  What does
your raw network bandwidth look like?

It looks like ceph is using the generic readahead mechanisms in the
kernel, with a default readahead size of 8MB.  That seems like it
should be sufficiently large to avoid all the network roundtrips,
but you can try to increase the readahead with the rasize mount option
(which should be a multiple of 4096).
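
As a rough sketch of the rasize suggestion above: the small helper below
rounds a requested readahead size up to a multiple of 4096 and builds the
option string to pass to mount. The function names are made up for
illustration, the 64 MB figure is just a value to experiment with, and I'm
assuming rasize is given in bytes.

```python
PAGE = 4096  # rasize should be a multiple of 4096, per the note above


def round_rasize(nbytes: int, page: int = PAGE) -> int:
    """Round a requested readahead size up to the next multiple of `page`."""
    if nbytes <= 0:
        raise ValueError("readahead size must be positive")
    return -(-nbytes // page) * page  # ceiling division, then scale back up


def mount_options(rasize_bytes: int, read_only: bool = True) -> str:
    """Build a comma-separated option string for `mount -o` (hypothetical helper)."""
    opts = ["ro"] if read_only else []
    opts.append(f"rasize={round_rasize(rasize_bytes)}")
    return ",".join(opts)


if __name__ == "__main__":
    # 64 MB is an arbitrary size to try, not a recommendation.
    print(mount_options(64 * 1024 * 1024))
```

The printed string would go after -o in the usual "mount -t ceph" command
line, with your own monitor address and mount point filled in.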

-sam

> The slowdown is an order of magnitude and change.
>
> As for workload pattern:
>
> Last I looked at mmap (been a while) it'd place data in shared memory
> and COW it if needed. Since the application is not writing, that's never
> needed -- so cephfs's mounted read-only. There should not be any caches
> to invalidate or resync.
>
> The application is (was last I looked) able to search in only 2GB at a
> time so the search space is split into 2GB files. Each of the worker
> hosts has 4GB RAM/core so each running instance should be able to get
> through at least one 2GB chunk without thrashing (i.e. sequential read
> of 2GB file and no disk i/o until it's done with that and needs to read
> the next one). (In fact, the fastest way to fly is throw enough RAM at
> the host to have all of the search data in RAM all the time.)
>
> I/o-wise the worst part is the start of the batch where all instances
> start reading the 1st file at the 1st byte. After that it starts to
> spread out as they're going through the search space at different rates
> (due to diffs in their search targets). The good news, if you call it
> that, is that ceph didn't keel over during that initial spike. (But
> that's only 16 parallel jobs; our very small cluster can do only 62
> ATM.) The bad news is 3 jobs/hour sounds like what I can probably get by
> placing the search data on nfs and having all 16 jobs hit the single nfs
> server.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>