Re: grid data placement take 2

Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> · Fri, 15 Feb 2013 12:46:03 -0600

On 02/15/2013 12:00 PM, Gregory Farnum wrote:

> That's a lot more of a slowdown than I'd expect to see, but there
> isn't much hint about where the slow-down is actually happening. I
> don't recall precisely what's happening in the kernel client when you
> do a bunch of mmaps -- Sage? Does it require a network round-trip when
> you do that or will it cache and pre-read appropriately?
>
> But more generally you'll need to describe your workload pattern a
> bit more, and do some benchmarks at lower layers of the stack to see
> what kind of bandwidth is available to begin with. Look at the rados
> bench stuff to get some data on disk and then do a bunch of
> simultaneous read benchmarks to see how fast your OSDs can serve data
> up under a fairly reasonable streaming workload; check out
> smalliobenchrados to do some IO that more closely mimics your
> application, etc.

This *was* the benchmark. Each host has the data on a local hard drive;
that is the kind of bandwidth that's available to begin with. All other
things being equal,
 - the baseline test mmaps from /dev/sdb1 mounted as ext4,
 - the ceph test mmaps from /dev/sdb1 mounted as cephfs -> (hopefully
local) osd.
The slowdown is an order of magnitude and change.
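
For anyone who wants to reproduce the comparison without the real
application, here is a rough sketch of the kind of measurement described
above: map a file read-only and touch one byte per page so that every
page has to be faulted in. The function name and the one-byte-per-page
access pattern are assumptions made for illustration; the actual search
application will touch the data differently.

```python
import mmap
import os
import time


def time_mmap_scan(path, stride=mmap.PAGESIZE):
    """Map `path` read-only, touch one byte per `stride`, return (bytes, seconds)."""
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0.0  # mmap cannot map an empty file
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = time.perf_counter()
            checksum = 0
            for offset in range(0, size, stride):
                # Indexing the mapping forces the page to be faulted in.
                checksum ^= m[offset]
            elapsed = time.perf_counter() - start
    return size, elapsed
```

Run it against the same file on the ext4 mount and on the cephfs mount
and compare the two timings. On Linux the page cache should be dropped
between runs (as root, write 3 to /proc/sys/vm/drop_caches), otherwise
the second run mostly measures RAM.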

As for workload pattern:

Last I looked at mmap (been a while) it'd place data in shared memory
and COW it if needed. Since the application is not writing, that's never
needed -- so cephfs is mounted read-only. There should not be any caches
to invalidate or resync.

The application is (was last I looked) able to search in only 2GB at a
time so the search space is split into 2GB files. Each of the worker
hosts has 4GB RAM/core so each running instance should be able to get
through at least one 2GB chunk without thrashing (i.e. sequential read
of 2GB file and no disk i/o until it's done with that and needs to read
the next one). (In fact, the fastest way to fly is throw enough RAM at
the host to have all of the search data in RAM all the time.)

I/o-wise the worst part is the start of the batch where all instances
start reading the 1st file at the 1st byte. After that it starts to
spread out as they're going through the search space at different rates
(due to diffs in their search targets). The good news, if you call it
that, is that ceph didn't keel over during that initial spike. (But
that's only 16 parallel jobs; our very small cluster can do only 62
ATM.) The bad news is 3 jobs/hour sounds like what I can probably get by
placing the search data on nfs and having all 16 jobs hit the single nfs
server.
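
The start-of-batch spike described above (every instance reading the 1st
file from the 1st byte at the same moment) can be approximated with a
small sketch like the one below. It is only a stand-in: it uses threads
and plain unbuffered read() calls rather than separate processes doing
mmap, and the function names and the default of 16 workers are
assumptions chosen to match the numbers in this message.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_from_start(path, chunk=1 << 20):
    """Read `path` sequentially from byte 0; return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    # buffering=0 gives a raw file object, so each read() is one syscall.
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            total += len(block)
    return total, time.perf_counter() - start


def simulate_batch_start(path, workers=16):
    """Start `workers` readers on the same file at once; return their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_from_start, [path] * workers))
```

Comparing the per-reader times from simulate_batch_start() on the
cephfs mount against a single read_from_start() call gives a rough idea
of how much the simultaneous start costs, separately from the steady
state where the jobs have spread out.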

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com