Re: OOM's on the Ceph client machine

On Tue, Oct 12, 2010 at 5:31 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> Hi there,
>
> I've recently been playing with Ceph on an evaluation basis, and found
> that I was able to fairly reliably induce an OOM kill on my Ceph
> client machine by using FFSB with the following configuration file (see
> attached, below).
Does this mean you're using cfuse rather than the kernel client?
FUSE performance in general is fairly disappointing, and on top of that
our cfuse is probably not as fast as the kernel client, though I don't
think it should be *that* unhappy in most environments.
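If it's not obvious which client a given mountpoint is using, something
like this quick check against /proc/mounts will tell you. (A sketch
only: the kernel client shows up with an fstype of "ceph", while a FUSE
mount shows up as "fuse" or a "fuse."-prefixed variant -- the exact
string cfuse uses is an assumption on my part, and this will also match
any unrelated FUSE filesystems you have mounted.)

#!/usr/bin/env python
# Report which Ceph client each mountpoint is using, based on the
# fstype column of /proc/mounts.  Note this will also list any
# unrelated FUSE filesystems that happen to be mounted.

def ceph_like_mounts(path="/proc/mounts"):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            mountpoint, fstype = fields[1], fields[2]
            if fstype == "ceph" or fstype.startswith("fuse"):
                yield mountpoint, fstype

if __name__ == "__main__":
    for mountpoint, fstype in ceph_like_mounts():
        kind = "kernel client" if fstype == "ceph" else "FUSE"
        print("%s: fstype %s (%s)" % (mountpoint, fstype, kind))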

> I am using Ceph v0.21.3 plus a few commits that were on the testing
> branch as of late September (commit ID 569d96b).  The Ceph cluster
> contains 10 commodity servers with 5 disks configured for Ceph object
> storage on each server (plus a separate spindle for the journal files),
> so there are 5 instances of cosd on each OSD server.  The disks are
> formatted using ext4 in no-journal mode.  I am using 3 servers for the
> MDS and monitoring daemons, with both colocated on these 3 servers.
> The machines all have gigabit ethernet cards.
So you have 5 journals running on one spindle? That could be the cause
of your slightly low sequential write performance: in the current
default configuration writes have to go to the journal before going to
the main disk, and with multiple OSDs sharing one journal spindle they
could be getting in each other's way (rough numbers on this below).
Also, how much memory do you have on these machines?
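To put rough numbers on the journal point, here's a back-of-envelope
sketch. Every figure in it (per-spindle bandwidth, replica count, how
badly interleaving hurts) is an assumption rather than a measurement
from your cluster, so treat it as an illustration of where the shared
journal spindle starts to bite, not a prediction:

# Back-of-envelope only: every number below is an assumption, not a
# measurement from this cluster.
servers = 10
disk_seq_mb_s = 90.0   # assumed sequential MB/s of one journal spindle
seeky_mb_s = 30.0      # wild guess at what is left once all five cosd
                       # journal streams interleave on that one spindle
replicas = 2           # assumed replica count

# Every write is journaled before it reaches the data disk, so each
# server's write rate is capped by its single journal spindle, and
# replication multiplies the data written per client byte.
best_case = servers * disk_seq_mb_s / replicas
seeky_case = servers * seeky_mb_s / replicas

print("write ceiling if the journals stay sequential: ~%.0f MB/s" % best_case)
print("write ceiling if the journal spindle is seeking: ~%.0f MB/s" % seeky_case)
# A single 1GbE client tops out around 115-118 MB/s anyway, so it's the
# seeking case where the shared journal spindle starts to show up.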

> P.S.  In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran.  The results are
> fairly stable, except that the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's.  I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.
>
> Do these results look reasonable to you?  I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seem surprisingly slow.
I'm not familiar with FFSB and there doesn't seem to be any
easily-accessible documentation; can you tell us a little more about
how it works? For instance, how are the test files created (are they
written out first and then read back for the read tests? Does the
random write create the files as it goes, or are they pre-existing
and then overwritten)?
A few thoughts/wild guesses:
I'm not sure exactly what the limit is, but 114MB/s reads are close to
what you can get over a 1Gb link (there's some quick arithmetic on this
below).
If single-threaded FFSB means there's only one request in flight at a
time, there may be a latency issue causing those 35MB/s reads. The
kernel client ought to be prefetching, but maybe it's not doing so
properly, and I don't recall how much prefetching cfuse is actually
capable of. Sage can say more on this.
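For the 1Gb point, the arithmetic goes roughly like this. The framing
figures are the standard Ethernet/IPv4/TCP header sizes; the 1500-byte
MTU (no jumbo frames) is an assumption about your network:

# Standard Ethernet/IPv4/TCP framing numbers; the 1500-byte MTU (no
# jumbo frames) is an assumption about this network.
link_bytes_per_s = 1000 * 1000 * 1000 / 8.0   # 125 MB/s raw on 1GbE

mtu = 1500
tcp_ip_headers = 20 + 20         # TCP + IPv4 headers, no options
eth_overhead = 14 + 4 + 8 + 12   # header + FCS + preamble + interframe gap

payload_per_frame = mtu - tcp_ip_headers   # 1460 data bytes per frame
wire_per_frame = mtu + eth_overhead        # 1538 bytes on the wire

goodput = link_bytes_per_s * payload_per_frame / wire_per_frame
print("theoretical TCP goodput on 1GbE: ~%.1f MB/s" % (goodput / 1e6))
# Comes out around 118 MB/s before Ceph's own message headers, so
# 114MB/s reads are within a few percent of what the wire can carry.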

> (Also, given that we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
Err, yes. Extremely odd. Glancing over cfuse, this looks like it's
working properly, but if you confirm that's what you're using I'll
trace it.
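In the meantime, a quick client-side sanity check is to time a
write-plus-fsync of a modest file on the Ceph mount and see whether the
implied bandwidth is physically possible. A rough sketch along those
lines (the mount path in it is just a placeholder, not something from
your setup):

#!/usr/bin/env python
# Times a write-plus-fsync on the Ceph mount.  If the implied bandwidth
# is far above what a 1GbE link can carry, fsync almost certainly
# returned before the data was on stable storage.
import os, time

def timed_fsync_write(path, size_mb=256, block=1024 * 1024):
    buf = b"x" * block
    start = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(size_mb):
            os.write(fd, buf)
        os.fsync(fd)   # fsync time is included, as FFSB does
    finally:
        os.close(fd)
    return size_mb / (time.time() - start)

if __name__ == "__main__":
    # Placeholder path -- point it at wherever the Ceph fs is mounted.
    print("write+fsync bandwidth: %.1f MB/s"
          % timed_fsync_write("/mnt/ceph/fsync-test.tmp"))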
-Greg