Re: Why Ceph's aggregate write throughput does not scale with the number of osd nodes

On Wed, Nov 23, 2011 at 3:44 PM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
> I am sure neither the network nor the disk I/O bandwidth is that bad. I
> ran the same test against HDFS on the same machines and the throughput
> was much higher. I just used "ceph osd tell \* bench", and six of the OSD
> nodes came in at 40~50 MB/s, but two of the OSDs are a bit slower, at
> around 20 MB/s. The OSD performance section of the link you gave me says
> "The performance of a single OSD can bring down your whole cluster!", but
> I would think two slower OSDs should only affect about 1/4 of the overall
> throughput. Yet the test results showed no scalability at all. Two slower
> disks would also hurt HDFS throughput, but HDFS still scaled. So I don't
> think bandwidth is the problem.

Unfortunately, that is the problem. Don't forget that Ceph writes
everything twice, and the replication is synchronous: roughly 50% of your
writes have to wait on one of those 20 MB/s nodes to finish committing the
data to disk. (I believe HDFS uses async replication.) And because of
journaling, all data gets written twice on each OSD as well. Reads go much
faster because they only hit the slow OSDs for about 25% of operations,
and are often served out of memory.
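As rough back-of-the-envelope math, assuming 8 OSDs, 2x replication, and a
reasonably uniform CRUSH distribution: the chance that a given write avoids
both slow OSDs is about (6/8) * (5/7) ~= 0.54, so close to half of all
writes block on a 20 MB/s disk -- and with the journal on the same disk,
each of those OSDs can only sustain roughly 10 MB/s of client data. That is
enough to flatten the aggregate numbers no matter how many fast nodes you
add.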
We haven't yet gotten around to working out ways of automatically
handling this situation, but there are a few things you can do:
1) Remove the bad nodes from your cluster. Your performance should go
up by a lot.
2) Change the weight of the slow nodes so they get a smaller share of
writes than the other nodes (the default weight is 1); see the example
after the link below.
http://ceph.newdream.net/wiki/Monitor_commands#reweight
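Off the top of my head it would look something like this -- double-check
the exact syntax against that wiki page, and the OSD id and weight here
are just placeholders:

  ceph osd reweight 2 0.5   # weight is between 0 and 1; default is 1
  ceph osd dump -o -        # verify the new weight shows up in the map

You would pick the weight based on how much slower the disk actually is;
a 20 MB/s disk sitting next to 40-50 MB/s disks suggests something around
0.4-0.5.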

>>> Actually, when I ls the Ceph file system, it is very slow to return
>>> the result (there were only eight files, not tons of files). It seemed
>>> like all the clients were competing for some resource, but the mon and
>>> mds resource usage was almost zero. That's reasonable because I only
>>> have 8 OSDs and several clients. So this seems quite weird.
>> Well, ls is a slower operation on Ceph than on many other systems due
>> to the way it handles client capabilities. But that issue is separate
>> from the one you're seeing above.
> The problem is that when I am not writing, the ls operation is fast.
> While I was writing two files, ls became quite slow. ls should not be
> this slow even while I am writing, so I guess there is some resource
> contention going on?
When you try to ls a directory, the MetaData Server needs to get
current statistics on every file in the directory. If other nodes are
currently writing to the directory, that involves revoking their
"capabilities." Generally this should be pretty fast, but it is
sometimes still noticeable.
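If you want to put a number on it, you could time the listing from one
client while another client is writing into the same directory (the mount
point here is just a placeholder for wherever you have the client mounted):

  time ls /mnt/ceph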
If it's more than a second or so, we probably broke things so that
nodes are trying to flush out all their buffered writes before
releasing caps. (We've done this before, although I thought it was
fixed in current code.) What versions are you running?
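If you're not sure, something along these lines should tell you (tool and
binary names have shifted a bit between releases, so adjust as needed):

  ceph -v    # version of the ceph tools/daemons on each node
  uname -r   # kernel version, if you're using the kernel client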
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

