On Thu, Nov 24, 2011 at 11:32 AM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
>> Unfortunately, that is the problem. Don't forget that Ceph is writing
>> everything twice, and it's synchronous replication. 50% of your writes
>> have to wait on those 20MB/s nodes to finish saving the data to disk.
>> (I believe HDFS uses async replication.) Plus, because it's journaling,
>> all data gets written twice on each OSD. Reads go much faster because
>> they only hit the slow OSDs for 25% of operations, and are often
>> served out of memory.
>>
>> We haven't yet gotten around to working out ways of automatically
>> handling this situation, but there are a few things you can do:
>> 1) Remove the bad nodes from your cluster. Your performance should go
>> up by a lot.
>> 2) Change the weight of the slow nodes so they get a lower percentage
>> of writes compared to the other nodes. (The default is 1.)
>> http://ceph.newdream.net/wiki/Monitor_commands#reweight
>
> It is not 50%. It is still only 25% of the writes that need to wait for
> the 2 slow nodes, no matter how many replicas it has to write or how
> much journaling it does. And even though two nodes are slower than the
> others, it should still scale up to some extent. But I didn't see any
> scale-up; many times it was even slower, I guess due to some resource
> contention. (HDFS uses sync replication too; it uses a pipeline for
> writing the data.) I tested using large files (15G), so they are not
> served out of memory for reading.

By default, each write goes to a primary, and the primary then
replicates that write to a replica. The write is not considered
complete until it is data safe on every active replica. (You can
increase or decrease the number of replicas, but I haven't seen
anywhere that you did.) So 1/4 of your writes will have a slow node as
a primary, and 3/4 will have a fast node as a primary. Then 1/7 of your
slow-primary writes will also have a slow replica, and 6/7 will have a
fast replica (but you are still bound by the slow one either way). And
5/7 of your fast-primary writes will have a fast replica, but 2/7 will
have a slow replica. So you end up with 1/4 (slow primary) plus
3/4 * 2/7 = 6/28 (slow replica) of your writes being bounded by the
write speed on a slow node: 7/28 + 6/28 = 13/28, or about 46% of the
writes.

Like I said, journaling also hurts your performance here, since it
requires that every write go to disk twice. This is necessary to
provide the data consistency guarantees that Ceph provides, but it
doesn't do us any favors in performance (especially on old disks). You
have two nodes that write at 20MB/s and six that write at 45MB/s.
Journaling cuts each of those in half: two nodes at 10MB/s, six at
23MB/s. That's an aggregate 158MB/s of write bandwidth, but each write
goes to two OSDs, so you only get half of that, or 79MB/s of write
bandwidth. You are getting about the bandwidth I would expect out of a
slow and unbalanced cluster. (There's a rough sketch of this arithmetic
at the bottom of this mail.)

>> When you try and ls a directory, the MetaData Server needs to get
>> current statistics on every file in the directory. If there are other
>> nodes currently writing to the directory, that involves revoking
>> "capabilities." Generally this should be pretty fast but is sometimes
>> still noticeable.
>>
>> If it's more than a second or so, we probably broke things so that
>> nodes are trying to flush out all their buffered writes before
>> releasing caps. (We've done this before, although I thought it was
>> fixed in current code.) What versions are you running?
>
> ls is much longer than 1 second. I used v0.34 and v0.37. Both have
> this issue.
Are you using ceph-fuse or the kernel client? And if it's the kernel
client, what version?
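
For what it's worth, below is a quick back-of-the-envelope Python
script that reproduces the arithmetic above. It assumes 8 OSDs (2 slow
at 20MB/s, 6 fast at 45MB/s), 2x replication, and a replica chosen
uniformly at random from the remaining OSDs; that is only a rough
stand-in for real CRUSH placement, so treat it as a sanity check of
the numbers rather than a model of your cluster.

from fractions import Fraction

slow, fast = 2, 6
total = slow + fast

# Fraction of writes bounded by a slow node: either the primary is
# slow, or the primary is fast but the replica is slow.
p_slow_primary = Fraction(slow, total)
p_fast_primary_slow_replica = Fraction(fast, total) * Fraction(slow, total - 1)
p_slow_bound = p_slow_primary + p_fast_primary_slow_replica
print(f"writes bounded by a slow node: {p_slow_bound} = {float(p_slow_bound):.0%}")
# -> writes bounded by a slow node: 13/28 = 46%

# Aggregate write bandwidth: journaling halves each disk, and 2x
# replication halves the usable total again.
slow_mb, fast_mb = 20, 45
aggregate = (slow * slow_mb + fast * fast_mb) / 2  # journal + data on each OSD
usable = aggregate / 2                             # every write goes to two OSDs
print(f"aggregate after journaling: {aggregate:.0f} MB/s, usable: {usable:.0f} MB/s")
# -> aggregate after journaling: 155 MB/s, usable: 78 MB/s
# (the mail rounds 22.5MB/s up to 23MB/s per fast disk, giving 158 and 79)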