> Eventually. Right now Ceph is pretty config-heavy, unfortunately. So
> like I said, you can change the weights of the slow nodes — this will
> map less of the data to them, so they have fewer writes and reads. But
> even then you're going to be stuck pretty low due to your disk write
> bandwidth. On a modern disk that doesn't matter so much, since it can
> push two simultaneous 50MB/s+ streams (i.e., one 50MB/s journal and one
> 50MB/s data store), but even so we generally recommend separate
> spindles for the journal.

Yeah. I have tried putting the journal and the actual data writes on
separate disks, and I also used btrfs over multiple disks. That gave
good performance.
Greg, thanks a lot for your help. Enjoy Thanksgiving!!

Best,
Xiaofei

On Thu, Nov 24, 2011 at 1:03 PM, Gregory Farnum
<gregory.farnum@xxxxxxxxxxxxx> wrote:
> On Thu, Nov 24, 2011 at 12:31 PM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
>> This means no matter how many clients I have I should always get
>> around 79MB/s, right? This sounds reasonable. Thanks for the
>> explanation. So do you guys have plans to solve this "unbalanced
>> cluster" problem?
>
> Eventually. Right now Ceph is pretty config-heavy, unfortunately. So
> like I said, you can change the weights of the slow nodes — this will
> map less of the data to them, so they have fewer writes and reads. But
> even then you're going to be stuck pretty low due to your disk write
> bandwidth. On a modern disk that doesn't matter so much, since it can
> push two simultaneous 50MB/s+ streams (i.e., one 50MB/s journal and one
> 50MB/s data store), but even so we generally recommend separate
> spindles for the journal.
>
>> I guess several other distributed file systems have the same issue;
>> HDFS has it too. I guess the solution is to use hardware with stable
>> disk I/O bandwidth. If that can't be guaranteed, you need to detect
>> slow nodes and kick them out of the cluster.
>
> I'm not sure how other systems handle it — many don't, and HDFS might
> or might not. But as I look at the description of how HDFS replicates
> data, it looks to me like it matters less there.
> http://hadoop.apache.org/common/docs/current/hdfs_design.html#Robustness
> indicates that all data is written to a client-local file, then
> (asynchronously from the client's write) it is written out to the
> first DataNode, which copies the data to the second, which copies to
> the third, etc. But in this scheme the file doesn't need to be fully
> replicated to each DataNode for the client to close the file and
> consider it data-safe.
> These are appropriate data-consistency choices for HDFS, but Ceph is
> designed to satisfy a much more stringent set of consistency
> requirements. :)
>
>>> Are you using ceph-fuse or the kernel client? And if it's the kernel
>>> client, what version?
>
>> I am using ceph-fuse.
>
> Bah humbug. :( I've created http://tracker.newdream.net/issues/1752 to
> keep track of this issue; it'll be properly prioritized next week.
> -Greg

--
Xiaofei (Gregory) Du
Department of Computer Science
University of California, Santa Barbara
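
For readers who want to try the reweighting Greg describes, here is a
minimal sketch, assuming a reasonably recent Ceph CLI; osd.3 and the
weight 0.5 are made-up placeholders, and releases from this era may
instead require editing the CRUSH map by hand with crushtool:

    # Lower the CRUSH weight of a slow OSD so less data maps to it
    # (osd.3 and 0.5 are placeholder values):
    ceph osd crush reweight osd.3 0.5

    # On older releases, edit the CRUSH map directly instead:
    ceph osd getcrushmap -o crush.map
    crushtool -d crush.map -o crush.txt   # decompile to editable text
    # ... lower the weight of the slow OSD in crush.txt ...
    crushtool -c crush.txt -o crush.new   # recompile
    ceph osd setcrushmap -i crush.new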
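
Similarly, a minimal ceph.conf sketch for the separate journal spindle
Greg recommends (and Xiaofei tried); the device path and size below are
placeholders:

    [osd]
        ; journal on a partition of a different disk than the data store
        ; (placeholder path):
        osd journal = /dev/sdb1
        ; journal size in MB (relevant when the journal is a plain file
        ; rather than a block device):
        osd journal size = 1000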