Re: Why Ceph's aggregate write throughput does not scale with the number of osd nodes

> I just want to make sure: each client is writing to a different file?
Yes, each client is writing to a different file.

> My first question is if you're sure that your EC2 cluster actually
> provides more bandwidth than you're getting! We haven't tried Ceph on
> EC2 but I've heard elsewhere that the disk IO rates are pretty
> atrocious. (Don't forget to account for replication — that 60MB is
> actually 120MB of data going to disk, and 15MB/s per disk isn't out of
> the realm of possibility in many cloud setups.)
> If you don't have your own tools for measuring this, you can use the
> built-in one described at
> http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance:
> Watch the monitor output with ceph -w, then in a different window run:
> ceph osd tell \* bench
> Each OSD will write 1GB of data to disk and report it back (you'll see
> it in the monitor output). Make sure those numbers are significantly
> higher than the bandwidth you're seeing!
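For reference, here is the workflow as I understand it, in two
terminals (just a sketch of the wiki instructions above):

    # terminal 1: watch the monitor output
    ceph -w

    # terminal 2: ask every OSD to write 1GB locally and report the rate
    ceph osd tell \* bench
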
I am fairly sure neither the network nor the disk IO bandwidth is
that bad. I ran the same test with HDFS on the same machines and its
throughput was much higher. I also just ran ceph osd tell \* bench:
six of the OSDs reported 40~50 MB/s, and the other two were a bit
slower, around 20 MB/s. The OSD performance section of the page you
linked says "The performance of a single OSD can bring down your
whole cluster!", but two slower OSDs should only hurt about 1/4 of
the overall throughput, whereas my test showed no scalability at all.
The two slower disks would affect HDFS throughput just as much, yet
HDFS still scaled. So I don't think bandwidth is the problem.
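
To put rough numbers on it (a back-of-the-envelope sketch; I am
assuming 2x replication and taking 45 MB/s as a midpoint for the six
faster OSDs):

    # aggregate raw disk bandwidth from the bench numbers above
    echo $(( 6 * 45 + 2 * 20 ))        # 310 MB/s raw
    # with 2x replication every client byte hits disk twice
    echo $(( (6 * 45 + 2 * 20) / 2 ))  # ~155 MB/s ceiling for client writes

Even with the two slow OSDs, that ceiling is well above the ~60 MB/s
I am actually seeing.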

>> Actually when I ls the ceph file system. it was very slow to give me
>> the result (There was only eight files, not tons of files). It seemed
>> like all the clients were competing for some resource. But the mon and
>> mds resource usage almost equals to zero. That's reasonable because I
>> only have 8 osds and several clients. So this seems quite weird.
> Well, ls is a slower operation on Ceph than many systems due to the
> way it handles client capabilities. But this issue will be separate
> from the one you're seeing above.
The problem is that when nothing is writing, the ls operation is
fast. While I was writing two files, ls became quite slow. ls should
not be that slow just because files are being written, so I am
guessing there is some resource contention involved?
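
To make that concrete, here is roughly how I reproduce it (a sketch;
/mnt/ceph and the file size are just placeholders for my setup):

    # window 1: keep a streaming write going
    dd if=/dev/zero of=/mnt/ceph/bigfile bs=1M count=2048

    # window 2: time the metadata operation while the write runs
    time ls -l /mnt/ceph

With no writer running, the same ls returns almost immediately.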

Overall, do you know of any other causes for my problem? Or maybe
the old bugs still exist?

Best,
Xiaofei

On Wed, Nov 23, 2011 at 1:04 PM, Gregory Farnum
<gregory.farnum@xxxxxxxxxxxxx> wrote:
> On Wed, Nov 23, 2011 at 12:07 PM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
>> Hi Greg,
>>
>> I installed Ceph on 10 EC2 instances: one for the mon, one for the
>> mds, and the other eight are OSDs. I used IOZONE's distributed
>> measurement mode (multiple clients on different nodes creating the
>> same type of workload in parallel) to test the scalability of Ceph.
>> The problem is that as the number of writing clients increases, the
>> aggregate write throughput doesn't scale up. For example, when I had
>> only one client writing data to Ceph, the throughput was around
>> 60 MB/s. When I had 2 clients writing two different files to Ceph,
>> the throughput was still around 60 MB/s. Same with 4 clients and
>> 8 clients, and the clients were all on different nodes. But the
>> aggregate read throughput did scale up, which tells us that the data
>> was distributed across different OSDs; otherwise it wouldn't scale
>> with the number of reading clients.
> I just want to make sure: each client is writing to a different file?
>
> My first question is if you're sure that your EC2 cluster actually
> provides more bandwidth than you're getting! We haven't tried Ceph on
> EC2 but I've heard elsewhere that the disk IO rates are pretty
> atrocious. (Don't forget to account for replication — that 60MB is
> actually 120MB of data going to disk, and 15MB/s per disk isn't out of
> the realm of possibility in many cloud setups.)
> If you don't have your own tools for measuring this, you can use the
> built-in one described at
> http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance:
> Watch the monitor output with ceph -w, then in a different window run:
> ceph osd tell \* bench
> Each OSD will write 1GB of data to disk and report it back (you'll see
> it in the monitor output). Make sure those numbers are significantly
> higher than the bandwidth you're seeing!
>
> There are a number of other problems you could be running into, but
> actual disk IO limits (or possibly network troubles) is my suspicion
> since we generally see that one client can push 100+MB/s of data. :)
> On Wed, Nov 23, 2011 at 12:48 PM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
>> Actually when I ls the ceph file system. it was very slow to give me
>> the result (There was only eight files, not tons of files). It seemed
>> like all the clients were competing for some resource. But the mon and
>> mds resource usage almost equals to zero. That's reasonable because I
>> only have 8 osds and several clients. So this seems quite weird.
> Well, ls is a slower operation on Ceph than many systems due to the
> way it handles client capabilities. But this issue will be separate
> from the one you're seeing above.
>



-- 
Xiaofei (Gregory) Du
Department of Computer Science
University of California, Santa Barbara

