Re: Why Ceph's aggregate write throughput does not scale with the number of OSD nodes


 



On Wed, Nov 23, 2011 at 12:07 PM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
> Hi Greg,
>
> I installed Ceph on 10 EC2 instances: one for the mon, one for the mds,
> and the other eight are OSDs. I used IOZONE's distributed measurement
> mode (multiple clients on different nodes creating the same type of
> workload in parallel) to test the scalability of Ceph. The problem is
> that as the number of writing clients increased, the aggregate write
> throughput didn't scale up. For example, when I had only one client
> writing data to Ceph, the throughput was around 60 MB/s. When I had 2
> clients writing to two different files on Ceph, the throughput was
> still around 60 MB/s. Same with 4 clients and 8 clients, and the
> clients were all on different nodes. But the aggregate read throughput
> did scale up, which told us that the data was distributed across
> different OSDs; otherwise it wouldn't have scaled up with the number of
> reading clients.
I just want to make sure: each client is writing to a different file?

My first question is whether you're sure your EC2 cluster actually
provides more bandwidth than you're getting! We haven't tried Ceph on
EC2, but I've heard elsewhere that the disk IO rates are pretty
atrocious. (Don't forget to account for replication: with 2x
replication that 60 MB/s of client writes is really 120 MB/s going to
disk, or about 15 MB/s per disk across your 8 OSDs, and being capped
there isn't out of the realm of possibility in many cloud setups.)
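If it helps to check that arithmetic yourself, here it is as a tiny
shell snippet (it assumes the default 2x replication and the 8 OSDs you
described; adjust the numbers if your pool is configured differently):

  # rough per-OSD write load implied by the observed aggregate throughput
  AGGREGATE_MBS=60   # observed aggregate client write throughput (MB/s)
  REPLICAS=2         # assumed replication level (Ceph default)
  OSDS=8             # number of OSDs in your cluster
  echo "$(( AGGREGATE_MBS * REPLICAS / OSDS )) MB/s hitting each OSD disk"
  # prints: 15 MB/s hitting each OSD disk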
If you don't have your own tools for measuring this, you can use the
built-in one described at
http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance:
Watch the monitor output with ceph -w, then in a different window run:
ceph osd tell \* bench
Each OSD will write 1GB of data to disk and report it back (you'll see
it in the monitor output). Make sure those numbers are significantly
higher than the bandwidth you're seeing!
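Concretely, the two-terminal workflow looks like this (same commands as
above, just spelled out):

  # terminal 1: watch the monitor output; the bench results show up here
  ceph -w

  # terminal 2: tell every OSD to write 1GB to disk and report its rate
  ceph osd tell \* bench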

There are a number of other problems you could be running into, but my
suspicion is actual disk IO limits (or possibly network trouble), since
we generally see that a single client can push 100+ MB/s of data. :)
On Wed, Nov 23, 2011 at 12:48 PM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
> Actually, when I ls the Ceph file system it is very slow to give me
> the result (there are only eight files, not tons of files). It seems
> like all the clients are competing for some resource, but the mon and
> mds resource usage is almost zero. That's reasonable because I only
> have 8 OSDs and several clients. So this seems quite weird.
Well, ls is a slower operation on Ceph than on many other systems due
to the way it handles client capabilities. But that issue is separate
from the one you're seeing above.

