On Wed, 21 Sep 2011, Martin Mailand wrote:
> hi,
> I have a few questions about rbd performance. I have a small ceph
> installation: three osd servers, one monitor server, and one compute
> node which maps an rbd image to a block device; all servers are
> connected via a dedicated 1 Gb/s network.
> Each osd is capable of doing around 90 MB/s, tested with osd bench.
> But if I test the write speed of the rbd block device, the performance
> is quite poor.
>
> I do the test with
>   dd if=/dev/zero of=/dev/rbd0 bs=1M count=10000 oflag=direct
> and I get a throughput of around 25 MB/s.
> I used wireshark to graph the network throughput; the image is
>   http://tuxadero.com/multistorage/ceph.jpg
> As you can see, the throughput is not smooth.
>
> The graph for the test without oflag=direct is
>   http://tuxadero.com/multistorage/ceph2.jpg
> which is much better, but the compute node uses around 4-5G of its RAM
> as a writeback cache, which is not acceptable for my application.
>
> For comparison, the graph for an scp transfer:
>   http://tuxadero.com/multistorage/scp.jpg
>
> I read in the ceph docs that every "package" has to be committed to the
> disk on the osd before it is acknowledged to the client. Could you
> please explain what a package is? Probably not a TCP packet.

You probably mean "object": each write has to be on disk before it is
acknowledged.

> And on the mailing list there was a discussion about a writeback
> window. To my understanding it says how many bytes can be
> unacknowledged in transit. Is that right?

Right.

> How could I activate it?

So far it's only implemented in librbd (the userland implementation).

The problem is that your dd is doing synchronous writes to the block
device, which are synchronously written to the OSD. That means a lot of
time waiting around for the last write to complete before starting to
send the next one.

Normal hard disks have a cache that absorbs this. They acknowledge the
write immediately, and only promise that the data will actually be
durable when you issue a flush command later.

In librbd, we just added a write window that gives you similar
performance. We acknowledge writes immediately and do the write
asynchronously, with a cap on the amount of outstanding bytes. This
doesn't coalesce small writes into big ones like a real cache, but
usually the filesystem does most of that, so we should get similar
performance.

Anyway, the kernel implementation doesn't do that yet. It's on the todo
list for the next 2 weeks...

sage
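
The write-window behavior Sage describes above (acknowledge each write
immediately, cap the outstanding bytes, durability only at flush) can be
sketched in a few lines of Python. This is a toy model for illustration
only, not librbd code; the class name, the submit callback, and its
signature are all invented here:

    import threading

    class WriteWindow:
        """Toy model of a writeback window: acknowledge each write
        immediately, but allow at most window_bytes to be in flight
        (sent to the OSD but not yet committed to disk)."""

        def __init__(self, window_bytes, submit):
            # submit(nbytes, on_commit) starts an asynchronous write and
            # calls on_commit(nbytes) once the OSD has the data on disk.
            self.window_bytes = window_bytes
            self.submit = submit
            self.outstanding = 0
            self.cond = threading.Condition()

        def write(self, nbytes):
            # Block only while the window is full; otherwise return
            # ("ack") right away, like a disk's write cache would.
            with self.cond:
                while self.outstanding + nbytes > self.window_bytes:
                    self.cond.wait()
                self.outstanding += nbytes
            self.submit(nbytes, self._on_commit)

        def _on_commit(self, nbytes):
            # Called when the OSD reports the write is on disk.
            with self.cond:
                self.outstanding -= nbytes
                self.cond.notify_all()

        def flush(self):
            # Durability point: returns once nothing is outstanding.
            with self.cond:
                while self.outstanding:
                    self.cond.wait()

Note that there is no coalescing in this model: each write goes out
as-is, which is why Sage points out that a real cache would still merge
small writes better than the window does.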
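As for actually activating it: around the time of this thread the librbd
write window was controlled by a client-side ceph.conf option. The
option name and value below are a best guess for that era, not confirmed
by this post, so treat them as an assumption to verify against your
ceph version:

    [client]
        ; assumed option name: max unacknowledged bytes in flight
        rbd writeback window = 8192000

A larger window lets more writes overlap the network round trip; as with
a disk's write cache, the data is only guaranteed durable after a flush.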