On 04/18/2013 08:42 AM, Emmanuel Lacour wrote:
On Thu, Apr 18, 2013 at 08:25:50AM -0500, Mark Nelson wrote:
thanks for your answer!
It makes me a bit nervous that you are seeing such a discrepancy
between the drives. Were you expecting that one server would be so
much faster than the other? If a drive is starting to fail, your
results may be unpredictable.
the two servers are far from identical, unfortunately.
the first server has two SAS 15krpm drives in a RAID 1 (PERC 5/i),
the second has two SATA 7.2krpm drives in a RAID 1 (aacraid CERC)
Are you doing replication?
yes, I'm using the default replication, which is 2.
If one server has a slower drive, doing
2x replication, and you are using XFS (which tends to have some
performance overhead with ceph) that might get you down into this
range, given the 50MB/s number you posted above.
I don't understand why I can only send data at 15MB/s when it should be
written to two devices that can each do 50MB/s :(
So Ceph pseudo-randomly distributes data to different OSDs, which means
that you are more or less limited by the slowest OSD in your system, i.e.
if one node can only process X objects per second, outstanding
operations will slowly back up on it until you max out the number of
outstanding operations that are allowed and the other OSDs get starved
while the slow one tries to catch up.
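To put a made-up number on it: if one OSD can do, say, 100 write ops/s and
the other only 50, and writes land on them roughly evenly, the client ends
up settling at about 2 x 50 = 100 ops/s total once the queue on the slower
OSD fills, no matter how fast the other one is.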
So let's say 50MB/s per device, to match your slower one.
1) If you put your journals on the same devices, you are doing 2 writes
for every incoming write since we do full data journalling. Assuming
that's the case, we are down to 25MB/s.
2) Now, are you writing to a pool that has 2X replication? If so, you
are writing out an object to both devices for every write, but also
incurring extra latency because the primary OSD will wait until it has
replicated a write to the secondary OSD before it can acknowledge it to the
client. With replication of 2 and 2 servers, that means that our
aggregate throughput at best can only be 25MB/s if each server can only
individually do 25MB/s. In reality because of the extra overhead
involved, it will probably be less.
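Putting rough numbers on 1) and 2), assuming the slower box really does
about 50MB/s of streaming writes and the journals share the data disks:

   50MB/s raw, halved by journal + data on the same spindles -> ~25MB/s per server
   2 servers x ~25MB/s, halved again by 2x replication       -> ~25MB/s of client data

so a client number well below 25MB/s isn't surprising once the overhead
in 3) below gets added on top.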
3) Now we must also account for the extra overhead that XFS causes. We
suggest XFS because it's stable, but especially on ceph prior to version
0.58, it's not typically as fast as BTRFS/EXT4. Some things that might
help are using noatime and inode64, making sure you are describing your
RAID array to XFS, and making sure your partitions are properly aligned
for the RAID. One other suggestion: if your controllers have WB cache,
enabling it can really help in some cases.
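As a rough sketch (the stripe values and the mount point below are
placeholders, use whatever matches your actual array and OSD data path; for
a 2-disk RAID 1 the stripe geometry matters much less than on a striped array):

   # align XFS to the RAID geometry at mkfs time (example values only)
   mkfs.xfs -f -d su=64k,sw=2 /dev/sdb1

   # mount with noatime and inode64
   mount -o rw,noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0

   # or, if ceph mounts the OSD data dir for you, set it in ceph.conf:
   [osd]
       osd mount options xfs = rw,noatime,inode64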
Can you explain this a bit more or point me to some design doc?
XFS is the recommended FS for the kernel we're using (3.2.0), and btrfs is
still experimental :-/
You may try
connecting to the OSD admin sockets during your tests and polling to see if
all of the outstanding operations are backing up on one OSD.
Sebastien has a nice little tutorial on how to use the admin socket here:
http://www.sebastien-han.fr/blog/2012/08/14/ceph-admin-socket/
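For reference, the basic invocation looks roughly like this (the socket path
depends on your distro/config, and the exact set of commands available varies
a bit between versions):

   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

If you poll each OSD while the test is running and one of them shows ops
piling up while the others stay near zero, that's your slow one.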
thanks, I'm going to look at this ...
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com