Re: rbd over xfs slow performance

On 04/18/2013 08:42 AM, Emmanuel Lacour wrote:
On Thu, Apr 18, 2013 at 08:25:50AM -0500, Mark Nelson wrote:


thanks for your answer!

It makes me a bit nervous that you are seeing such a discrepancy
between the drives.  Were you expecting that one server would be so
much faster than the other? If a drive is starting to fail, your
results may be unpredictable.


The two servers are far from identical, unfortunately.

The first server has two SAS 15k rpm drives in RAID 1 (PERC 5/i);
the second has two SATA 7.2k rpm drives in RAID 1 (aacraid CERC).


Are you doing replication?

Yes, I am using the default replication level, which is 2.

If one server has a slower drive, you are doing 2x replication, and you are
using XFS (which tends to have some performance overhead with Ceph), that
might get you down into this range given the 50MB/s number you posted above.

I don't understand why I can only send data at 15MB/s when it is written to
two devices that can do 50MB/s :(

So Ceph pseudo-randomly distributes data across the different OSDs, which means that you are more or less limited by the slowest OSD in your system. I.e., if one node can only process X objects per second, outstanding operations will slowly back up on it until you hit the maximum number of outstanding operations allowed, and the other OSDs get starved while the slow one tries to catch up.

So let's say 50MB/s per device, to match your slow one.

1) If you put your journals on the same devices, you are doing 2 writes for every incoming write, since we do full data journalling. Assuming that's the case, we are down to 25MB/s.

2) Now, are you writing to a pool that has 2x replication? If so, you are writing an object to both devices for every write, but also incurring extra latency because the primary OSD will wait until it has replicated the write to the secondary OSD before it can acknowledge to the client. With replication of 2 and 2 servers, that means our aggregate throughput can at best be 25MB/s if each server can individually only do 25MB/s. In reality, because of the extra overhead involved, it will probably be less (see the back-of-the-envelope sketch after this list).

3) Now we must also account for the extra overhead that XFS causes. We suggest XFS because it's stable, but especially on Ceph versions prior to 0.58 it's typically not as fast as btrfs/ext4. Some things that might help are mounting with noatime and inode64, describing your RAID geometry to XFS, and making sure your partitions are properly aligned with the RAID. One other suggestion: if your controllers have writeback (WB) cache, enabling it can really help in some cases.
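To make the arithmetic above concrete, here is a back-of-the-envelope sketch, not a benchmark. The disk speed, OSD count and replication level are the numbers from this thread; the overhead factor is purely an assumption, included only to illustrate how the observed ~15MB/s can fall out of a 50MB/s disk:

# Rough estimate of client-visible throughput for this 2-server setup.
# All inputs are assumptions from the thread; the overhead factor is a guess.

def estimated_client_throughput(slowest_disk_mbs=50.0, num_osds=2,
                                replication=2, journal_on_data_disk=True,
                                fs_and_latency_overhead=0.6):
    # 1) Full data journalling on the same disk halves the usable data rate.
    per_osd = slowest_disk_mbs / (2.0 if journal_on_data_disk else 1.0)

    # 2) Every client write is stored on `replication` OSDs, and the cluster
    #    is paced by its slowest OSD, so the aggregate client rate is roughly
    #    num_osds * per_osd / replication (here: 2 * 25 / 2 = 25MB/s).
    aggregate = num_osds * per_osd / replication

    # 3) XFS overhead plus the extra round trip for replica acks eat into it.
    return aggregate * fs_and_latency_overhead

print("~%.0f MB/s" % estimated_client_throughput())  # roughly 15 MB/s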


Can you explain this a bit more, or point me to some design doc?

XFS is the recommended FS for the kernel used (3.2.0), and btrfs is
still experimental :-/


You may try connecting to the OSD admin sockets during tests and polling to
see if all of the outstanding operations are backing up on one OSD.

Sebastien has a nice little tutorial on how to use the admin socket here:

http://www.sebastien-han.fr/blog/2012/08/14/ceph-admin-socket/
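
For example, something along these lines could poll each OSD's socket during a test. It's a rough sketch, not a supported tool: it assumes the default /var/run/ceph/*.asok paths and that your OSDs answer "dump_ops_in_flight" with a "num_ops" field in the JSON output (check the raw output on your version first, and run it on each server to cover all OSDs):

#!/usr/bin/env python
# Poll every local OSD admin socket and print how many ops are in flight.
# Socket paths and the "dump_ops_in_flight" command are assumptions; adjust
# both for your deployment.

import glob
import json
import subprocess
import time

SOCKET_GLOB = "/var/run/ceph/ceph-osd.*.asok"

def ops_in_flight(sock):
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", sock, "dump_ops_in_flight"])
    return json.loads(out.decode("utf-8")).get("num_ops", -1)

while True:
    for sock in sorted(glob.glob(SOCKET_GLOB)):
        try:
            print("%s: %d ops in flight" % (sock, ops_in_flight(sock)))
        except Exception as exc:  # daemon down, command unsupported, etc.
            print("%s: %s" % (sock, exc))
    print("---")
    time.sleep(2)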


thanks, I'm going to look at this ...

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

