On 04/18/2013 08:42 AM, Emmanuel Lacour wrote:
On Thu, Apr 18, 2013 at 08:25:50AM -0500, Mark Nelson wrote:
thanks for your answer!
It makes me a bit nervous that you are seeing such a discrepancy
between the drives. Were you expecting that one server would be so
much faster than the other? If a drive is starting to fail, your
results may be unpredictable.
the two servers are far from identical, unfortunately.
the first server has two SAS 15krpm drives in a RAID 1 (PERC 5/i),
the second has two SATA 7.2krpm drives in a RAID 1 (aacraid CERC)
Are you doing replication?
yes, I'm using the default replication, which is 2.
If one server has a slower drive, doing
2x replication, and you are using XFS (which tends to have some
performance overhead with ceph) that might get you down into this
range, given the 50MB/s number you posted above.
I don't understand why I can only send data at 15MB/s when it should be
written to two devices that can each do 50MB/s :(
So Ceph pseudo-randomly distributes data to different OSDs, which means
that you are more or less limited by the slowest OSD in your system, i.e.
if one node can only process X objects per second, outstanding
operations will slowly back up on it until you max out the number of
outstanding operations that are allowed and the other OSDs get starved
while the slow one tries to catch up.
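To put a made-up number on it: if one OSD can do, say, 100 write ops/s and
the other only 50, and writes land on them roughly evenly, the client ends
up settling at about 2 x 50 = 100 ops/s total once the queue on the slower
OSD fills, no matter how fast the other one is.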
So let's say 50MB/s per device, to match your slower one.
1) If you put your journals on the same devices, you are doing 2 writes
for every incoming write since we do full data journalling. Assuming
that's the case, we are down to 25MB/s.
2) Now, are you writing to a pool that has 2X replication? If so, you
are writing out an object to both devices for every write, but also
incurring extra latency because the primary OSD will wait until it has
replicated a write to the secondary OSD before it can acknowledge it to the
client. With replication of 2 and 2 servers, that means that our
aggregate throughput at best can only be 25MB/s if each server can only
individually do 25MB/s. In reality because of the extra overhead
involved, it will probably be less.
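Putting rough numbers on 1) and 2), assuming the slower box really does
about 50MB/s of streaming writes and the journals share the data disks:

   50MB/s raw, halved by journal + data on the same spindles -> ~25MB/s per server
   2 servers x ~25MB/s, halved again by 2x replication       -> ~25MB/s of client data

so a client number well below 25MB/s isn't surprising once the overhead
in 3) below gets added on top.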
3) Now we must also account for the extra overhead that XFS causes. We
suggest XFS because it's stable, but especially on ceph prior to version
0.58, it's not typically as fast as BTRFS/EXT4. Some things that might
help are using noatime and inode64, making sure you are describing your
RAID array to XFS, and making sure your partitions are properly aligned
for the RAID. One other suggestion: if your controllers have WB cache,
enabling it can really help in some cases.
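As a rough sketch (the stripe values and the mount point below are
placeholders, use whatever matches your actual array and OSD data path; for
a 2-disk RAID 1 the stripe geometry matters much less than on a striped array):

   # align XFS to the RAID geometry at mkfs time (example values only)
   mkfs.xfs -f -d su=64k,sw=2 /dev/sdb1

   # mount with noatime and inode64
   mount -o rw,noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0

   # or, if ceph mounts the OSD data dir for you, set it in ceph.conf:
   [osd]
       osd mount options xfs = rw,noatime,inode64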
Can you explain this a bit more or point me to some design doc?
XFS is the recommended FS for the kernel we're using (3.2.0), and btrfs is
still experimental :-/
You may try
connecting to the OSD admin sockets during your tests and polling to see if
all of the outstanding operations are backing up on one OSD.
Sebastien has a nice little tutorial on how to use the admin socket here:
http://www.sebastien-han.fr/blog/2012/08/14/ceph-admin-socket/
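For reference, the basic invocation looks roughly like this (the socket path
depends on your distro/config, and the exact set of commands available varies
a bit between versions):

   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

If you poll each OSD while the test is running and one of them shows ops
piling up while the others stay near zero, that's your slow one.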
thanks, I'm going to look at this ...
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com