Hi all,

I'm trying to identify the performance bottlenecks in my experimental Ceph cluster. A little background on my setup:

10 storage servers, each configured with:
- (2) dual-core Opterons
- 8 GB of RAM
- (6) 750 GB disks (1 OSD per disk, 7200 RPM SATA, probably 4-5 years old), JBOD w/ BTRFS
- 1 GbE
- CentOS 6.4, custom kernel 3.7.8

1 dedicated mds/mon server:
- same specs as the OSD nodes
- (2 more dedicated mons waiting in the wings; recently reinstalled Ceph)

1 front-facing node mounting CephFS, with a 10 GbE connection into the switch stack housing the storage machines:
- CentOS 6.4, custom kernel 3.7.8

Some Ceph settings:

[osd]
osd journal size = 1000
filestore xattr use omap = true

When I try to transfer files in/out via CephFS (from the 10 GbE host), I'm seeing only about 230 MB/s at peak. First, is this what I should expect? Given 60 OSDs spread across 10 servers, I would have thought I'd get something closer to 400-500 MB/s or more.

I tried upping the number of placement groups to 3000 for my 'data' pool (following the formula here: http://ceph.com/docs/master/rados/operations/placement-groups/, which with 2x replication works out to 60 OSDs * 100 / 2 = 3000) with no increase in performance. I also saw no performance difference between XFS and BTRFS.

I also see a lot of messages like this in the log:

10.1.6.4:6815/30138 3518 : [WRN] slow request 30.874441 seconds old, received at 2013-07-31 10:52:49.721518: osd_op(client.7763.1:67060 100000003ba.000013d4 [write 0~4194304] 0.102b9365 RETRY=-1 snapc 1=[] e1454) currently waiting for subops from [1]

Does anyone have any thoughts as to what the bottleneck may be, if there is one? Or any idea what I should measure to track it down? Perhaps my disks are just that bad? :)
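To make the question more concrete, here's the rough bottom-up measurement plan I had in mind; the OSD id, data path, and device names below are just examples from my setup:

    # Raw write throughput of a single OSD disk (run on a storage node;
    # the OSD data dir here stands in for one of the six data disks):
    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/ddtest bs=4M count=256 oflag=direct

    # What Ceph itself thinks a single OSD can sustain (writes ~1 GB):
    ceph tell osd.0 bench

    # Raw RADOS throughput from the 10 GbE client, taking CephFS and the
    # MDS out of the picture (60 seconds of 4 MB sequential writes):
    rados -p data bench 60 write

    # Per-disk utilization on the storage nodes while the benchmarks run:
    iostat -xm 1

If rados bench already tops out around 230 MB/s, I'd conclude CephFS isn't the limiting factor and look at the OSDs, journals, or the 1 GbE links instead.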
Cheers,
Lincoln