On 05/24/2012 09:10 AM, Stefan Priebe - Profihost AG wrote:
Hi list,
today, while testing btrfs, I discovered very poor OSD performance using
kernel 3.4. The underlying FS in the runs below is XFS, but the results
are the same with btrfs.
Kernel 3.0.30:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        41        25   99.9767       100  0.586984  0.447293
    2      16        71        55   109.979       120  0.934388  0.488375
    3      16        99        83   110.647       112   1.15982  0.503111
    4      16       130       114   113.981       124   1.05952  0.516925
    5      16       159       143   114.382       116  0.149313  0.510734
    6      16       188       172   114.649       116  0.287166   0.52203
    7      16       215       199   113.697       108  0.151784  0.531461
    8      16       242       226   112.984       108  0.623478  0.539896
    9      16       265       249   110.651        92   0.50354  0.538504
   10      16       296       280   111.984       124  0.155048  0.542846
Total time run: 10.776153
Total writes made: 297
Write size: 4194304
Bandwidth (MB/sec): 110.243
Average Latency: 0.577534
Max latency: 1.85499
Min latency: 0.091473
Kernel 3.4:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        40        24   95.9794        96  0.393196  0.455936
    2      16        68        52   103.983       112  0.835652  0.517297
    3      16        85        69   91.9849        68   1.00535  0.493058
    4      16        96        80   79.9869        44  0.096564  0.577948
    5      16       103        87   69.5879        28  0.092722  0.589147
    6      16       117       101   67.3216        56  0.222175  0.675334
    7      16       130       114   65.1321        52   0.15677  0.623806
    8      16       144       128   63.9896        56  0.089157   0.56746
    9      16       144       128   56.8794         0         -   0.56746
   10      16       144       128   51.1912         0         -   0.56746
   11      16       144       128   46.5373         0         -   0.56746
   12      16       144       128   42.6591         0         -   0.56746
   13      16       144       128   39.3776         0         -   0.56746
   14      16       144       128   36.5649         0         -   0.56746
   15      16       144       128   34.1272         0         -   0.56746
   16      16       145       129   32.2443       0.5   11.3422  0.650985
Total time run: 16.193871
Total writes made: 145
Write size: 4194304
Bandwidth (MB/sec): 35.816
Average Latency: 1.78467
Max latency: 14.4744
Min latency: 0.088753
Stefan
I set up some tests today to try to replicate your findings (and also to
check the results against some previous ones I've done). I don't think I'm
seeing exactly the same results as you, but I definitely see xfs
performing worse than btrfs in this specific test. I've included the
results here.
Distro: Ubuntu Oneiric (i.e. no syncfs in glibc)
Ceph: 0.47.2
Kernel 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)
Network: 10GbE
1 Client node
3 Mon nodes
2 OSD nodes with 1 OSD each, mounted on a 7200rpm SAS drive. H700 RAID
controller with each drive in a single-disk RAID0. Journals are partitioned
on a separate drive. OSD data disks use write-through (WT) cache while
journals use write-back (WB).
btrfs created with -l 64k -n 64k, mounted using noatime.
xfs created with -f -d su=64k,sw=1 -i size=2048, mounted using noatime.
(The full format/mount commands are sketched below.)
rados bench invocation: rados -p data bench 300 write -t 16 -b 4194304
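For reference, the full format/mount commands would have looked roughly
like this; the device node /dev/sdX and mount point /srv/osd.0 are
placeholders, not the exact ones used:

  # btrfs with 64k leaf and node sizes, mounted noatime
  # (/dev/sdX and /srv/osd.0 are placeholders)
  mkfs.btrfs -l 64k -n 64k /dev/sdX
  mount -o noatime /dev/sdX /srv/osd.0

  # xfs with a 64k stripe unit and 2048-byte inodes, mounted noatime
  mkfs.xfs -f -d su=64k,sw=1 -i size=2048 /dev/sdX
  mount -o noatime /dev/sdX /srv/osd.0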
btrfs:
Total time run: 300.413696
Total writes made: 7582
Write size: 4194304
Bandwidth (MB/sec): 100.954
Average Latency: 0.633932
Max latency: 3.78661
Min latency: 0.065734
xfs:
Total time run: 304.435966
Total writes made: 5023
Write size: 4194304
Bandwidth (MB/sec): 65.997
Average Latency: 0.96965
Max latency: 36.4993
Min latency: 0.07516
Full results are available here:
http://nhm.ceph.com/results/mailinglist-tests/
I created seekwatcher movies by running blktrace on the underlying OSD
data disks during the tests. These show throughput over time,
seeks/sec, and a visual representation of where the disk is being written
to for each OSD. You can see them here:
http://nhm.ceph.com/movies/mailinglist-tests/
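For anyone who wants to reproduce these, the workflow was roughly the
following (device node and file names here are placeholders; seekwatcher's
movie output also needs matplotlib and mencoder installed):

  # trace the OSD data disk for the duration of the benchmark
  # (/dev/sdX is a placeholder)
  blktrace -d /dev/sdX -o osd-trace

  # afterwards, render the collected trace as a seekwatcher movie
  seekwatcher -t osd-trace -o osd-trace.mpg --movie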
As you can see, at least for the quick tests I did this afternoon, the
performance of the underlying OSD disk is highly correlated with the
number of seeks being done. These results may improve with syncfs
support in Ubuntu 12.04. If you have your journals on the same disks as
the OSDs, that will cause even more seeks (in addition to the greater
throughput demands). These are things that we are
actively investigating and hopefully will be able to improve over the
coming months.
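As an aside, a quick way to check whether your glibc exposes the syncfs
wrapper is to look for the symbol in libc (the path below is an assumption
and varies by distro/arch):

  # list dynamic symbols and look for syncfs
  nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep syncfs

Even without the glibc wrapper, the syscall itself has been available
since kernel 2.6.39 and can be reached via syscall(SYS_syncfs, fd).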
Thanks,
Mark