Re: any recommendation of using EnhanceIO?

On 07/02/15 13:49, German Anders wrote:
output from iostat:

CEPHOSD01:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc(ceph-0)       0.00     0.00    1.00  389.00     0.00    35.98   188.96    60.32  120.12   16.00  120.39   1.26  49.20
sdd(ceph-1)       0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf(ceph-2)       0.00     1.00    6.00  521.00     0.02    60.72   236.05   143.10  309.75  484.00  307.74   1.90 100.00
sdg(ceph-3)       0.00     0.00   11.00  535.00     0.04    42.41   159.22   139.25  279.72  394.18  277.37   1.83 100.00
sdi(ceph-4)       0.00     1.00    4.00  560.00     0.02    54.87   199.32   125.96  187.07  562.00  184.39   1.65  93.20
sdj(ceph-5)       0.00     0.00    0.00  566.00     0.00    61.41   222.19   109.13  169.62    0.00  169.62   1.53  86.40
sdl(ceph-6)       0.00     0.00    8.00    0.00     0.09     0.00    23.00     0.12   12.00   12.00    0.00   2.50   2.00
sdm(ceph-7)       0.00     0.00    2.00  481.00     0.01    44.59   189.12   116.64  241.41  268.00  241.30   2.05  99.20
sdn(ceph-8)       0.00     0.00    1.00    0.00     0.00     0.00     8.00     0.01    8.00    8.00    0.00   8.00   0.80
fioa              0.00     0.00    0.00 1016.00     0.00    19.09    38.47     0.00    0.06    0.00    0.06   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc(ceph-0)       0.00     1.00   10.00  278.00     0.04    26.07   185.69    60.82  257.97  309.60  256.12   2.83  81.60
sdd(ceph-1)       0.00     0.00    2.00    0.00     0.02     0.00    20.00     0.02   10.00   10.00    0.00  10.00   2.00
sdf(ceph-2)       0.00     1.00    6.00  579.00     0.02    54.16   189.68   142.78  246.55  328.67  245.70   1.71 100.00
sdg(ceph-3)       0.00     0.00   10.00   75.00     0.05     5.32   129.41     4.94  185.08   11.20  208.27   4.05  34.40
sdi(ceph-4)       0.00     0.00   19.00  147.00     0.09    12.61   156.63    17.88  230.89  114.32  245.96   3.37  56.00
sdj(ceph-5)       0.00     1.00    2.00  629.00     0.01    43.66   141.72   143.00  223.35  426.00  222.71   1.58 100.00
sdl(ceph-6)       0.00     0.00   10.00    0.00     0.04     0.00     8.00     0.16   18.40   18.40    0.00   5.60   5.60
sdm(ceph-7)       0.00     0.00   11.00    4.00     0.05     0.01     8.00     0.48   35.20   25.82   61.00  14.13  21.20
sdn(ceph-8)       0.00     0.00    9.00    0.00     0.07     0.00    15.11     0.07    8.00    8.00    0.00   4.89   4.40
fioa              0.00     0.00    0.00 6415.00     0.00   125.81    40.16     0.00    0.14    0.00    0.14   0.00   0.00

CEPHOSD02:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc1(ceph-9)      0.00     0.00   13.00    0.00     0.11     0.00    16.62     0.17   13.23   13.23    0.00   4.92   6.40
sdd1(ceph-10)     0.00     0.00   15.00    0.00     0.13     0.00    18.13     0.26   17.33   17.33    0.00   1.87   2.80
sdf1(ceph-11)     0.00     0.00   22.00  650.00     0.11    51.75   158.04   143.27  212.07  308.55  208.81   1.49 100.00
sdg1(ceph-12)     0.00     0.00   12.00  282.00     0.05    54.60   380.68    13.16  120.52  352.00  110.67   2.91  85.60
sdi1(ceph-13)     0.00     0.00    1.00    0.00     0.00     0.00     8.00     0.01    8.00    8.00    0.00   8.00   0.80
sdj1(ceph-14)     0.00     0.00   20.00    0.00     0.08     0.00     8.00     0.26   12.80   12.80    0.00   3.60   7.20
sdl1(ceph-15)     0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdm1(ceph-16)     0.00     0.00   20.00  424.00     0.11    32.20   149.05    89.69  235.30  243.00  234.93   2.14  95.20
sdn1(ceph-17)     0.00     0.00    5.00  411.00     0.02    45.47   223.94    98.32  182.28 1057.60  171.63   2.40 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc1(ceph-9)      0.00     0.00   26.00  383.00     0.11    34.32   172.44    86.92  258.64  297.08  256.03   2.29  93.60
sdd1(ceph-10)     0.00     0.00    8.00   31.00     0.09     1.86   101.95     0.84  178.15   94.00  199.87   6.46  25.20
sdf1(ceph-11)     0.00     1.00    5.00  409.00     0.05    48.34   239.34    90.94  219.43  383.20  217.43   2.34  96.80
sdg1(ceph-12)     0.00     0.00    0.00  238.00     0.00     1.64    14.12    58.34  143.60    0.00  143.60   1.83  43.60
sdi1(ceph-13)     0.00     0.00   11.00    0.00     0.05     0.00    10.18     0.16   14.18   14.18    0.00   5.09   5.60
sdj1(ceph-14)     0.00     0.00    1.00    0.00     0.00     0.00     8.00     0.02   16.00   16.00    0.00  16.00   1.60
sdl1(ceph-15)     0.00     0.00    1.00    0.00     0.03     0.00    64.00     0.01   12.00   12.00    0.00  12.00   1.20
sdm1(ceph-16)     0.00     1.00    4.00  587.00     0.03    50.09   173.69   143.32  244.97  296.00  244.62   1.69 100.00
sdn1(ceph-17)     0.00     0.00    0.00  375.00     0.00    23.68   129.34    69.76  182.51    0.00  182.51   2.47  92.80

If this iostat output is typical, you seem to be limited by random writes on a subset of your OSDs: each server has 9, but only 4 to 6 of them take writes at any given time, and the ratio of wMB/s to w/s points to a moderately random access pattern.
You should find out why. Either you have a configuration problem, or access to your rbds is concentrated on a few 4MB objects (if you used the default order) of the devices.
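A few commands can help check whether the load is actually spread over all OSDs. This is only a sketch of the usual diagnostics (run from an admin node; exact output format varies by Ceph release), not something specific to this cluster:

```shell
# Per-OSD space and PG utilization -- large imbalances here suggest
# the data (and therefore the writes) are concentrated on a few OSDs:
ceph osd df

# PG count for the pool backing the rbds (pool name assumed to be "rbd");
# too few PGs limits how well IO can be distributed:
ceph osd pool get rbd pg_num

# Live per-OSD commit/apply latencies while the benchmark runs;
# a handful of OSDs with much higher latency matches the iostat picture:
ceph osd perf
```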


The other OSD server had pretty much the same load.

The config of the OSD servers is the following:

- 2x Intel Xeon E5-2609 v2 @ 2.50GHz (4C)
- 128G RAM
- 2x 120G SSD Intel SSDSC2BB12 (RAID-1) for OS
- 2x 10GbE ADPT DP
- Journals are configured to run on a RAM disk (tmpfs), but on the first OSD server the journals go to a battery-backed FusionIO adapter (/dev/fioa).

I suppose this is not yet production (tmpfs journals). You only have 128G of RAM for 9 OSDs: what is the size of your journals when you use tmpfs, and more importantly, what is the value of filestore max sync interval?
I'm not sure how the OSDs will react to a journal with multi-GB/s write bandwidth: the default filestore max sync interval might be too high (it is supposed to prevent the journal from filling up). On the other hand, a low max interval prevents the OS from reordering writes to the hard drives, which is what avoids excessive random IO.
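For reference, the journal-related settings discussed above live in the [osd] section of ceph.conf. The values below are only illustrative, matching the sizing suggested later in this mail, not a recommendation for this specific cluster:

```ini
[osd]
; journal size in MB (10 GB here)
osd journal size = 10240
; force a filestore sync at least this often (seconds),
; so the journal never fills much past the halfway mark
filestore max sync interval = 30
```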

So there are two causes I can see that might lead to performance problems:
- the IO load might not be distributed across all your OSDs, limiting your total bandwidth,
- you might see IO freezes when your tmpfs journals fill up during very high bursts (probably unlikely, but the consequences could be dire).

Another problem I see is cost: FusionIO speed and cost (and tmpfs speed) are probably overkill for the journals in your case. With your setup, 2x Intel DC S3500 per server would probably be enough (unless you need more write endurance).
With what you save by not putting a FusionIO card in each server, you could probably add more servers and get far better performance overall.

If you do, use a 10GB journal size and a filestore max sync interval that allows only half of it to fill between syncs. With 2x 500MB/s of write bandwidth divided between 9 balanced OSDs, that is about 110MB/s per OSD, so a 30s interval leaves room to spare.
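The arithmetic behind that 30s figure, using the assumed numbers from this thread (2 SSDs at ~500MB/s shared by 9 OSDs, 10GB journals):

```python
# Back-of-the-envelope journal sizing check.
ssd_write_bw_mb = 2 * 500            # total SSD write bandwidth, MB/s
osds = 9
per_osd_bw = ssd_write_bw_mb / osds  # bandwidth share per OSD, MB/s

journal_size_mb = 10 * 1024          # 10 GB journal
usable_mb = journal_size_mb / 2      # only let half fill between syncs

# Worst-case time to fill half the journal at full per-OSD bandwidth;
# the sync interval must stay below this.
fill_time_s = usable_mb / per_osd_bw
print(round(per_osd_bw), round(fill_time_s))  # -> 111 46
```

So half the journal takes about 46s to fill at full speed, which is why a 30s sync interval still has headroom.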

This assumes you can distribute IOs to all OSDs; if you have atypical access patterns, you might have to recreate your rbds with a lower order or use striping to achieve this.
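As a hypothetical example of both options (the pool and image names here are made up, and striping requires format 2 images):

```shell
# Smaller objects: --order 20 gives 1 MB objects instead of the
# default --order 22 (4 MB), spreading writes over more objects/OSDs.
rbd create rbd/myimage --size 102400 --image-format 2 --order 20

# Or keep 4 MB objects but stripe across them: each 64 KB stripe unit
# rotates over 8 objects, fanning hot regions out over more OSDs.
rbd create rbd/mystriped --size 102400 --image-format 2 \
    --stripe-unit 65536 --stripe-count 8
```

Note that order and striping are fixed at image creation, so existing images would have to be copied into new ones.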

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
