Re: CEPH cache layer. Very slow

Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx> · Fri, 14 Aug 2015 21:15:50 +0300

72 OSDs: 60 HDD, 12 SSD.
Primary workload: RBD, KVM.

On Friday, 14 August 2015, Ben Hines wrote:
Nice to hear that you have had no SSD failures yet in 10 months.

How many OSDs are you running, and what is your primary ceph workload?

(RBD, rgw, etc?)

-Ben

On Fri, Aug 14, 2015 at 2:23 AM, Межов Игорь Александрович

<megov@xxxxxxxxxx> wrote:

> Hi!

>

>

> Of course, it isn't cheap at all, but we use Intel DC S3700 200 GB for Ceph journals
> and DC S3700 400 GB in the SSD pool: same hosts, separate root in the CRUSH map.

>

> The SSD pool is not yet in production; the journaling SSDs have been under production load
> for 10 months. They're in good condition: no faults, no degradation.

>

> We deliberately chose 200 GB SSDs for journals to reduce costs, and we also run a
> higher-than-recommended OSD/SSD ratio: 1 SSD per 10-12 OSDs, while the recommended
> ratio is 1 SSD per 3 to 6 OSDs.

>

> So, in conclusion, I'd recommend getting a bigger budget and buying durable
> and fast SSDs for Ceph.

>

> Megov Igor

> CIO, Yuterra

>

> ________________________________

> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>
> Sent: 13 August 2015, 15:54
> To: Jan Schermer
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: CEPH cache layer. Very slow

>

> So, good, but the price of the 845 DC PRO 400 GB is about 2x higher than
> the Intel S3500 240G (((

>

> Any other models? (((

>

> 2015-08-13 15:45 GMT+03:00 Jan Schermer <jan@xxxxxxxxxxx>:

>>

>> I tested and can recommend the Samsung 845 DC PRO (make sure it is DC PRO

>> and not just "PRO" or "DC EVO"!).

>> Those were very cheap but are out of stock at the moment (here).

>> Faster than Intels, cheaper, and slightly different technology (3D V-NAND)

>> which IMO makes them superior without needing many tricks to do its job.

>>

>> Jan

>>

>> On 13 Aug 2015, at 14:40, Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>

>> wrote:

>>

>> Tnx, Irek! Will try!

>>

>> But another question to all: which SSDs are good enough for Ceph now?

>>

>> I'm looking at the S3500 240G (I have some S3500 120G, which show great
>> results: around 8 times better than the Samsung).

>>

>> Could you give advice about other vendors/models at the same or a lower
>> price level than the S3500 240G?

>>

>> 2015-08-13 12:11 GMT+03:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:

>>>

>>> Hi, Igor.

>>> Try applying the patch from here:

>>>

>>> http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov

>>>

>>> P.S. I no longer track changes in this direction (kernel), because we
>>> already use the recommended SSDs.

>>>

>>> Best regards, Фасихов Ирек Нургаязович
>>> Mobile: +79229045757

>>>

>>> 2015-08-13 11:56 GMT+03:00 Voloshanenko Igor

>>> <igor.voloshanenko@xxxxxxxxx>:

>>>>

>>>> So, after testing the SSD (I wiped 1 SSD and used it for tests):

>>>>

>>>> root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1
>>>> --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>>> --group_reporting --name=journal-test

>>>> journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
>>>> fio-2.1.3
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta 00m:00s]
>>>> journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13 10:46:42 2015
>>>>   write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
>>>>     clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
>>>>      lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[ 2928],
>>>>      | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
>>>>      | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[ 4016],
>>>>      | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792], 99.95th=[10048],
>>>>      | 99.99th=[14912]
>>>>     bw (KB  /s): min= 1064, max= 1213, per=100.00%, avg=1150.07, stdev=34.31
>>>>     lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
>>>>   cpu          : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
>>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued    : total=r=0/w=17243/d=0, short=r=0/w=0/d=0
>>>>
>>>> Run status group 0 (all jobs):
>>>>   WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s, mint=60001msec, maxt=60001msec
>>>>
>>>> Disk stats (read/write):
>>>>   sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%

>>>>

>>>> So, it's painful... the SSD does only 287 IOPS at 4K, about 1.1 MB/s.
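As a sanity check on the fio numbers above: with sync writes at iodepth=1, bandwidth is simply IOPS times block size, and the average latency per write is roughly the inverse of the IOPS. A minimal sketch of that arithmetic follows; the function name `iops_summary` is made up for illustration and is not part of fio.

```shell
#!/bin/sh
# Hypothetical helper (not part of fio): convert sync-write IOPS at a given
# block size in KB into bandwidth (KB/s) and average per-write latency (ms).
# Assumes iodepth=1, so only one write is in flight at a time.
iops_summary() {
    iops=$1
    bs_kb=$2
    awk -v iops="$iops" -v bs="$bs_kb" 'BEGIN {
        printf "bw_kb_s=%d lat_ms=%.2f\n", iops * bs, 1000 / iops
    }'
}

# Numbers from the fio run above: 287 IOPS at 4 KB blocks.
iops_summary 287 4
```

For this run it gives about 1148 KB/s and 3.48 ms per write, which lines up with fio's own bw and clat averages: the drive needs roughly 3.5 ms for every flushed 4K write.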

>>>>

>>>> I tried changing the cache mode:

>>>> echo temporary write through > /sys/class/scsi_disk/2:0:0:0/cache_type

>>>> echo temporary write through > /sys/class/scsi_disk/3:0:0:0/cache_type

>>>>

>>>> No luck, still the same bad results. I also found this article:
>>>> https://lkml.org/lkml/2013/11/20/264 which points to an old, very simple patch
>>>> that disables CMD_FLUSH:
>>>> https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba
>>>>
>>>> Does anybody have better ideas on how to improve this, or how to disable CMD_FLUSH
>>>> without recompiling the kernel? I use Ubuntu with kernel 4.0.4 for now (the 4.x branch,
>>>> because the SSD 850 Pro has an issue with NCQ TRIM, and before 4.0.4 this
>>>> exception was not included in libata).
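To see which cache mode the kernel currently assumes for each SCSI disk (the same sysfs `cache_type` attribute the echo commands above write to), something like the following could be used. This is only a sketch: the function name `show_cache_types` and its optional base-directory argument are my own additions, the latter only so the loop can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: print the cache_type the kernel reports for each
# SCSI disk. The optional first argument overrides the sysfs base directory.
show_cache_types() {
    base=${1:-/sys/class/scsi_disk}
    for f in "$base"/*/cache_type; do
        # Skip the unexpanded glob when there are no matching disks.
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

show_cache_types
```

Reading the attribute back after the echo confirms whether the kernel accepted the temporary override; it does not say anything about what the drive firmware does with a flush.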

>>>>

>>>> 2015-08-12 19:17 GMT+03:00 Pieter Koorts <pieter.koorts@xxxxxx>:

>>>>>

>>>>> Hi Igor

>>>>>

>>>>> I suspect you have very much the same problem as me.

>>>>>

>>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg22260.html

>>>>>

>>>>> Basically, Samsung drives (like many SATA SSDs) are very much hit and
>>>>> miss, so you will need to test them as described here to see if they are
>>>>> any good.

>>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

>>>>>

>>>>> To give you an idea, my average write performance went from 11 MB/s (with
>>>>> the Samsung SSD) to 30 MB/s (without any SSD). This is a
>>>>> very small cluster.

>>>>>

>>>>> Pieter

>>>>>

>>>>> On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor

>>>>> <igor.voloshanenko@xxxxxxxxx> wrote:

>>>>>

>>>>> Hi all, we have set up a Ceph cluster with 60 OSDs of 2 different types (5 nodes,
>>>>> 12 disks each: 10 HDD, 2 SSD).
>>>>>
>>>>> We also cover this with a custom CRUSH map with 2 roots:

>>>>>

>>>>> ID   WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY

>>>>> -100 5.00000 root ssd

>>>>> -102 1.00000     host ix-s2-ssd

>>>>>    2 1.00000         osd.2               up  1.00000          1.00000

>>>>>    9 1.00000         osd.9               up  1.00000          1.00000

>>>>> -103 1.00000     host ix-s3-ssd

>>>>>    3 1.00000         osd.3               up  1.00000          1.00000

>>>>>    7 1.00000         osd.7               up  1.00000          1.00000

>>>>> -104 1.00000     host ix-s5-ssd

>>>>>    1 1.00000         osd.1               up  1.00000          1.00000

>>>>>    6 1.00000         osd.6               up  1.00000          1.00000

>>>>> -105 1.00000     host ix-s6-ssd

>>>>>    4 1.00000         osd.4               up  1.00000          1.00000

>>>>>    8 1.00000         osd.8               up  1.00000          1.00000

>>>>> -106 1.00000     host ix-s7-ssd

>>>>>    0 1.00000         osd.0               up  1.00000          1.00000

>>>>>    5 1.00000         osd.5               up  1.00000          1.00000

>>>>>   -1 5.00000 root platter

>>>>>   -2 1.00000     host ix-s2-platter

>>>>>   13 1.00000         osd.13              up  1.00000          1.00000

>>>>>   17 1.00000         osd.17              up  1.00000          1.00000

>>>>>   21 1.00000         osd.21              up  1.00000          1.00000

>>>>>   27 1.00000         osd.27              up  1.00000          1.00000

>>>>>   32 1.00000         osd.32              up  1.00000          1.00000

>>>>>   37 1.00000         osd.37              up  1.00000          1.00000

>>>>>   44 1.00000         osd.44              up  1.00000          1.00000

>>>>>   48 1.00000         osd.48              up  1.00000          1.00000

>>>>>   55 1.00000         osd.55              up  1.00000          1.00000

>>>>>   59 1.00000         osd.59              up  1.00000          1.00000

>>>>>   -3 1.00000     host ix-s3-platter

>>>>>   14 1.00000         osd.14              up  1.00000          1.00000

>>>>>   18 1.00000         osd.18              up  1.00000          1.00000

>>>>>   23 1.00000         osd.23              up  1.00000          1.00000

>>>>>   28 1.00000         osd.28              up  1.00000          1.00000

>>>>>   33 1.00000         osd.33              up  1.00000          1.00000

>>>>>   39 1.00000         osd.39              up  1.00000          1.00000

>>>>>   43 1.00000         osd.43              up  1.00000          1.00000

>>>>>   47 1.00000         osd.47              up  1.00000          1.00000

>>>>>   54 1.00000         osd.54              up  1.00000          1.00000

>>>>>   58 1.00000         osd.58              up  1.00000          1.00000

>>>>>   -4 1.00000     host ix-s5-platter

>>>>>   11 1.00000         osd.11              up  1.00000          1.00000

>>>>>   16 1.00000         osd.16              up  1.00000          1.00000

>>>>>   22 1.00000         osd.22              up  1.00000          1.00000

>>>>>   26 1.00000         osd.26              up  1.00000          1.00000

>>>>>   31 1.00000         osd.31              up  1.00000          1.00000

>>>>>   36 1.00000         osd.36              up  1.00000          1.00000

>>>>>   41 1.00000         osd.41              up  1.00000          1.00000

>>>>>   46 1.00000         osd.46              up  1.00000          1.00000

>>>>>   51 1.00000         osd.51              up  1.00000          1.00000

>>>>>   56 1.00000         osd.56              up  1.00000          1.00000

>>>>>   -5 1.00000     host ix-s6-platter

>>>>>   12 1.00000         osd.12              up  1.00000          1.00000

>>>>>   19 1.00000         osd.19              up  1.00000          1.00000

>>>>>   24 1.00000         osd.24              up  1.00000          1.00000

>>>>>   29 1.00000         osd.29              up  1.00000          1.00000

>>>>>   34 1.00000         osd.34              up  1.00000          1.00000

>>>>>   38 1.00000         osd.38              up  1.00000          1.00000

>>>>>   42 1.00000         osd.42              up  1.00000          1.00000

>>>>>   50 1.00000         osd.50              up  1.00000          1.00000

>>>>>   53 1.00000         osd.53              up  1.00000          1.00000

>>>>>   57 1.00000         osd.57              up  1.00000          1.00000

>>>>>   -6 1.00000     host ix-s7-platter

>>>>>   10 1.00000         osd.10              up  1.00000          1.00000

>>>>>   15 1.00000         osd.15              up  1.00000          1.00000

>>>>>   20 1.00000         osd.20              up  1.00000          1.00000

>>>>>   25 1.00000         osd.25              up  1.00000          1.00000

>>>>>   30 1.00000         osd.30              up  1.00000          1.00000

>>>>>   35 1.00000         osd.35              up  1.00000          1.00000

>>>>>   40 1.00000         osd.40              up  1.00000          1.00000

>>>>>   45 1.00000         osd.45              up  1.00000          1.00000

>>>>>   49 1.00000         osd.49              up  1.00000          1.00000

>>>>>   52 1.00000         osd.52              up  1.00000          1.00000

>>>>>

>>>>>

>>>>> Then we created 2 pools, 1 on HDD (platters) and 1 on SSD,
>>>>> and put the SSD pool in front of the HDD pool (cache tier).
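For readers following along, putting an SSD pool in front of an HDD pool as described above is usually done with three `ceph osd tier` subcommands. The sketch below only prints them instead of running anything; the pool names `platter-pool` and `ssd-cache` are placeholders of mine, not the names used on this cluster, and writeback is assumed as the cache mode.

```shell
#!/bin/sh
# Print (do not run) the commands that place a cache pool in front of a
# backing pool. Both pool names are supplied by the caller; the ones used
# in the example call are placeholders.
print_tier_setup() {
    backing=$1
    cache=$2
    echo "ceph osd tier add $backing $cache"
    echo "ceph osd tier cache-mode $cache writeback"
    echo "ceph osd tier set-overlay $backing $cache"
}

print_tier_setup platter-pool ssd-cache
```

Printing rather than executing keeps the sketch safe to try on a machine without a cluster; the output can be reviewed and then run by hand.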

>>>>>

>>>>> Now we get very bad performance results from the cluster.
>>>>> Even with rados bench we get very unstable performance, at times dropping
>>>>> to zero. This creates very big issues for our clients.

>>>>>

>>>>> I tried tuning all possible values, including the OSD settings, but still no luck.

>>>>>

>>>>> Also, an unbelievable situation: when I run
>>>>> ceph tell... bench on an SSD OSD, I get about 20 MB/s.
>>>>> For an HDD OSD, 67 MB/s...

>>>>>

>>>>> I don't understand why a cache pool consisting of SSDs works so badly...
>>>>> We use Samsung 850 Pro 256 GB as SSDs.

>>>>>

>>>>> Can you guys give me advice please...

>>>>>

>>>>> Another strange thing: when I set cache-mode to forward and tried to
>>>>> flush-evict all objects (not all objects were evicted; some were busy, locked on the KVM
>>>>> side), I started getting quite stable results from rados bench:

>>>>>

>>>>> Total time run:         30.275871

>>>>> Total writes made:      2076

>>>>> Write size:             4194304

>>>>> Bandwidth (MB/sec):     274.278

>>>>>

>>>>> Stddev Bandwidth:       75.1445

>>>>> Max bandwidth (MB/sec): 368

>>>>> Min bandwidth (MB/sec): 0

>>>>> Average Latency:        0.232892

>>>>> Stddev Latency:         0.240356

>>>>> Max latency:            2.01436

>>>>> Min latency:            0.0716344

>>>>>

>>>>> Without zeros, etc... So I don't understand how that's possible.
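The bandwidth figure in the rados bench summary above can be reproduced from the other fields: total writes times write size, divided by run time. A small sketch of that calculation, where `bench_bandwidth` is a hypothetical helper name and MB is taken as 2^20 bytes:

```shell
#!/bin/sh
# Hypothetical helper: recompute rados bench bandwidth in MB/s from
# total writes, write size in bytes, and run time in seconds.
# MB here means 1048576 bytes.
bench_bandwidth() {
    awk -v n="$1" -v sz="$2" -v t="$3" 'BEGIN {
        printf "%.3f\n", n * sz / 1048576 / t
    }'
}

# Values from the summary above: 2076 writes of 4194304 bytes in 30.275871 s.
bench_bandwidth 2076 4194304 30.275871
```

This yields 274.278, the same as the reported average. It is an average over the whole run, so it hides the per-second variation that the Stddev and Min bandwidth lines point at.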

>>>>>

>>>>> Another interesting thing: when I disable the overlay for the pool, rados bench
>>>>> drops to around 70 MB/s, as for ordinary HDDs, but at the same time rados bench on the
>>>>> SSD pool, which is no longer used, shows the same bad results...
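The sequence described above (switch the cache to forward mode, flush and evict its objects, then remove the overlay) corresponds roughly to the commands below. As before this is a sketch that only prints the commands; the pool names are placeholders, and later Ceph releases may ask for an extra confirmation flag when setting forward mode.

```shell
#!/bin/sh
# Print (do not run) the commands for draining a cache tier and detaching
# it from client I/O. Pool names are supplied by the caller; the ones in
# the example call are placeholders.
print_tier_drain() {
    backing=$1
    cache=$2
    echo "ceph osd tier cache-mode $cache forward"
    echo "rados -p $cache cache-flush-evict-all"
    echo "ceph osd tier remove-overlay $backing"
}

print_tier_drain platter-pool ssd-cache
```

Objects held open by clients (here, RBD images in use by KVM) may refuse to evict, which matches the "some busy" remark above; the flush-evict step can be repeated once they are released.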

>>>>>

>>>>> So please, give me some direction to dig in...

>>>>>

>>>>>

>>>>> _______________________________________________

>>>>> ceph-users mailing list

>>>>> ceph-users@xxxxxxxxxxxxxx

>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>>

>>>>

>>>>


>>>>

>>>

>>


>>

>>

>

>


>
