Re: RE: CEPH cache layer. Very slow

Ben Hines <bhines@xxxxxxxxx> · Fri, 14 Aug 2015 11:01:45 -0700

Nice to hear that you have had no SSD failures yet in 10 months.

How many OSDs are you running, and what is your primary ceph workload?
(RBD, rgw, etc?)

-Ben

On Fri, Aug 14, 2015 at 2:23 AM, Межов Игорь Александрович
<megov@xxxxxxxxxx> wrote:
> Hi!
>
>
> Of course, it isn't cheap at all, but we use Intel DC S3700 200GB drives for
> Ceph journals and DC S3700 400GB drives in the SSD pool: same hosts, separate
> root in the crushmap.
>
> The SSD pool is not yet in production; the journaling SSDs have been working
> under production load for 10 months. They're in good condition: no faults, no
> degradation.
>
> We deliberately chose 200GB SSDs for journals to reduce costs, and we also
> have a higher than recommended OSD/SSD ratio: 1 SSD per 10-12 OSDs, while the
> recommended ratio is 1 SSD per 3 to 6 OSDs.
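
To make the trade-off in that ratio concrete, here is a minimal sketch. It assumes every OSD write passes through its journal first, so the journal SSD's sustained write bandwidth is shared by all OSDs behind it. The function name `per_osd_journal_share` and the 360 MB/s figure are placeholders for illustration, not measured or vendor-quoted values.

```python
def per_osd_journal_share(ssd_write_mb_s: float, osds_per_ssd: int) -> float:
    """Upper bound on journal bandwidth available to each OSD, in MB/s.

    Assumes all OSDs behind one journal SSD write at the same time and
    split the SSD's sustained write bandwidth evenly.
    """
    if osds_per_ssd <= 0:
        raise ValueError("osds_per_ssd must be positive")
    return ssd_write_mb_s / osds_per_ssd


if __name__ == "__main__":
    hypothetical_ssd_mb_s = 360.0  # placeholder value, not a real spec
    for osds in (3, 6, 10, 12):
        share = per_osd_journal_share(hypothetical_ssd_mb_s, osds)
        print(f"{osds:2d} OSDs per SSD: at most {share:.1f} MB/s each")
```

Under these assumptions, going from 6 to 12 OSDs per journal SSD halves the worst-case share per OSD, which is the cost being accepted in exchange for fewer SSDs.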
>
> So, in conclusion, I'd recommend that you get a bigger budget and buy durable
> and fast SSDs for Ceph.
>
> Megov Igor
> CIO, Yuterra
>
> ________________________________
> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Voloshanenko
> Igor <igor.voloshanenko@xxxxxxxxx>
> Sent: 13 August 2015, 15:54
> To: Jan Schermer
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  CEPH cache layer. Very slow
>
> OK, good, but the price of the 845 DC PRO 400 GB is about 2x higher than the
> Intel S3500 240G (((
>
> Any other models? (((
>
> 2015-08-13 15:45 GMT+03:00 Jan Schermer <jan@xxxxxxxxxxx>:
>>
>> I tested and can recommend the Samsung 845 DC PRO (make sure it is DC PRO
>> and not just "PRO" or "DC EVO"!).
>> Those were very cheap but are out of stock at the moment (here).
>> Faster than Intels, cheaper, and slightly different technology (3D V-NAND)
>> which IMO makes them superior without needing many tricks to do its job.
>>
>> Jan
>>
>> On 13 Aug 2015, at 14:40, Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>
>> wrote:
>>
>> Thanks, Irek! Will try!
>>
>> But another question to all: which SSDs are good enough for Ceph now?
>>
>> I'm looking into the S3500 240G (I have some S3500 120G drives which show
>> great results, around 8x better than the Samsung).
>>
>> Could you possibly advise on other vendors/models at the same or a lower
>> price level than the S3500 240G?
>>
>> 2015-08-13 12:11 GMT+03:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
>>>
>>> Hi, Igor.
>>> Try applying the patch from here:
>>>
>>> http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov
>>>
>>> P.S. I no longer track changes in this direction (kernel), because we
>>> already use the recommended SSDs.
>>>
>>> Best regards, Фасихов Ирек Нургаязович
>>> Mobile: +79229045757
>>>
>>> 2015-08-13 11:56 GMT+03:00 Voloshanenko Igor
>>> <igor.voloshanenko@xxxxxxxxx>:
>>>>
>>>> So, here are the results of testing the SSD (I wiped one SSD and used it
>>>> for the tests):
>>>>
>>>> root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1
>>>> --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>>> --group_reporting --name=journal-test
>>>> journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
>>>> fio-2.1.3
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta 00m:00s]
>>>> journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13 10:46:42 2015
>>>>   write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
>>>>     clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
>>>>      lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[ 2928],
>>>>      | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
>>>>      | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[ 4016],
>>>>      | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792], 99.95th=[10048],
>>>>      | 99.99th=[14912]
>>>>     bw (KB  /s): min= 1064, max= 1213, per=100.00%, avg=1150.07, stdev=34.31
>>>>     lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
>>>>   cpu          : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
>>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued    : total=r=0/w=17243/d=0, short=r=0/w=0/d=0
>>>>
>>>> Run status group 0 (all jobs):
>>>>   WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s, mint=60001msec, maxt=60001msec
>>>>
>>>> Disk stats (read/write):
>>>>   sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%
>>>>
>>>> So, it's painful... The SSD does only 287 IOPS at 4K... 1.1 MB/s
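
The figures in the fio report above are consistent with each other, which a short sketch can confirm. This is only an illustration and not part of the original test: the helper names are made up here, and it assumes a single job at queue depth 1, where each synchronous write must complete before the next one is issued.

```python
def iops_from_latency(avg_latency_ms: float, iodepth: int = 1) -> float:
    """Estimate IOPS from average completion latency.

    With `iodepth` requests in flight and each taking `avg_latency_ms`
    on average, roughly iodepth * (1000 / latency) requests finish per second.
    """
    return iodepth * 1000.0 / avg_latency_ms


def bandwidth_kb_s(iops: float, block_size_kb: float = 4.0) -> float:
    """Bandwidth in KB/s for a given IOPS rate and block size."""
    return iops * block_size_kb


if __name__ == "__main__":
    estimated = iops_from_latency(3.48)      # avg clat from the report
    print(f"estimated IOPS: {estimated:.0f}")
    print(f"bandwidth at 287 IOPS: {bandwidth_kb_s(287):.0f} KB/s")
```

An average latency of 3.48 ms per synchronous 4K write gives about 287 IOPS, and 287 IOPS at 4 KB each is about 1148 KB/s, in line with the 1149.6 KB/s that fio reported. The per-write latency, rather than the interface bandwidth, is the limit in this test.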
>>>>
>>>> I tried changing the cache mode:
>>>> echo temporary write through > /sys/class/scsi_disk/2:0:0:0/cache_type
>>>> echo temporary write through > /sys/class/scsi_disk/3:0:0:0/cache_type
>>>>
>>>> No luck, still the same shit results. I also found this article:
>>>> https://lkml.org/lkml/2013/11/20/264 which points to an old, very simple
>>>> patch that disables CMD_FLUSH:
>>>> https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba
>>>>
>>>> Does anybody have better ideas on how to improve this? (Or how to disable
>>>> CMD_FLUSH without recompiling the kernel? I use Ubuntu with kernel 4.0.4
>>>> for now: the 4.x branch because the SSD 850 Pro has an issue with NCQ TRIM,
>>>> and before 4.0.4 this exception was not included in libata-core.c.)
>>>>
>>>> 2015-08-12 19:17 GMT+03:00 Pieter Koorts <pieter.koorts@xxxxxx>:
>>>>>
>>>>> Hi Igor
>>>>>
>>>>> I suspect you have very much the same problem as me.
>>>>>
>>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg22260.html
>>>>>
>>>>> Basically, Samsung drives (like many SATA SSDs) are very much hit and
>>>>> miss, so you will need to test them as described here to see if they are
>>>>> any good.
>>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>
>>>>> To give you an idea, my average write performance went from 11MB/s (with
>>>>> the Samsung SSD) to 30MB/s (without any SSD). This is a
>>>>> very small cluster.
>>>>>
>>>>> Pieter
>>>>>
>>>>> On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor
>>>>> <igor.voloshanenko@xxxxxxxxx> wrote:
>>>>>
>>>>> Hi all, we have set up a Ceph cluster with 60 OSDs of 2 different types
>>>>> (5 nodes, 12 disks on each: 10 HDD, 2 SSD).
>>>>>
>>>>> We also cover this with a custom crushmap with 2 roots:
>>>>>
>>>>> ID   WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>> -100 5.00000 root ssd
>>>>> -102 1.00000     host ix-s2-ssd
>>>>>    2 1.00000         osd.2               up  1.00000          1.00000
>>>>>    9 1.00000         osd.9               up  1.00000          1.00000
>>>>> -103 1.00000     host ix-s3-ssd
>>>>>    3 1.00000         osd.3               up  1.00000          1.00000
>>>>>    7 1.00000         osd.7               up  1.00000          1.00000
>>>>> -104 1.00000     host ix-s5-ssd
>>>>>    1 1.00000         osd.1               up  1.00000          1.00000
>>>>>    6 1.00000         osd.6               up  1.00000          1.00000
>>>>> -105 1.00000     host ix-s6-ssd
>>>>>    4 1.00000         osd.4               up  1.00000          1.00000
>>>>>    8 1.00000         osd.8               up  1.00000          1.00000
>>>>> -106 1.00000     host ix-s7-ssd
>>>>>    0 1.00000         osd.0               up  1.00000          1.00000
>>>>>    5 1.00000         osd.5               up  1.00000          1.00000
>>>>>   -1 5.00000 root platter
>>>>>   -2 1.00000     host ix-s2-platter
>>>>>   13 1.00000         osd.13              up  1.00000          1.00000
>>>>>   17 1.00000         osd.17              up  1.00000          1.00000
>>>>>   21 1.00000         osd.21              up  1.00000          1.00000
>>>>>   27 1.00000         osd.27              up  1.00000          1.00000
>>>>>   32 1.00000         osd.32              up  1.00000          1.00000
>>>>>   37 1.00000         osd.37              up  1.00000          1.00000
>>>>>   44 1.00000         osd.44              up  1.00000          1.00000
>>>>>   48 1.00000         osd.48              up  1.00000          1.00000
>>>>>   55 1.00000         osd.55              up  1.00000          1.00000
>>>>>   59 1.00000         osd.59              up  1.00000          1.00000
>>>>>   -3 1.00000     host ix-s3-platter
>>>>>   14 1.00000         osd.14              up  1.00000          1.00000
>>>>>   18 1.00000         osd.18              up  1.00000          1.00000
>>>>>   23 1.00000         osd.23              up  1.00000          1.00000
>>>>>   28 1.00000         osd.28              up  1.00000          1.00000
>>>>>   33 1.00000         osd.33              up  1.00000          1.00000
>>>>>   39 1.00000         osd.39              up  1.00000          1.00000
>>>>>   43 1.00000         osd.43              up  1.00000          1.00000
>>>>>   47 1.00000         osd.47              up  1.00000          1.00000
>>>>>   54 1.00000         osd.54              up  1.00000          1.00000
>>>>>   58 1.00000         osd.58              up  1.00000          1.00000
>>>>>   -4 1.00000     host ix-s5-platter
>>>>>   11 1.00000         osd.11              up  1.00000          1.00000
>>>>>   16 1.00000         osd.16              up  1.00000          1.00000
>>>>>   22 1.00000         osd.22              up  1.00000          1.00000
>>>>>   26 1.00000         osd.26              up  1.00000          1.00000
>>>>>   31 1.00000         osd.31              up  1.00000          1.00000
>>>>>   36 1.00000         osd.36              up  1.00000          1.00000
>>>>>   41 1.00000         osd.41              up  1.00000          1.00000
>>>>>   46 1.00000         osd.46              up  1.00000          1.00000
>>>>>   51 1.00000         osd.51              up  1.00000          1.00000
>>>>>   56 1.00000         osd.56              up  1.00000          1.00000
>>>>>   -5 1.00000     host ix-s6-platter
>>>>>   12 1.00000         osd.12              up  1.00000          1.00000
>>>>>   19 1.00000         osd.19              up  1.00000          1.00000
>>>>>   24 1.00000         osd.24              up  1.00000          1.00000
>>>>>   29 1.00000         osd.29              up  1.00000          1.00000
>>>>>   34 1.00000         osd.34              up  1.00000          1.00000
>>>>>   38 1.00000         osd.38              up  1.00000          1.00000
>>>>>   42 1.00000         osd.42              up  1.00000          1.00000
>>>>>   50 1.00000         osd.50              up  1.00000          1.00000
>>>>>   53 1.00000         osd.53              up  1.00000          1.00000
>>>>>   57 1.00000         osd.57              up  1.00000          1.00000
>>>>>   -6 1.00000     host ix-s7-platter
>>>>>   10 1.00000         osd.10              up  1.00000          1.00000
>>>>>   15 1.00000         osd.15              up  1.00000          1.00000
>>>>>   20 1.00000         osd.20              up  1.00000          1.00000
>>>>>   25 1.00000         osd.25              up  1.00000          1.00000
>>>>>   30 1.00000         osd.30              up  1.00000          1.00000
>>>>>   35 1.00000         osd.35              up  1.00000          1.00000
>>>>>   40 1.00000         osd.40              up  1.00000          1.00000
>>>>>   45 1.00000         osd.45              up  1.00000          1.00000
>>>>>   49 1.00000         osd.49              up  1.00000          1.00000
>>>>>   52 1.00000         osd.52              up  1.00000          1.00000
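
As a quick check that a listing like the one above matches the described layout, the tree can be tallied per CRUSH root. The following is a minimal sketch that assumes the plain-text column layout shown above with any quote markers already removed. `count_osds_per_root` is a hypothetical helper, and `SAMPLE` is a shortened excerpt rather than the full tree.

```python
from collections import Counter

# Shortened excerpt of the tree above, used only to exercise the parser.
SAMPLE = """\
ID   WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
-100 5.00000 root ssd
-102 1.00000     host ix-s2-ssd
   2 1.00000         osd.2               up  1.00000          1.00000
   9 1.00000         osd.9               up  1.00000          1.00000
  -1 5.00000 root platter
  -2 1.00000     host ix-s2-platter
  13 1.00000         osd.13              up  1.00000          1.00000
  17 1.00000         osd.17              up  1.00000          1.00000
  21 1.00000         osd.21              up  1.00000          1.00000
"""


def count_osds_per_root(tree_text: str) -> dict:
    """Count OSD rows under each 'root' row of an osd-tree style listing."""
    counts = Counter()
    current_root = None
    for line in tree_text.splitlines():
        tokens = line.split()
        if len(tokens) < 3:
            continue  # blank or malformed line
        if tokens[2] == "root" and len(tokens) >= 4:
            current_root = tokens[3]       # start of a new CRUSH root
        elif tokens[2].startswith("osd.") and current_root is not None:
            counts[current_root] += 1      # OSD row belongs to the last root seen
    return dict(counts)


if __name__ == "__main__":
    print(count_osds_per_root(SAMPLE))
```

Run against the full listing, the same tally should come to 10 OSDs under `ssd` and 50 under `platter`, matching the 60 OSDs described.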
>>>>>
>>>>>
>>>>> Then we created 2 pools, 1 on HDD (platters) and 1 on SSD,
>>>>> and put the SSD pool in front of the HDD pool (cache tier).
>>>>>
>>>>> Now we get very bad performance results from the cluster.
>>>>> Even with rados bench we get very unstable performance, at times dropping
>>>>> to zero speed. This creates very big issues for our clients.
>>>>>
>>>>> I tried tuning all possible values, including OSD settings, but still no
>>>>> luck.
>>>>>
>>>>> Another unbelievable situation: when I run
>>>>> ceph tell... bench on an SSD OSD, I get about 20MB/s,
>>>>> while for an HDD I get 67 MB/s...
>>>>>
>>>>> I don't understand why cache pools which consist of SSDs work so badly...
>>>>> We use Samsung 850 Pro 256GB drives as SSDs.
>>>>>
>>>>> Can you guys give me advice please...
>>>>>
>>>>> Also a very idiotic thing: when I set cache-mode to forward and try to
>>>>> flush-evict all objects (not all objects were evicted; some were busy,
>>>>> locked on the KVM side), I now get quite stable results for rados bench:
>>>>>
>>>>> Total time run:         30.275871
>>>>> Total writes made:      2076
>>>>> Write size:             4194304
>>>>> Bandwidth (MB/sec):     274.278
>>>>>
>>>>> Stddev Bandwidth:       75.1445
>>>>> Max bandwidth (MB/sec): 368
>>>>> Min bandwidth (MB/sec): 0
>>>>> Average Latency:        0.232892
>>>>> Stddev Latency:         0.240356
>>>>> Max latency:            2.01436
>>>>> Min latency:            0.0716344
>>>>>
>>>>> Without zeros, etc...  So I don't understand how that's possible.
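
The reported bandwidth in the summary above follows directly from the other fields, as this small sketch shows. The function name is made up for illustration. The only assumption is that "MB" in the output means 2^20 bytes, which is what makes the numbers agree.

```python
def rados_bench_bandwidth(writes: int, write_size_bytes: int, seconds: float) -> float:
    """Average bandwidth in MB/s (MB = 2**20 bytes) from rados bench totals."""
    total_mb = writes * write_size_bytes / (1024 * 1024)
    return total_mb / seconds


if __name__ == "__main__":
    # Values taken from the rados bench summary above.
    bw = rados_bench_bandwidth(writes=2076, write_size_bytes=4194304, seconds=30.275871)
    print(f"{bw:.3f} MB/s")
```

2076 writes of 4194304 bytes each over 30.275871 seconds works out to about 274.278 MB/s, the same as the "Bandwidth (MB/sec)" line. Note this is an average over the run, so it says nothing on its own about stalls within the run.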
>>>>>
>>>>> Another interesting thing: when I disable the overlay for the pool, rados
>>>>> bench gets around 70MB/s, as for ordinary HDDs, but at the same time rados
>>>>> bench for the SSD pool, which is not used anymore, shows the same bad
>>>>> results...
>>>>>
>>>>> So please, give me some direction to dig in...
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com