Thanks Nick,
It seems Ceph has a big performance gap on an all-SSD setup; software latency can be the bottleneck.
https://ceph.com/planet/the-ceph-and-tcmalloc-performance-story/
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Zhang.pdf
http://events.linuxfoundation.org/sites/events/files/slides/optimizing_ceph_flash.pdf
Build with jemalloc and try again...
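(Roughly what I have in mind, as a sketch rather than the official procedure, since the exact configure switch may differ per release: install the jemalloc headers first, then

  ./autogen.sh
  ./configure --with-jemalloc
  make

or, alternatively, preload libjemalloc into the OSD processes via LD_PRELOAD instead of rebuilding.)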
2016-02-12 20:57 GMT+08:00 Nick Fisk <nick@xxxxxxxxxx>:
I will do my best to answer, but some of the questions are starting to stretch the limits of my knowledge.
> -----Original Message-----
> From: Huan Zhang [mailto:huan.zhang.jn@xxxxxxxxx]
> Sent: 12 February 2016 12:15
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Irek Fasikhov <malmyzh@xxxxxxxxx>; ceph-users <ceph-
> users@xxxxxxxx>
> Subject: Re: ceph 9.2.0 SAMSUNG ssd performance issue?
>
> My environment:
> 32 cores Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
> 10GiB NICS
> 4 osds/host
>
> My client is a database (MySQL) doing a direct/sync write per transaction, so it is a little
> sensitive to IO latency (sync/direct).
Ok, yes, write latency is important here if your DBs will be doing lots of small inserts/updates.
> I used SATA disks for the OSD backends and got ~100 iops (4k, iodepth=1) with ~10ms IO
> latency, similar to a single SATA disk's iops (fio direct=1 sync=1 bs=4k).
>
> To improve the MySQL write performance I switched to SSDs instead, since SSD
> latency is over 100 times better than SATA,
> but the result is disappointing to me.
Yes, there is an inherent performance cap in software defined storage, mainly due to the fact you are swapping a SAS cable for networking + code. You will never get raw SSD performance at low queue depths because of this, although I hope that at some point in the future Ceph should be able to hit about 1000 iops with replication.
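As a rough back-of-envelope illustration (my own numbers, not a measurement): at queue depth 1 the iops ceiling is simply 1 / per-write latency, so

  echo "1 / 0.0011" | bc -l   # ~909 iops if 1.1ms of journal latency were the only cost
  echo "1 / 0.0025" | bc -l   # ~400 iops once ~1-2ms of network, replication and OSD processing is added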
>
> There are two things still strange to me.
> 1. fio on the journal partition shows ~77us latency, so why is filestore->journal_latency
> ~1.1ms?
This is most likely due to Ceph not just doing a straight single write; there is likely other processing happening as well. I'm sure someone a bit more knowledgeable could probably elaborate a bit more.
> fio --filename=/dev/sda2 --direct=1 --sync=1 --rw=write --bs=4k --
> numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --
> name=journal-test
>
> lat (usec): min=43, max=1503, avg=77.75, stdev=17.42
>
> 2. The 1.1ms journal_latency is far better than the SATA (5-10ms) I used before, so
> why is the ceph end-to-end latency not improved (SSD ~7ms, SATA ~10ms)?
The journal write is just a small part of the write process, i.e. check the crush map, send the replica request... and lots more.
> 2ms would seem to make sense to me. Is there a way to calculate the total latency,
> like journal_latency + ... = total latency?
>
Possibly, but I couldn't even attempt to answer this. If you find out, please let me know as I would also find this very useful :-)
One thing you can do is turn the debug logging right up and then in the logs you can see the steps that each IO takes and how long it took.
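Another option, without touching the log levels, is to ask the OSD's admin socket for a breakdown of recent ops; each op is reported with timestamped events (queued for PG, journal write, commit, etc.), though the exact event names vary a bit between releases:

  ceph daemon osd.0 dump_historic_ops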
Which brings me on to my next point: turn all logging down to 0/0 (http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/). At 4k IOs the overhead of logging is significant.
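For example, something along these lines in ceph.conf covers the usual suspects (see the linked page for the full list of subsystems; this is just an illustration, not an exhaustive set):

  [global]
  debug ms = 0/0
  debug osd = 0/0
  debug filestore = 0/0
  debug journal = 0/0

You can also inject them at runtime, e.g. ceph tell osd.* injectargs '--debug-osd 0/0 --debug-ms 0/0'.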
Other things to try are setting the kernel parameter idle=poll, at the risk of increased power usage, and seeing if you can stop your CPUs going into power saving states.
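A rough sketch of what I mean, assuming an Intel box booting with GRUB2 (adjust paths and parameters to your distro):

  # /etc/default/grub - append to the kernel command line, regenerate grub.cfg and reboot
  GRUB_CMDLINE_LINUX="... idle=poll intel_idle.max_cstate=0 processor.max_cstate=1"

  # and/or pin the CPU frequency governor to performance
  cpupower frequency-set -g performance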
If anybody else has any other good ideas, please step in.
Nick
>
> 2016-02-12 19:28 GMT+08:00 Nick Fisk <nick@xxxxxxxxxx>:
> Write latency of 1.1ms is ok, but not brilliant. What IO size are you testing
> with?
>
> Don't forget if you have a journal latency of 1.1ms, excluding all other latency
> introduced by networking, replication and processing in the OSD code, you
> won't get more than about 900 iops. All the things I mention all add latency
> and so you often see 2-3ms of latency for a replicated write. This in turn will
> limit you to 300-500 iops for directio writes.
>
> The fact you are seeing around 200 could be about right depending on IO
> size, CPU speed and network speed.
>
> Also what is your end use/requirement? This may or may not matter.
>
> Nick
>
> > -----Original Message-----
> > From: Huan Zhang [mailto:huan.zhang.jn@xxxxxxxxx]
> > Sent: 12 February 2016 11:00
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: Irek Fasikhov <malmyzh@xxxxxxxxx>; ceph-users <ceph-
> > users@xxxxxxxx>
> > Subject: Re: ceph 9.2.0 SAMSUNG ssd performance issue?
> >
> > Thanks Nick,
> > filestore->journal_latency: ~1.1ms
> > 214.0 / 180611 =
> > 0.0011848669239415096
> >
> > seems the SSD write is OK, any other ideas are highly appreciated!
> >
> > "filestore": {
> > "journal_queue_max_ops": 300,
> > "journal_queue_ops": 0,
> > "journal_ops": 180611,
> > "journal_queue_max_bytes": 33554432,
> > "journal_queue_bytes": 0,
> > "journal_bytes": 32637888155,
> > "journal_latency": {
> > "avgcount": 180611,
> > "sum": 214.095788552
> > },
> > "journal_wr": 176801,
> > "journal_wr_bytes": {
> > "avgcount": 176801,
> > "sum": 33122885632
> > },
> > "journal_full": 0,
> > "committing": 0,
> > "commitcycle": 14648,
> > "commitcycle_interval": {
> > "avgcount": 14648,
> > "sum": 73299.187956076
> > },
> >
> >
> > 2016-02-12 18:04 GMT+08:00 Nick Fisk <nick@xxxxxxxxxx>:
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> Behalf
> > Of
> > > Huan Zhang
> > > Sent: 12 February 2016 10:00
> > > To: Irek Fasikhov <malmyzh@xxxxxxxxx>
> > > Cc: ceph-users <ceph-users@xxxxxxxx>
> > > Subject: Re: ceph 9.2.0 SAMSUNG ssd performance issue?
> > >
> > > "op_w_latency":
> > > "avgcount": 42991,
> > > "sum": 402.804741329
> > >
> > > 402.0/42991
> > > 0.009350794352306296
> > >
> > > ~9ms latency; does that mean this SSD is not suitable as a journal device?
> >
> > I believe that counter includes lots of other operations in the OSD as well as
> > the journal write. If you want pure journal stats, I would look under the
> > filestore->journal_latency counter.
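> > For example, something along these lines should pull it straight from the admin socket (assuming jq is installed and the socket is in its default location):
> >
> >   ceph daemon osd.0 perf dump | jq '.filestore.journal_latency | .sum / .avgcount'
> >
> > which gives the average journal write latency in seconds.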
> >
> > >
> > >
> > > "osd": {
> > > "op_wip": 0,
> > > "op": 58683,
> > > "op_in_bytes": 7309042294,
> > > "op_out_bytes": 507137488,
> > > "op_latency": {
> > > "avgcount": 58683,
> > > "sum": 484.302231121
> > > },
> > > "op_process_latency": {
> > > "avgcount": 58683,
> > > "sum": 323.332046552
> > > },
> > > "op_r": 902,
> > > "op_r_out_bytes": 507137488,
> > > "op_r_latency": {
> > > "avgcount": 902,
> > > "sum": 0.793759596
> > > },
> > > "op_r_process_latency": {
> > > "avgcount": 902,
> > > "sum": 0.619918138
> > > },
> > > "op_w": 42991,
> > > "op_w_in_bytes": 7092142080,
> > > "op_w_rlat": {
> > > "avgcount": 38738,
> > > "sum": 334.643717526
> > > },
> > > "op_w_latency": {
> > > "avgcount": 42991,
> > > "sum": 402.804741329
> > > },
> > > "op_w_process_latency": {
> > > "avgcount": 42991,
> > > "sum": 260.489972416
> > > },
> > > ...
> > >
> > >
> > > 2016-02-12 15:56 GMT+08:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
> > > Hi.
> > > You need to read : https://www.sebastien-han.fr/blog/2014/10/10/ceph-
> > > how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > >
> > >
> > > Best regards, Fasikhov Irek Nurgayazovich
> > > Mob.: +79229045757
> > >
> > > 2016-02-12 10:41 GMT+03:00 Huan Zhang <huan.zhang.jn@xxxxxxxxx>:
> > > Hi,
> > >
> > > Ceph is VERY SLOW with 24 OSDs (SAMSUNG SSDs).
> > > fio /dev/rbd0 iodepth=1 direct=1 IOPS only ~200
> > > fio /dev/rbd0 iodepth=32 direct=1 IOPS only ~3000
> > >
> > > But testing a single SSD device with fio:
> > > fio iodepth=1 direct=1 IOPS ~15000
> > > fio iodepth=32 direct=1 IOPS ~30000
> > >
> > > Why is Ceph SO SLOW? Could you give me some help?
> > > Appreciated!
> > >
> > >
> > > My Environment:
> > > [root@szcrh-controller ~]# ceph -s
> > > cluster eb26a8b9-e937-4e56-a273-7166ffaa832e
> > > health HEALTH_WARN
> > > 1 mons down, quorum 0,1,2,3,4 ceph01,ceph02,ceph03,ceph04,ceph05
> > > monmap e1: 6 mons at {ceph01=10.10.204.144:6789/0,ceph02=10.10.204.145:6789/0,ceph03=10.10.204.146:6789/0,ceph04=10.10.204.147:6789/0,ceph05=10.10.204.148:6789/0,ceph06=0.0.0.0:0/5}
> > > election epoch 6, quorum 0,1,2,3,4 ceph01,ceph02,ceph03,ceph04,ceph05
> > > osdmap e114: 24 osds: 24 up, 24 in
> > > flags sortbitwise
> > > pgmap v2213: 1864 pgs, 3 pools, 49181 MB data, 4485 objects
> > > 144 GB used, 42638 GB / 42782 GB avail
> > > 1864 active+clean
> > >
> > > [root@ceph03 ~]# lsscsi
> > > [0:0:6:0] disk ATA SAMSUNG MZ7KM1T9 003Q /dev/sda
> > > [0:0:7:0] disk ATA SAMSUNG MZ7KM1T9 003Q /dev/sdb
> > > [0:0:8:0] disk ATA SAMSUNG MZ7KM1T9 003Q /dev/sdc
> > > [0:0:9:0] disk ATA SAMSUNG MZ7KM1T9 003Q /dev/sdd
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
>