Re: How to improve latencies and per-VM performance and latencies

Hi,

Just to add, there’s also a collectd plugin at https://github.com/rochaporto/collectd-ceph.

Things to check when you have slow read performance are:

*) how much fragmentation is there on those XFS partitions? With some workloads you reach high values pretty quickly. You can check each OSD like this (a small sketch automating this check follows right after this list):
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 1); do sudo xfs_db -c frag -r $osd;done
*) 32/48GB RAM on the OSD hosts could be increased. Since XFS is used and all the objects are files, ceph uses the linux page cache.
If your data set pretty much fits into that cache, you gain _a lot_ of read performance since there are almost no reads from the drives. We’re at 128GB per OSD host right now. Compared with the other options at hand this can be a cheap way of increasing performance. It won’t help you when you’re doing deep-scrubs or recovery though.
*) turn off logging
[global]
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[osd]
       debug lockdep = 0/0
       debug context = 0/0
       debug crush = 0/0
       debug buffer = 0/0
       debug timer = 0/0
       debug journaler = 0/0
       debug osd = 0/0
       debug optracker = 0/0
       debug objclass = 0/0
       debug filestore = 0/0
       debug journal = 0/0
       debug ms = 0/0
       debug monc = 0/0
       debug tp = 0/0
       debug auth = 0/0
       debug finisher = 0/0
       debug heartbeatmap = 0/0
       debug perfcounter = 0/0
       debug asok = 0/0
       debug throttle = 0/0

*) run htop or vmstat/iostat to determine whether it’s the CPU that’s getting maxed out or not.
*) just double-check the throughput and latency on the network (test with both low and high MTU, just to make sure; it’s painful to optimise everything else and then get bitten by the network ;)
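
To make the fragmentation check from the first point a bit easier to repeat, here is a rough sketch (not the attached script, and the 30% threshold is an arbitrary choice) that walks /etc/mtab, runs xfs_db on every OSD data partition and flags high values. It assumes the OSD mounts contain 'osd/ceph' in the path and that xfs_db prints its usual "... fragmentation factor N.NN%" line:

#!/usr/bin/env python
# Rough sketch: print the XFS fragmentation factor of every OSD data
# partition found in /etc/mtab. Assumes mountpoints containing 'osd/ceph'
# and xfs_db's usual "... fragmentation factor N.NN%" output; run via sudo.
import re
import subprocess

def osd_partitions(mtab="/etc/mtab"):
    with open(mtab) as f:
        for line in f:
            device, mountpoint = line.split()[:2]
            if "osd/ceph" in mountpoint:
                yield device, mountpoint

for device, mountpoint in osd_partitions():
    out = subprocess.check_output(["xfs_db", "-c", "frag", "-r", device])
    match = re.search(r"fragmentation factor ([\d.]+)%", out.decode())
    factor = float(match.group(1)) if match else float("nan")
    flag = "  <-- worth defragmenting?" if factor > 30.0 else ""  # arbitrary threshold
    print("{} ({}): {:.2f}%{}".format(device, mountpoint, factor, flag))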

2) I don’t see anything about resetting the perf counters in the admin socket help section:
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$osd.asok help
An easy way to list the OSD ids on a host if you want to change something on all of them:
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 2 | cut -d '-' -f 2); do echo $osd;done
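
On resetting the counters: since I don’t see a reset command there, a workaround is to take two 'perf dump' samples some seconds apart and divide the deltas, which gives the average latency over just that interval instead of since OSD start. A minimal sketch of the idea (the helper names are mine; it assumes the latency entries look like {"avgcount": N, "sum": S} with S in seconds, and that the op counters live under the 'osd' section of the dump):

# Minimal sketch: interval average latency from two 'perf dump' samples.
# Assumes entries of the form {"avgcount": N, "sum": S} with S in seconds,
# and the op counters under the "osd" section of the dump.
import json
import subprocess
import time

def perf_dump(osd):
    sock = "/var/run/ceph/ceph-osd.{}.asok".format(osd)
    return json.loads(subprocess.check_output(
        ["ceph", "--admin-daemon", sock, "perf", "dump"]))

def interval_avg_ms(before, after):
    ops = after["avgcount"] - before["avgcount"]
    secs = after["sum"] - before["sum"]
    return secs / ops * 1000.0 if ops else 0.0

a = perf_dump(1)
time.sleep(30)
b = perf_dump(1)
print("op_r_latency over the last 30s: {:.2f} ms".format(
    interval_avg_ms(a["osd"]["op_r_latency"], b["osd"]["op_r_latency"])))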

3) This is from one of our OSDs, in a cluster about the same size as yours but with SATA drives for backing (a bit more CPU and memory, though):

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump | grep -A 1 -e op_latency -e op_[rw]_latency -e op_[rw]_process_latency -e journal_latency
      "journal_latency": { "avgcount": 406051353,
          "sum": 230178.927806000},
--
      "op_latency": { "avgcount": 272537987,
          "sum": 4337608.211040000},
--
      "op_r_latency": { "avgcount": 111672059,
          "sum": 758059.732591000},
--
      "op_w_latency": { "avgcount": 9308193,
          "sum": 174762.139637000},
--
      "subop_latency": { "avgcount": 273742609,
          "sum": 1084598.823585000},
--
      "subop_w_latency": { "avgcount": 273742609,
          "sum": 1084598.823585000},

Cheers
Josef

On 20 May 2015, at 10:20, Межов Игорь Александрович <megov@xxxxxxxxxx> wrote:

Hi!

1. Use it at your own risk. I'm not responsible for any damage you may cause by running this script.

2. What it is for.
Every Ceph osd daemon has a so-called 'admin socket' - a unix socket, local to the osd host, that we
can use to issue commands to that osd. The script connects over ssh to a list of osd hosts (currently
hardcoded in the source, but easily changed), lists all admin sockets in /var/run/ceph, greps the
socket names for the osd numbers, and issues the 'perf dump' command to every osd. The JSON output is
parsed with the standard python libraries and some latency parameters are extracted from it. They are
encoded in the JSON as pairs containing the total accumulated time and the count of events, so dividing
the time by the count gives the average latency for one or more ceph operations. Min/max/avg are
computed for every host and for the whole cluster, and the latency of every osd is compared to the
cluster (or host) minimum and colorized, so values that are too high are easy to spot.
You can find a usage example in the comments at the top of the script; the hardcoded values, which you
may want to change, are also gathered at the top.
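
For those who don't want to grab the attachment right away, a stripped-down, single-host sketch of the same idea could look like the following (this is not the attached script; it just walks the local admin sockets, collects every *_latency counter from 'perf dump' and prints its lifetime average):

#!/usr/bin/env python
# Single-host sketch of the idea described above (not the attached script):
# walk the local OSD admin sockets, pull 'perf dump', collect every
# *_latency counter and print its lifetime average in milliseconds.
import glob
import json
import subprocess

def walk_latencies(node, prefix=""):
    # Recursively yield (name, avgcount, sum) for every *_latency entry.
    for key, value in node.items():
        if not isinstance(value, dict):
            continue
        if key.endswith("_latency") and "avgcount" in value and "sum" in value:
            yield prefix + key, value["avgcount"], value["sum"]
        else:
            for item in walk_latencies(value, prefix + key + "."):
                yield item

for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
    dump = json.loads(subprocess.check_output(
        ["ceph", "--admin-daemon", sock, "perf", "dump"]))
    print(sock)
    for name, count, total in walk_latencies(dump):
        if count:
            print("  {:<45} {:8.2f} ms".format(name, float(total) / count * 1000.0))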

3. I use the script on Ceph Firefly 0.80.7, but I think it will work on any release that supports
admin socket connections to the osds, the 'perf dump' command, and the same JSON output structure.

4. As the script connects to the osd hosts over ssh one by one, it is slow, especially when you have
many osd hosts. Also, all osds of a host are printed in one row, so if you have >12 osds per host
the output gets slightly messy.

PS: This is my first python script, so suggestions and improvements are welcome ;)


Megov Igor
CIO, Yuterra

________________________________________
From: Michael Kuriger <mk7193@xxxxxx>
Sent: 19 May 2015 18:51
To: Межов Игорь Александрович
Subject: Re: How to improve latencies and per-VM performance and latencies

Awesome!  I would be interested in doing this as well.  Care to share how
your script works?

Thanks!




Michael Kuriger
Sr. Unix Systems Engineer
mk7193@xxxxxx | 818-649-7235





On 5/19/15, 6:31 AM, "Межов Игорь Александрович" <megov@xxxxxxxxxx> wrote:

Hi!

Seeking a performance improvement in our cluster (Firefly 0.80.7 on Wheezy,
5 nodes, 58 osds), I wrote a small python script that walks through the ceph
nodes and issues the 'perf dump' command on the osd admin sockets. It
extracts the *_latency pairs, calculates min/max/avg, compares each osd's
perf metrics with the min/avg of the whole cluster or of the same host, and
displays the result in table form. The goal is to find where most of the
latency is.

The hardware is not new and shiny:
- 5 nodes * 10-12 OSDs each
- Intel E5520@2.26/32-48Gb DDR3-1066 ECC
- 10Gbit X520DA interconnect
- Intel DC3700 200Gb as a system volume + journals, connected to sata2
onboard in ahci mode
- Intel RS2MB044 / RS2BL080 SAS RAID in RAID0 per drive mode, WT, disk
cache disabled
- bunch of 1Tb or 2Tb various WD Black drives, 58 disks, 76Tb total
- replication = 3, filestore on xfs
- shared client and cluster 10Gbit network
- cluster used as rbd storage for VMs
- rbd_cache is on via 'cache=writeback' in libvirt (I suppose that is
true ;))
- no special tuning in ceph.conf:

osd mount options xfs = rw,noatime,inode64
osd disk threads = 2
osd op threads = 8
osd max backfills = 2
osd recovery max active = 2
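
You can double-check what a running osd actually picked up for these options through the admin socket; a quick sketch, assuming 'config show' returns JSON and the usual socket path:

# Quick check of the effective values on a running OSD (assumes the admin
# socket's 'config show' command returns JSON, and the usual socket path).
import json
import subprocess

WANTED = ["osd_disk_threads", "osd_op_threads",
          "osd_max_backfills", "osd_recovery_max_active"]

conf = json.loads(subprocess.check_output(
    ["ceph", "--admin-daemon", "/var/run/ceph/ceph-osd.1.asok",
     "config", "show"]))
for name in WANTED:
    print("{} = {}".format(name, conf.get(name)))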

I get rather slow read performance from within a VM, especially with QD=1,
so many VMs are running slowly. I think this HW config can perform better,
as I get 10-12k iops with QD=32 from time to time.

So I have some questions:
1. Am I right that the osd perf counters are cumulative and count up from
OSD start?
2. Is there a way to reset the perf counters without restarting the OSD
daemon? Maybe a command through the admin socket?
3. What latencies should I expect from my config, or what latencies do you
have on your clusters? Just as an example, or as a reference to compare my
values against. I'm mostly interested in
- 'op_latency',
- 'op_[r|w]_latency',
- 'op_[r|w]_process_latency'
- 'journal_latency'
But other parameters, like 'apply_latency' or
'queue_transaction_latency_avg', are also interesting to compare.
4. Where should I look first if I want to improve QD=1 (i.e. per-VM)
performance?

Thanks!

Megov Igor
CIO, Yuterra
<getosdstat.py.gz>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
