Re: How to improve latencies and per-VM performance

Hi!

1) The XFS fragmentation is not very high - from 6 to 10-12%, and one OSD has 19%. Are these
values high enough to hurt performance noticeably?

2) About the 32/48GB RAM. The cluster was built from slightly old HW, mostly on the Intel 5520 platform:
  - 3 nodes are on the Intel SR2612URR platform,
  - 1 node is a Supermicro 6026T-URF,
  - 1 node is even on the 5400 chipset.
The triple-channel memory nodes have all 12 memory slots filled with 4GB DIMMs, so no further extension
is possible unless we buy higher-capacity modules. The node on the 5400 chipset also cannot take more than 32GB RAM.
I think adding more memory would not speed up reads very much, especially in this HW config.
We have 216GB RAM in total, so let's imagine the best case, when all memory is used for page cache.
With 76TB raw and 44TB used capacity, that is only a ~0.49% cache-to-storage ratio, and even doubling the RAM
would only raise it to ~1%. I don't think the OSD read cache is very effective at the size of our data set.
Maybe things will get better as the number of nodes increases.

3) Thanks for the hint, I'll try to check this in a couple of days.

4) > run htop or vmstat/iostat to determine whether it's the CPU that's getting maxed out or not.
No. Under sustained load (without rebalancing, scrub or other cluster self-activity), even on the worst 5400
node I have ~50-60% CPU idle. Also, when I additionally run fio against RBD with a 4K random load, cluster IOPS
rises to 15-30k, so I don't think the CPU is the bottleneck.
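
For reference, the 4K random load comes from an fio job roughly like the sketch below (pool,
image and client names are placeholders; it needs fio built with librbd support):

# hedged sketch of a 4K random read job against an RBD image
[rbd-4k-randread]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test-image
rw=randread
bs=4k
iodepth=32
time_based=1
runtime=60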

5) The cluster shares the same 10Gbit network, without separation of client and cluster networks (in Ceph terms).

I've added these recommended sysctl parameters:
net.ipv4.tcp_low_latency = 1
net.core.rmem_default = 10000000
net.core.wmem_default = 10000000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

On the NIC (Intel X520-DA) we turned off interrupt coalescing to decrease latency. Maybe that made
things worse, because most offload features are disabled as well. The current kernel, 3.16 backported
from Debian Jessie, has a serious bug with tagged bond interfaces and HW offloading - frequent kernel
oopses and awful performance. Even scatter/gather offloads are disabled, so throughput on the interface
is no more than ~5-6Gbit.
Also, I have no tools to measure network latency with microsecond precision, but most pings are <0.08ms
(~70-80us) and nload shows ~150Mbit of traffic under the usual sustained load. On the other hand, running
rados bench with a 4M object size from a 10Gbit client shows >1GByte/s reads. So I think the network
under Ceph is OK.
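
For the record, the coalescing/offload state and the bench figure come from commands along these
lines (interface and pool names are placeholders):

# check interrupt coalescing and offload settings on the 10Gbit interface
ethtool -c eth2
ethtool -k eth2
# 4M-object rados bench: write objects first (kept for the read pass), then read them back
rados bench -p test-pool 60 write --no-cleanup
rados bench -p test-pool 60 seq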

On the other hand, the clients (VMs on ~25 hosts under KVM) are on 1Gbit and have <0.15ms pings.
Maybe coupling the hypervisor hosts more tightly to the Ceph hosts in the same 10Gbit network segment
will improve latencies slightly - we will check this soon.

6) In your perf dump I can see:
  -  230178/406051353 = 0.000567s =  0.567ms - avg journal latency
  - 4337608/272537987 = 15.92 ms - avg operation latency
  - 758059/111672059 = 6.79 ms - avg read operation latency
  - 174762/9308193 = 18.78 ms - avg write operation latency

And mine are (min/avg/max, ms):
- Journal operation latency = 0.63/0.78/1.04
- Operations latency = 3.76/8.87/18.39
- Read operations latency = 1.02/2.25/4.49
- Write operations latency = 9.61/23.94/50.30

Judging from the journal latency, you have your journals on SSD? What other HW do you use?
How many IOPS do you get, and how does your cluster perform with 4K random reads at low queue depth?
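
(Each average above is just the counter's 'sum' in seconds divided by its 'avgcount'. For a single
local OSD, something like this prints op_latency in ms - the socket path is the usual default and
may differ on your hosts:)

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | \
  python -c 'import json,sys; c=json.load(sys.stdin)["osd"]["op_latency"]; print(1000.0*c["sum"]/c["avgcount"])'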

Thanks!
Megov Igor
CIO, Yuterra


From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Josef Johansson <josef86@xxxxxxxxx>
Sent: 21 May 2015, 1:39
To: Межов Игорь Александрович
Cc: ceph-users
Subject: Re: How to improve latencies and per-VM performance and latencies
 
Hi,

Just to add, there’s also a collectd plugin at https://github.com/rochaporto/collectd-ceph.

Things to check when you have slow read performance are:

*) how much fragmentation is there on those xfs partitions? With some workloads you get high values pretty quickly.
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 1); do sudo xfs_db -c frag -r $osd;done
*) 32/48GB RAM on the OSD nodes could be increased. Since XFS is used and all the objects are files, Ceph uses the Linux page cache.
If your data set mostly fits into that cache, you can gain _a lot_ of read performance, since there are then almost no reads from the drives. We're at 128GB per OSD node right now. Compared with the options at hand this could be a cheap way of increasing performance. It won't help you when you're doing deep-scrubs or recovery though.
*) turn off logging
[global]
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[osd]
       debug lockdep = 0/0
       debug context = 0/0
       debug crush = 0/0
       debug buffer = 0/0
       debug timer = 0/0
       debug journaler = 0/0
       debug osd = 0/0
       debug optracker = 0/0
       debug objclass = 0/0
       debug filestore = 0/0
       debug journal = 0/0
       debug ms = 0/0
       debug monc = 0/0
       debug tp = 0/0
       debug auth = 0/0
       debug finisher = 0/0
       debug heartbeatmap = 0/0
       debug perfcounter = 0/0
       debug asok = 0/0
       debug throttle = 0/0

*) run htop or vmstat/iostat to determine whether it's the CPU that's getting maxed out or not.
*) just double-check the performance and latencies on the network (do it for low and high MTU, just to make sure - it's tough to optimise a lot elsewhere and then get bitten by the network ;); a quick check is sketched right below.
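
Something like this is usually enough for that check (the hostname is a placeholder; 8972 bytes is
a 9000-byte MTU minus 28 bytes of ICMP/IP headers, adjust for your MTU):

# latency with a small payload and with a near-MTU payload, DF bit set
ping -c 100 -s 64 ceph-node2
ping -c 100 -M do -s 8972 ceph-node2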

2) I don’t see anything in the help section about it:
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$osd.asok help
An easy way of getting the OSD ids if you want to change something globally:
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 2 | cut -d '-' -f 2); do echo $osd;done
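
And if the goal is to change something like the debug levels on all OSDs at runtime, injectargs
should do it too (double-check the option names against your release):

# all OSDs at once via the monitors
ceph tell 'osd.*' injectargs '--debug_osd 0/0 --debug_ms 0/0'
# or per local daemon through the admin socket
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$osd.asok config set debug_osd 0/0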

3) This is from one of the OSDs, on a cluster about the same size as yours but with SATA drives as backing (a bit more CPU and memory though):

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump | grep -A 1 -e op_latency -e op_[rw]_latency -e op_[rw]_process_latency -e journal_latency
      "journal_latency": { "avgcount": 406051353,
          "sum": 230178.927806000},
--
      "op_latency": { "avgcount": 272537987,
          "sum": 4337608.211040000},
--
      "op_r_latency": { "avgcount": 111672059,
          "sum": 758059.732591000},
--
      "op_w_latency": { "avgcount": 9308193,
          "sum": 174762.139637000},
--
      "subop_latency": { "avgcount": 273742609,
          "sum": 1084598.823585000},
--
      "subop_w_latency": { "avgcount": 273742609,
          "sum": 1084598.823585000},

Cheers
Josef

On 20 May 2015, at 10:20, Межов Игорь Александрович <megov@xxxxxxxxxx> wrote:

Hi!

1. Use it at your own risk. I'm not responsible for any damage you may cause by running this script.

2. What it is for.
A Ceph OSD daemon has a so-called 'admin socket' - a unix socket, local to the OSD host, that we can
use to issue commands to that OSD. The script connects over ssh to a list of OSD hosts (the list is
currently hardcoded in the source, but it's easy to change), lists all admin sockets in /var/run/ceph,
greps the socket names for OSD numbers, and issues the 'perf dump' command to every OSD. The JSON output
is parsed with the standard Python libs and some latency parameters are extracted from it. They are
encoded in the JSON as pairs containing the total amount of time in seconds and the count of events,
so dividing time by count gives the average latency for one or more Ceph operations. Min/max/avg are
computed for every host and for the whole cluster, and the latency of every OSD is compared to the
minimum value of the cluster (or host) and colorized to make too-high values easy to spot.
You can find a usage example in the comments at the top of the script and change the hardcoded values,
which are also gathered at the top.
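
A rough sketch of the same idea (not the attached script itself - host names are placeholders, and
there is no error handling, per-host grouping or colorizing; it assumes passwordless ssh/sudo):

#!/usr/bin/env python
# Walk a hardcoded list of OSD hosts over ssh, run 'perf dump' on every OSD
# admin socket and print the average latency (sum in seconds / avgcount)
# for each *_latency counter.
import json, subprocess

HOSTS = ["ceph-node1", "ceph-node2"]   # placeholder host names

def ssh(host, cmd):
    return subprocess.check_output(["ssh", host, cmd]).decode()

for host in HOSTS:
    for sock in ssh(host, "ls /var/run/ceph/ceph-osd.*.asok").split():
        dump = json.loads(ssh(host, "sudo ceph --admin-daemon %s perf dump" % sock))
        for section in dump.values():
            for name, c in sorted(section.items()):
                if name.endswith("_latency") and isinstance(c, dict) and c.get("avgcount"):
                    print("%s %s %-28s %8.2f ms" % (host, sock, name,
                                                    1000.0 * c["sum"] / c["avgcount"]))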

3. I use the script on Ceph Firefly 0.80.7, but I think it will work on any release that supports
admin socket connections to the OSDs, the 'perf dump' command and the same JSON output structure.

4. As it connects to the OSD hosts by ssh one by one, the script is slow, especially when you have
many OSD hosts. Also, all OSDs of a host are printed in one row, so if you have >12 OSDs per host,
the output will get slightly messed up.

PS: This is my first python script, so suggestions and improvements are welcome ;)


Megov Igor
CIO, Yuterra

________________________________________
From: Michael Kuriger <mk7193@xxxxxx>
Sent: 19 May 2015, 18:51
To: Межов Игорь Александрович
Subject: Re: How to improve latencies and per-VM performance and latencies

Awesome!  I would be interested in doing this as well.  Care to share how
your script works?

Thanks!




Michael Kuriger
Sr. Unix Systems Engineer
mk7193@xxxxxx | 818-649-7235





On 5/19/15, 6:31 AM, "Межов Игорь Александрович" <megov@xxxxxxxxxx> wrote:

Hi!

Seeking a performance improvement in our cluster (Firefly 0.80.7 on Wheezy, 5 nodes, 58 OSDs),
I wrote a small Python script that walks through the Ceph nodes and issues the 'perf dump'
command on the OSD admin sockets. It extracts the *_latency tuples, calculates min/max/avg,
compares each OSD's perf metrics with the min/avg of the whole cluster or of the same host,
and displays the result in table form. The goal is to find where most of the latency comes from.

The hardware is not new and shiny:
- 5 nodes * 10-12 OSDs each
- Intel E5520@2.26/32-48Gb DDR3-1066 ECC
- 10Gbit X520DA interconnect
- Intel DC3700 200Gb as a system volume + journals, connected to sata2
onboard in ahci mode
- Intel RS2MB044 / RS2BL080 SAS RAID in RAID0 per drive mode, WT, disk
cache disabled
- bunch of 1Tb or 2Tb various WD Black drives, 58 disks, 76Tb total
- replication = 3, filestore on xfs
- shared client and cluster 10Gbit network
- cluster used as rbd storage for VMs
- rbd_cache is on via 'cache=writeback' in libvirt (I suppose that it really is
on ;); see the disk definition example right after this list
- no special tuning in ceph.conf, apart from:

osd mount options xfs = rw,noatime,inode64
osd disk threads = 2
osd op threads = 8
osd max backfills = 2
osd recovery max active = 2
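
(Regarding the rbd_cache line above - the libvirt disk definition looks roughly like this; a
hedged example, pool/image name and monitor address are placeholders:)

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='10.0.0.1' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>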

I get rather slow read performance from inside the VMs, especially at QD=1, so many VMs are
running slowly. I think this HW config can perform better, as I get 10-12k IOPS at QD=32
from time to time.

So I have some questions:
1. Am I right that the OSD perf counters are cumulative and count up from OSD start?
2. Is there any way to reset the perf counters without restarting the OSD daemon? Maybe
a command through the admin socket?
3. What latencies should I expect from my config, or what latencies do you have on your
clusters? Just an example, or a reference to compare my values with. I'm mostly
interested in:
- 'op_latency',
- 'op_[r|w]_latency',
- 'op_[r|w]_process_latency'
- 'journal_latency'
But other parameters, like 'apply_latency' or
'queue_transaction_latency_avg', would also be interesting to compare.
4. Where should I look first if I need to improve QD=1 (i.e. per-VM) performance?

Thanks!

Megov Igor
CIO, Yuterra
<getosdstat.py.gz>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
