Re: How to improve latencies and per-VM performance

Hi!

1) XFS fragmentation is not very high: from 6% to 10-12%, with one osd at 19%. Are these values high enough
to hurt performance?

2) About the 32/48Gb RAM. The cluster was built from slightly old HW, mostly on the Intel 5520 platform:
  - 3 nodes are on the Intel SR2612URR platform,
  - 1 node is a Supermicro 6026T-URF,
  - 1 node is even on the 5400 chipset.
The 3-channel memory nodes have all 12 memory slots filled with 4Gb DIMMs, so no further expansion is
possible unless we buy higher-capacity modules. The node on the 5400 chipset is also limited to 32Gb RAM.
I think adding more memory will not speed up read operations much, especially in this HW config.
We have 216Gb RAM in total, so let's imagine the best case, where all memory is used for page cache.
With 76Tb raw and 44Tb used capacity, we have only a 0.49% cache-to-storage ratio, and even doubling the RAM
size will raise the ratio to only ~1%. I think the osd read cache is not very effective at our data size.
Maybe things will get better as the number of nodes increases.
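
The ratio is easy to re-check with a few lines of Python. This is only a sketch of the arithmetic, assuming decimal units (1 Tb = 1000 Gb); the function name is made up for illustration.

```python
def cache_ratio_percent(ram_gb, used_tb):
    """Return RAM as a percentage of used storage (decimal units assumed)."""
    return 100.0 * ram_gb / (used_tb * 1000.0)

current = cache_ratio_percent(216, 44)      # all RAM used as page cache
doubled = cache_ratio_percent(2 * 216, 44)  # after doubling the RAM
print(f"current: {current:.2f}%  doubled: {doubled:.2f}%")
# prints: current: 0.49%  doubled: 0.98%
```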

3) Thanks for the hint, I'll try to check this in a couple of days.

4) >run htop or vmstat/iostat to determine whether it’s the CPU that’s getting maxed out or not.
No. At sustained load (without rebalancing, scrub or other cluster self-activity), even on the worst 5400
node I have ~50-60% cpu idle. Also, when I additionally run fio-rbd with a 4K random load, cluster iops
rise to 15-30k, so I think the CPU isn't a bottleneck.

5) The cluster shares a single 10Gbit network, without separation of client and cluster networks (in ceph terms).

I've added these recommended sysctl parameters:
net.ipv4.tcp_low_latency = 1
net.core.rmem_default = 10000000
net.core.wmem_default = 10000000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

On the NICs (Intel X520DA) we turned off interrupt coalescing to decrease latency. Maybe that made things worse,
but most offload features are disabled as well. The current kernel is 3.16, backported from Debian Jessie, and it has a
serious bug with tagged bond interfaces and HW offloading: frequent kernel oopses and awful performance.
Even scatter/gather offload is disabled, so the throughput on the interface is no more than ~5-6Gbit.
Also, I have no tools to measure network latency at the microsecond level, but most pings are <0.08ms (~70-80us)
and nload shows ~150Mbit of traffic at the usual sustained load. On the other hand, running rados bench with a 4M
object size from a 10Gbit client shows >1Gbyte/s reads. So I think the network under ceph is OK.

On the other hand, the clients (VMs on ~25 hosts under KVM) are on 1Gbit and have <0.15ms pings.
Maybe coupling the hypervisor hosts more tightly with the ceph hosts, on the same 10Gbit network segment,
will improve latencies slightly; we will check this soon.

6) In your perf dump, I can see:
  - 230178/406051353 = 0.000567 s = 0.567 ms - avg journal latency
  - 4337608/272537987 = 15.92 ms - avg operation latency
  - 758059/111672059 = 6.79 ms - avg read operation latency
  - 174762/9308193 = 18.78 ms - avg write operation latency
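
The same division can be done in Python. This is a minimal sketch that assumes `sum` is the total time in seconds and `avgcount` is the number of events; the helper name is made up, and the values are copied from the perf dump quoted further down in this thread.

```python
def avg_latency_ms(counter):
    """Average latency in ms from a counter {"avgcount": n, "sum": seconds}."""
    if counter["avgcount"] == 0:
        return 0.0
    return 1000.0 * counter["sum"] / counter["avgcount"]

# Values copied from the perf dump quoted in this thread.
dump = {
    "journal_latency": {"avgcount": 406051353, "sum": 230178.927806},
    "op_latency": {"avgcount": 272537987, "sum": 4337608.211040},
    "op_r_latency": {"avgcount": 111672059, "sum": 758059.732591},
    "op_w_latency": {"avgcount": 9308193, "sum": 174762.139637},
}
for name, counter in dump.items():
    print(f"{name}: {avg_latency_ms(counter):.2f} ms")
```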

And mine are (min/avg/max, ms):
- Journal operation latency = 0.63/0.78/1.04
- Operations latency = 3.76/8.87/18.39
- Read operations latency = 1.02/2.25/4.49
- Write operations latency = 9.61/23.94/50.30

As I can see from the journal latency, you have journals on ssd? What other HW do you use?
How many iops do you get, and how does your cluster perform with 4K random reads at low queue depth?

Thanks!
Megov Igor
CIO, Yuterra


From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Josef Johansson <josef86@xxxxxxxxx>
Sent: 21 May 2015, 1:39
To: Межов Игорь Александрович
Cc: ceph-users
Subject: Re: How to improve latencies and per-VM performance and latencies
 
Hi,

Just to add, there’s also a collectd plugin at https://github.com/rochaporto/collectd-ceph.

Things to check when you have slow read performance:

*) how much fragmentation is there on those xfs partitions? With some workloads you get high values pretty quickly.
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 1); do sudo xfs_db -c frag -r $osd;done
*) 32/48GB RAM on the OSDs could be increased. Since XFS is used and all the objects are files, ceph uses the linux file cache.
If your data set pretty much fits into that cache, you can gain _a lot_ of read performance, since there are pretty much no reads from the drives. We’re at 128GB per OSD right now. Compared with the other options at hand, this could be a cheap way of increasing performance. It won’t help you when you’re doing deep-scrubs or recovery though.
*) turn off logging
[global]
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[osd]
       debug lockdep = 0/0
       debug context = 0/0
       debug crush = 0/0
       debug buffer = 0/0
       debug timer = 0/0
       debug journaler = 0/0
       debug osd = 0/0
       debug optracker = 0/0
       debug objclass = 0/0
       debug filestore = 0/0
       debug journal = 0/0
       debug ms = 0/0
       debug monc = 0/0
       debug tp = 0/0
       debug auth = 0/0
       debug finisher = 0/0
       debug heartbeatmap = 0/0
       debug perfcounter = 0/0
       debug asok = 0/0
       debug throttle = 0/0

*) run htop or vmstat/iostat to determine whether it’s the CPU that’s getting maxed out or not.
*) just double check the performance and latencies on the network (do it for low and high MTU, just to make sure, it’s tough to optimise a lot and get bitten by it ;)

2) I don’t see anything in the help section about it
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$osd.asok help
an easy way of getting the osds if you want to change something globally
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 2 | cut -d '-' -f 2); do echo $osd;done

3) this is on one of the OSDs, about the same size as yours but sata drives for backing ( a bit more cpu and memory though):

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump | grep -A 1 -e op_latency -e op_[rw]_latency -e op_[rw]_process_latency -e journal_latency
      "journal_latency": { "avgcount": 406051353,
          "sum": 230178.927806000},
--
      "op_latency": { "avgcount": 272537987,
          "sum": 4337608.211040000},
--
      "op_r_latency": { "avgcount": 111672059,
          "sum": 758059.732591000},
--
      "op_w_latency": { "avgcount": 9308193,
          "sum": 174762.139637000},
--
      "subop_latency": { "avgcount": 273742609,
          "sum": 1084598.823585000},
--
      "subop_w_latency": { "avgcount": 273742609,
          "sum": 1084598.823585000},

Cheers
Josef

On 20 May 2015, at 10:20, Межов Игорь Александрович <megov@xxxxxxxxxx> wrote:

Hi!

1. Use it at your own risk. I'm not responsible for any damage you may cause by running this script.

2. What it is for.
A Ceph osd daemon has a so-called 'admin socket': a unix socket, local to the osd host, that we can
use to issue commands to that osd. The script connects by ssh to a list of osd hosts (currently hardcoded in
the source code, but easy to change), lists all admin sockets in /var/run/ceph, greps the
socket names for osd numbers, and issues the 'perf dump' command to every osd. The json output is parsed
with standard python libs and some latency parameters are extracted from it. They are encoded in json as tuples
containing the total amount of time in seconds and the count of events, so dividing time by count gives the
average latency for one or more ceph operations. The min/max/avg are computed for every host and for the
whole cluster, and the latency of every osd is compared to the minimal value of the cluster (or host) and colorized,
so that values that are too high are easy to spot.
You can check the usage example in the comments at the top of the script and change the hardcoded values,
which are also gathered at the top.

3. I use the script on Ceph Firefly 0.80.7, but I think it will work on any release that supports
the admin socket connection to an osd, the 'perf dump' command and the same json output structure.

4. As we connect to the osd hosts by ssh one by one, the script is slow, especially when you have
many osd hosts. Also, all osds from a host are output in one row, so if you have >12 osds per host,
the output will get slightly messy.
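
For readers without the attachment, the aggregation step described in point 2 could look roughly like the sketch below. It is not the attached script: it skips the ssh part and works on 'perf dump' json text that was already collected. The function names, the `osd` section name, the threshold factor and the sample numbers are all assumptions made for illustration.

```python
import json

def avg_ms(counter):
    """Average latency in ms from {"avgcount": n, "sum": seconds}."""
    if not counter["avgcount"]:
        return 0.0
    return 1000.0 * counter["sum"] / counter["avgcount"]

def summarize(dumps, section, metric, factor=2.0):
    """dumps maps osd id -> 'perf dump' json text.

    Returns (min, avg, max, slow), where slow lists the osds whose
    latency exceeds factor * min.
    """
    lat = {osd: avg_ms(json.loads(text)[section][metric])
           for osd, text in dumps.items()}
    lo, hi = min(lat.values()), max(lat.values())
    mean = sum(lat.values()) / len(lat)
    slow = sorted(osd for osd, v in lat.items() if lo > 0 and v > factor * lo)
    return lo, mean, hi, slow

# Made-up sample data for three osds (not real measurements).
sample = {
    0: '{"osd": {"op_r_latency": {"avgcount": 1000, "sum": 2.0}}}',
    1: '{"osd": {"op_r_latency": {"avgcount": 1000, "sum": 2.5}}}',
    2: '{"osd": {"op_r_latency": {"avgcount": 1000, "sum": 6.0}}}',
}
lo, mean, hi, slow = summarize(sample, "osd", "op_r_latency")
print(f"min/avg/max = {lo:.2f}/{mean:.2f}/{hi:.2f} ms, slow osds: {slow}")
# prints: min/avg/max = 2.00/3.50/6.00 ms, slow osds: [2]
```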

PS: This is my first python script, so suggestions and improvements are welcome ;)


Megov Igor
CIO, Yuterra

________________________________________
From: Michael Kuriger <mk7193@xxxxxx>
Sent: 19 May 2015, 18:51
To: Межов Игорь Александрович
Subject: Re: How to improve latencies and per-VM performance  and latencies

Awesome!  I would be interested in doing this as well.  Care to share how
your script works?

Thanks!




Michael Kuriger
Sr. Unix Systems Engineer
mk7193@xxxxxx | 818-649-7235





On 5/19/15, 6:31 AM, "Межов Игорь Александрович" <megov@xxxxxxxxxx> wrote:

Hi!

Seeking a performance improvement in our cluster (Firefly 0.80.7 on Wheezy, 5 nodes, 58 osds), I wrote
a small python script that walks through the ceph nodes and issues the 'perf dump' command on the osd admin
sockets. It extracts the *_latency tuples, calculates min/max/avg, compares each osd's perf metrics with the min/avg
of the whole cluster or of the same host, and displays the result in table form. The goal is to check where
the most latency is.

The hardware is not new and shiny:
- 5 nodes * 10-12 OSDs each
- Intel E5520@2.26/32-48Gb DDR3-1066 ECC
- 10Gbit X520DA interconnect
- Intel DC3700 200Gb as a system volume + journals, connected to onboard sata2 in ahci mode
- Intel RS2MB044 / RS2BL080 SAS RAID in RAID0-per-drive mode, WT, disk cache disabled
- a bunch of various 1Tb or 2Tb WD Black drives, 58 disks, 76Tb total
- replication = 3, filestore on xfs
- shared client and cluster 10Gbit network
- cluster used as rbd storage for VMs
- rbd_cache is on via 'cache=writeback' in libvirt (I suppose that this is true ;))
- no special tuning in ceph.conf:

osd mount options xfs = rw,noatime,inode64
osd disk threads = 2
osd op threads = 8
osd max backfills = 2
osd recovery max active = 2

I get rather slow read performance from within a VM, especially with QD=1, so many VMs are running slowly.
I think this HW config can perform better, as I get 10-12k iops with QD=32 from time to time.

So I have some questions:
1. Am I right that the osd perf counters are cumulative and count up from OSD start?
2. Is there any way to reset the perf counters without restarting the OSD daemon? Maybe a command through
the admin socket?
3. What latencies should I expect from my config, or what latencies do you have on your clusters?
Just an example, or a reference to compare with my values. I'm interested mostly in
- 'op_latency',
- 'op_[r|w]_latency',
- 'op_[r|w]_process_latency'
- 'journal_latency'
But other parameters, like 'apply_latency' or 'queue_transaction_latency_avg', are also interesting to compare.
4. Where should I look first if I need to improve QD=1 (i.e. per-VM) performance?
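
If the counters are indeed cumulative, as question 1 supposes, a reset is not strictly required to look at recent behaviour: two samples taken some time apart can be subtracted. The sketch below shows only that arithmetic. It assumes each counter has the form {"avgcount": n, "sum": seconds}; the function name and the sample values are made up.

```python
def interval_latency_ms(before, after):
    """Average latency in ms between two samples of one cumulative counter.

    Each sample is {"avgcount": n, "sum": seconds}. Returns None when no
    operations completed in the interval or the counter went backwards
    (for example after an OSD restart).
    """
    ops = after["avgcount"] - before["avgcount"]
    secs = after["sum"] - before["sum"]
    if ops <= 0 or secs < 0:
        return None
    return 1000.0 * secs / ops

# Made-up samples taken some time apart (not real measurements).
t0 = {"avgcount": 1000, "sum": 10.0}
t1 = {"avgcount": 1500, "sum": 11.0}
print(interval_latency_ms(t0, t1))  # prints 2.0
```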

Thanks!

Megov Igor
CIO, Yuterra
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

<getosdstat.py.gz>
