Hello,

On Wed, 15 Mar 2017 16:49:00 +0000 Rhian Resnick wrote:

> Morning all,
>
> We are starting to apply load to our test cephfs system and are noticing
> some odd latency numbers. We are using erasure coding for the cold data
> pools and replication for our cache tiers (not on SSD yet). We noticed
> the following high latency on one node and it seems to be slowing down
> writes and reads on the cluster.
>

The pg dump below was massive overkill at this point in time, whereas a
"ceph osd tree" would have probably shown us the topology (where is your
tier, where are your EC pool(s)?). Same for a "ceph osd pool ls detail".

So if we were to assume that node is your cache tier (replica 1?), then
the latencies would make sense. But that's guesswork, so describe your
cluster in more detail.

And yes, a single slow OSD (stealthily failing drive, etc.) can bring a
cluster to its knees. This is why many people here tend to get every last
bit of info with collectd and feed it into carbon and graphite/grafana,
etc. This will immediately indicate culprits and allow you to correlate
this with other data, like actual disk/network/CPU load.

For the time being run atop on that node and see if you can reduce the
issue to something like "all disks are busy all the time" or "CPU
meltdown".

> Our next step is to break out mds, mgr, and mons to different machines,
> but we wanted to start the discussion here.
>

If your nodes (not a single iota of HW/NW info from you) are powerful
enough, breaking things out isn't likely to help, nor is it a necessity.
More below.

> Here is a bunch of information you may find useful.
>
> ceph.conf
>
> [global]
> fsid = XXXXX
> mon_initial_members = ceph-mon1, ceph-mon2, ceph-mon3
> mon_host = 10.141.167.238,10.141.160.251,10.141.161.249
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> cluster network = 10.85.8.0/22
> public network = 10.141.0.0/16
>
> # we tested this with bluestore and xfs and have the same results
> [osd]
> enable_experimental_unrecoverable_data_corrupting_features = bluestore
>

I suppose this is not production in any shape or form.

> Status
>
>     cluster 8f6ba9d6-314d-4725-bcfa-340e500697f0
>      health HEALTH_OK
>      monmap e2: 3 mons at {ceph-mon1=10.141.167.238:6789/0,ceph-mon2=10.141.160.251:6789/0,ceph-mon3=10.141.161.249:6789/0}
>             election epoch 12, quorum 0,1,2 ceph-mon2,ceph-mon3,ceph-mon1
>       fsmap e30: 1/1/1 up {0=ceph-mon3=up:active}, 2 up:standby
>         mgr active: ceph-mon3 standbys: ceph-mon2, ceph-mon1
>      osdmap e100: 12 osds: 12 up, 12 in
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v119525: 124 pgs, 6 pools, 471 GB data, 1141 kobjects
>             970 GB used, 2231 GB / 3202 GB avail
>                  124 active+clean
>   client io 11962 B/s rd, 11 op/s rd, 0 op/s wr
>

At first glance there seem to be way too few PGs here, even given the low
number of OSDs.

> Pool space usage
>

Irrelevant.

> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     3202G     2231G     970G         30.31
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0      0          0         580G          0
>     cephfs-hot          1      76137M     11.35     580G          466451
>     cephfs-cold         2      397G       25.48     1161G         650158
>     cephfs_metadata     3      47237k     0         580G          52275
>     one-hot             4      0          0         580G          0
>     one                 5      0          0         1161G         0
>

An aside: how happy are you with OpenNebula and Ceph? I found that the
lack of a migration network option in ON is a show stopper for us.
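Coming back to the topology and slow-OSD questions above, this is roughly
where I would start (osd.1 is just a placeholder, substitute whichever
OSD/node you suspect; the daemon/iostat commands need to be run on the
node hosting that OSD):

---
# Topology and pool layout: tiering, EC profiles, sizes, pg_num
ceph osd tree
ceph osd pool ls detail
ceph osd df

# The same latency counters as in your listing, refreshed every 2 seconds
watch -n 2 ceph osd perf

# On the suspect node: per-OSD internals and raw disk utilization
ceph daemon osd.1 perf dump
iostat -x 2          # or atop, as mentioned above
---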
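And regarding the PG count: the usual rule of thumb is to aim for roughly
100 PGs per OSD once replication is factored in, rounded up to the next
power of two. Assuming your replicated pools are size 3 (a guess, you did
not post "ceph osd pool ls detail"), the back-of-the-envelope math for 12
OSDs looks like this:

---
echo $(( 12 * 100 / 3 ))     # -> 400, so roughly 512 PGs in total
---

That total then gets split across your 6 pools, weighted by how much data
you expect each one to hold (and using k+m instead of the replica size for
the EC pool). You have 124 PGs now; note that pg_num can be increased
later but not decreased, so don't massively overshoot either.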
> OSD Performance and Latency
>
> osd  commit_latency(ms)  apply_latency(ms)
>   9                   1                  1
>   8                   1                  1
>   0                  13                 13
>  11                   1                  1
>   1                  38                 38
>  10                   2                  2
>   2                  21                 21
>   3                   2                  2
>   4                  20                 20
>   5                   1                  1
>   6                   1                  1
>   7                   1                  1
>

I found these counters to be less than reliable, or at least less than
relevant, unless there is constant activity and they are read frequently
as well.

For example, on a cluster where the HDD based OSDs are basically dormant
write-wise most of the time while the SSD based cache tier is very busy:
---
 16       76     124
 17       66      99
 18        0       1
 19        0       0
---
The first two are HDD OSDs, the second two are SSD OSDs in the cache tier.
And I can assure you that the HDD based OSDs (which have journal OSD and
are really RAID10s behind a 4GB HW cache RAID controller) are not that
slow.

[snip]

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com