Re: Odd latency numbers

Regarding OpenNebula: it is working, though we find the network functionality less than flexible. We would prefer the orchestration layer to let each primary group build its own internal network infrastructure and then automatically provide NAT from one or more public IP addresses (think AWS and Azure). This doesn't seem to be implemented at this time and will likely require manual intervention per group of users. Otherwise we like the software and find it much more lightweight than OpenStack. We need a tool that can be managed by a very small team, and OpenNebula meets that goal.
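(The per-group workaround we expect to script looks roughly like the sketch below; the network name, bridge, address range, and interface are made-up placeholders rather than our actual config, and it assumes OpenNebula 5.x with the bridge driver.)

---
# per-group virtual network template, loaded with "onevnet create grp-research.net"
NAME    = "grp-research"
VN_MAD  = "bridge"
BRIDGE  = "br-grp-research"
AR      = [ TYPE = "IP4", IP = "192.168.100.2", SIZE = "250" ]

# manual NAT on the gateway host from a public-facing interface to that group's range
iptables -t nat -A POSTROUTING -s 192.168.100.0/24 -o eth0 -j MASQUERADE
---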




Thanks for taking the time to look at this data for our test cluster. It isn't production, so yes, we are throwing spaghetti at the wall to make sure we can handle issues as they come up.


We had already planned to increase the PG count and have done so (thanks).
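(For anyone following along, the bump is the usual pair of commands per pool; the target of 128 below is purely illustrative.)

---
ceph osd pool set cephfs-cold pg_num 128
ceph osd pool set cephfs-cold pgp_num 128
---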


Here is our OSD tree. As this is a test cluster, we are currently sharing the OSD disks between the cache tier (replica 3) and the data pool (erasure coded); more hardware is on the way so we can test putting the cache tier on SSDs (a rough sketch of how we expect to move it follows the tree).


We have been reviewing atop, iostat, sar, and our SNMP monitoring (not granular enough) and have confirmed that the disks on this particular node are under a higher load than the others. We will likely take the time to deploy Graphite, since it will help with another project as well. One speculation discussed this morning is a bad cache battery on the PERC card in ceph-mon1, which would explain the +10 ms latency we see on all four drives on that host (and wouldn't be Ceph at all in that case).
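(The quick checks on ceph-mon1 in the meantime would be along these lines; this assumes an LSI-based PERC with MegaCli installed, so adjust the battery check to whatever tool matches your card.)

---
# per-disk service times on the suspect node; watch await and %util
iostat -x 5

# write-back cache battery state on the PERC (LSI MegaCli syntax)
megacli -AdpBbuCmd -GetBbuStatus -aALL
---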


ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 3.12685 root default                                         
-2 1.08875     host ceph-mon1                                   
 0 0.27219         osd.0           up  1.00000          1.00000 
 1 0.27219         osd.1           up  1.00000          1.00000 
 2 0.27219         osd.2           up  1.00000          1.00000 
 4 0.27219         osd.4           up  1.00000          1.00000 
-3 0.94936     host ceph-mon2                                   
 3 0.27219         osd.3           up  1.00000          1.00000 
 5 0.27219         osd.5           up  1.00000          1.00000 
 7 0.27219         osd.7           up  1.00000          1.00000 
 9 0.13280         osd.9           up  1.00000          1.00000 
-4 1.08875     host ceph-mon3                                   
 6 0.27219         osd.6           up  1.00000          1.00000 
 8 0.27219         osd.8           up  1.00000          1.00000 
10 0.27219         osd.10          up  1.00000          1.00000 
11 0.27219         osd.11          up  1.00000          1.00000 
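(Once the SSDs arrive, the rough plan is a separate CRUSH root plus a rule that points the cache pool at it; the bucket and rule names below are placeholders, and since this cluster is pre-Luminous the pool setting is still crush_ruleset.)

---
# create an SSD-only root and move the new SSD host buckets under it
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket ceph-mon1-ssd host
ceph osd crush move ceph-mon1-ssd root=ssd

# replicated rule that picks one OSD per host under the ssd root
ceph osd crush rule create-simple ssd_replicated ssd host

# point the cache pool at the new rule (id from "ceph osd crush rule dump")
ceph osd pool set cephfs-hot crush_ruleset <rule-id>
---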



Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222





From: Christian Balzer <chibi@xxxxxxx>
Sent: Wednesday, March 15, 2017 8:31 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Rhian Resnick
Subject: Re: Odd latency numbers
 

Hello,

On Wed, 15 Mar 2017 16:49:00 +0000 Rhian Resnick wrote:

> Morning all,
>
>
> We are starting to apply load to our test CephFS system and are noticing some odd latency numbers. We are using erasure coding for the cold data pools and replication for our cache tiers (not on SSD yet). We noticed the following high latency on one node and it seems to be slowing down writes and reads on the cluster.
>
The pg dump below was massive overkill at this point in time, whereas a
"ceph osd tree" would have probably shown us the topology (where is your
tier, where your EC pool(s)?).
Same for a "ceph osd pool ls detail".

So if we were to assume that node is your cache tier (replica 1?), then the
latencies would make sense.
But that's guesswork, so describe your cluster in more detail.

And yes, a single slow OSD (stealthily failing drive, etc) can bring a
cluster to its knees.
This is why many people here tend to get every last bit of info with
collectd and feed it into carbon and graphite/grafana, etc.
This will immediately indicate culprits and allow you to correlate this
with other data, like actual disk/network/cpu load, etc.
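(For the archives: a minimal collectd sketch along those lines, assuming the stock ceph and write_graphite plugins and a made-up Graphite host.)

---
LoadPlugin ceph
LoadPlugin write_graphite

<Plugin ceph>
  # one Daemon block per admin socket on this node
  <Daemon "osd.0">
    SocketPath "/var/run/ceph/ceph-osd.0.asok"
  </Daemon>
</Plugin>

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
---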

For the time being, run atop on that node and see if you can reduce the
issue to something like "all disks are busy all the time" or "CPU meltdown".

>
> Our next step is break out mds, mgr, and mons to different machines but we wanted to start the discussion here.
>

If your nodes (not a single iota of HW/NW info from you) are powerful
enough, breaking out stuff isn't likely to help or be a necessity.
 
More below.

>
> Here is a bunch of information you may find useful.
>
>
> ceph.conf
>
> [global]
> fsid = XXXXX
> mon_initial_members = ceph-mon1, ceph-mon2, ceph-mon3
> mon_host = 10.141.167.238,10.141.160.251,10.141.161.249
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> cluster network = 10.85.8.0/22
> public network = 10.141.0.0/16
>
> # we tested this with bluestore and xfs and have the same results
> [osd]
> enable_experimental_unrecoverable_data_corrupting_features = bluestore
>
I suppose this is not production in any shape or form.

> Status
>
>     cluster 8f6ba9d6-314d-4725-bcfa-340e500697f0
>      health HEALTH_OK
>      monmap e2: 3 mons at {ceph-mon1=10.141.167.238:6789/0,ceph-mon2=10.141.160.251:6789/0,ceph-mon3=10.141.161.249:6789/0}
>             election epoch 12, quorum 0,1,2 ceph-mon2,ceph-mon3,ceph-mon1
>       fsmap e30: 1/1/1 up {0=ceph-mon3=up:active}, 2 up:standby
>         mgr active: ceph-mon3 standbys: ceph-mon2, ceph-mon1
>      osdmap e100: 12 osds: 12 up, 12 in
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v119525: 124 pgs, 6 pools, 471 GB data, 1141 kobjects
>             970 GB used, 2231 GB / 3202 GB avail
>                  124 active+clean
>   client io 11962 B/s rd, 11 op/s rd, 0 op/s wr
>
At first glance there seem to be far too few PGs here, even given the
low number of OSDs.

>
> Pool space usage
>
Irrelevant.
> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     3202G     2231G         970G         30.31
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0           0         0          580G           0
>     cephfs-hot          1      76137M     11.35          580G      466451
>     cephfs-cold         2        397G     25.48         1161G      650158
>     cephfs_metadata     3      47237k         0          580G       52275
>     one-hot             4           0         0          580G           0
>     one                 5           0         0         1161G           0
>
An aside, how happy are you with OpenNebula and Ceph?
I found that the lack of a migration network option in ON is a show
stopper for us.

>
> OSD Performance and Latency
>
> osd commit_latency(ms) apply_latency(ms)
>   9                  1                 1
>   8                  1                 1
>   0                 13                13
>  11                  1                 1
>   1                 38                38
>  10                  2                 2
>   2                 21                21
>   3                  2                 2
>   4                 20                20
>   5                  1                 1
>   6                  1                 1
>   7                  1                 1
>
I found these counters to be less than reliable, or at least only relevant
when there is constant activity and they are read frequently as well.

For example on a cluster where the HDD based OSDs are basically dormant
write wise most of the time while the SSD based cache tier is very busy:
---
 16                    76                  124
 17                    66                   99
 18                     0                    1
 19                     0                    0
---

The first two are HDD OSDs, the second two are SSD OSDs in the cache tier.
And I can assure you that the HDD based OSDs (which have SSD journals and are
really RAID10s behind a 4GB HW cache RAID controller) are not that slow.
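(For the archives, polling them continuously looks something like the following, with osd.1 as an arbitrary example; the perf dump has to be run on the node hosting that OSD.)

---
watch -n 5 'ceph osd perf'
ceph daemon osd.1 perf dump | grep -i latency
---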

[snip]

Christian
--
Christian Balzer        Network/Systems Engineer               
chibi@xxxxxxx    Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
