Hi,
I've been looking at the ceph mds perf counters and noticed that one of my clusters is hugely different from the others in the number of caps:
-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps | hsr hcs hcr | writ read actv | recd recy stry purg | segs evts subm
0 3.0M 5.1M | 0 0 595 | 304 4 0 | 0 0 13k 0 | 42 35k 893
0 3.0M 5.1M | 0 0 165 | 1.8k 4 37 | 0 0 13k 0 | 43 36k 302
16 3.0M 5.1M | 0 0 429 | 247 9 4 | 0 0 13k 58 | 38 32k 1.7k
0 3.0M 5.1M | 0 1 213 | 1.2k 0 857 | 0 0 13k 0 | 40 33k 766
23 3.0M 5.1M | 0 0 945 | 445 1 0 | 0 0 13k 0 | 41 34k 1.1k
0 3.0M 5.1M | 0 2 696 | 376 11 0 | 0 0 13k 0 | 43 35k 1.0k
3 2.9M 5.1M | 0 0 601 | 2.0k 6 0 | 0 0 13k 56 | 38 29k 1.2k
0 2.9M 5.1M | 0 0 394 | 272 11 0 | 0 0 13k 0 | 38 30k 758
On another cluster running the same version:
-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps | hsr hcs hcr | writ read actv | recd recy stry purg | segs evts subm
2 3.9M 380k | 0 1 266 | 1.8k 0 370 | 0 0 24k 44 | 37 129k 1.5k
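(For reference, the two outputs above are the rolling counter view; on a box with access to the MDS admin socket it can be produced with something like the following, with mds.a as an example daemon name:)

~# ceph daemonperf mds.a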
I did a perf dump on the active mds:
~# ceph daemon mds.a perf dump mds
{
"mds": {
"request": 2245276724,
"reply": 2245276366,
"reply_latency": {
"avgcount": 2245276366,
"sum": 18750003.074118977
},
"forward": 0,
"dir_fetch": 20217943,
"dir_commit": 555295668,
"dir_split": 0,
"inode_max": 3000000,
"inodes": 3000276,
"inodes_top": 152555,
"inodes_bottom": 279938,
"inodes_pin_tail": 2567783,
"inodes_pinned": 2782064,
"inodes_expired": 308697104,
"inodes_with_caps": 2779658,
"caps": 5147887,
"subtrees": 2,
"traverse": 2582452087,
"traverse_hit": 2338123987,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 16627249,
"traverse_remote_ino": 29276,
"traverse_lock": 2507504,
"load_cent": 18446743868740589422,
"q": 27,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}
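If you only care about the caps counter, a quick way to pull just that field out of the dump is something like this (assuming jq is available; mds.a as above):

~# ceph daemon mds.a perf dump mds | jq '.mds.caps'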
and then a session ls to see which clients could be holding that many caps:
{
"client_metadata" : {
"entity_id" : "admin",
"kernel_version" : "4.4.0-97-generic",
"hostname" : "suppressed"
},
"completed_requests" : 0,
"id" : 1165169,
"num_leases" : 343,
"inst" : "client.1165169 10.0.0.112:0/982172363",
"state" : "open",
"num_caps" : 111740,
"reconnecting" : false,
"replay_requests" : 0
},
{
"state" : "open",
"replay_requests" : 0,
"reconnecting" : false,
"num_caps" : 108125,
"id" : 1236036,
"completed_requests" : 0,
"client_metadata" : {
"hostname" : "suppressed",
"kernel_version" : "4.4.0-97-generic",
"entity_id" : "admin"
},
"num_leases" : 323,
"inst" : "client.1236036 10.0.0.113:0/1891451616"
},
{
"num_caps" : 63186,
"reconnecting" : false,
"replay_requests" : 0,
"state" : "open",
"num_leases" : 147,
"completed_requests" : 0,
"client_metadata" : {
"kernel_version" : "4.4.0-75-generic",
"entity_id" : "admin",
"hostname" : "suppressed"
},
"id" : 1235930,
"inst" : "client.1235930 10.0.0.110:0/2634585537"
},
{
"num_caps" : 2476444,
"replay_requests" : 0,
"reconnecting" : false,
"state" : "open",
"num_leases" : 0,
"completed_requests" : 0,
"client_metadata" : {
"entity_id" : "admin",
"kernel_version" : "4.4.0-75-generic",
"hostname" : "suppressed"
},
"id" : 1659696,
"inst" : "client.1659696 10.0.0.101:0/4005556527"
},
{
"state" : "open",
"replay_requests" : 0,
"reconnecting" : false,
"num_caps" : 2386376,
"id" : 1069714,
"client_metadata" : {
"hostname" : "suppressed",
"kernel_version" : "4.4.0-75-generic",
"entity_id" : "admin"
},
"completed_requests" : 0,
"num_leases" : 0,
"inst" : "client.1069714 10.0.0.111:0/1876172355"
},
{
"replay_requests" : 0,
"reconnecting" : false,
"num_caps" : 1726,
"state" : "open",
"inst" : "client.8394 10.0.0.103:0/3970353996",
"num_leases" : 0,
"id" : 8394,
"client_metadata" : {
"entity_id" : "admin",
"kernel_version" : "4.4.0-75-generic",
"hostname" : "suppressed"
},
"completed_requests" : 0
}
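To rank the clients by the number of caps they hold without reading through the whole listing, a filter along these lines works (again assuming jq is available):

~# ceph daemon mds.a session ls | jq -r 'sort_by(-.num_caps) | .[] | "\(.num_caps)\t\(.inst)"'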
Surprisingly, the two hosts that were holding 2M+ caps were the ones not in use: cephfs was mounted, but nothing was using the directories.
I did a mount -o remount of cephfs on those two hosts and, after that, the caps count dropped significantly, to less than 300k:
"caps": 288489
So, questions: does that really matter? What are the possible impacts? What could have caused these two hosts to hold so many capabilities?
One of the hosts is for testing purposes, so traffic is close to zero. The other host wasn't using cephfs at all; all services were stopped.
:~# ceph -v
ceph version 10.2.9-4-gbeaec39 (beaec397f00491079cd74f7b9e3e10660859e26b)
~# uname -a
Linux hostname_suppressed 4.4.0-75-generic #96~14.04.1-Ubuntu SMP Thu Apr 20 11:06:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
~# dpkg -l | grep ceph
ii ceph 10.2.9-4-gbeaec39-1trusty amd64 distributed storage and file system
ii ceph-base 10.2.9-4-gbeaec39-1trusty amd64 common ceph daemon libraries and management tools
ii ceph-common 10.2.9-4-gbeaec39-1trusty amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fs-common 10.2.9-4-gbeaec39-1trusty amd64 common utilities to mount and interact with a ceph file system
ii ceph-mds 10.2.9-4-gbeaec39-1trusty amd64 metadata server for the ceph distributed file system
ii ceph-mon 10.2.9-4-gbeaec39-1trusty amd64 monitor server for the ceph storage system
ii ceph-osd 10.2.9-4-gbeaec39-1trusty amd64 OSD server for the ceph storage system
ii libcephfs1 10.2.9-4-gbeaec39-1trusty amd64 Ceph distributed file system client library
ii python-cephfs 10.2.9-4-gbeaec39-1trusty amd64 Python libraries for the Ceph libcephfs library
Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ