Thanks!
It's true that I've seen continuous memory growth, but I hadn't thought it was a memory leak. I don't remember exactly how many hours were necessary to fill the memory, but I estimate it was about 14h.
With the new configuration it looks like memory grows slowly and stops when it reaches 5-6 GB. Sometimes it looks like the daemon flushes the memory and drops back down to less than 1 GB, then slowly grows again to 5-6 GB.
Just today, and I don't know why or how because I haven't changed anything on the Ceph cluster, the memory dropped to less than 1 GB and is still there 8 hours later. The only thing I've done is deploy a git repository with some changes.
Some nodes are still on version 12.2.5, because when I detected this problem I didn't know whether it was caused by the latest version, so I stopped the update. The node that is currently the active MDS is on the latest version (12.2.7), and I've scheduled the update of the rest of the nodes for Thursday.
Here is a graph of the memory usage over the last few days with that configuration:
I don't have any information from when the problem was at its worst (512 MB MDS memory limit and 15-16 GB of usage), because memory usage wasn't logged back then. All I have is a heap stats dump taken while the daemon was filling up the memory:
# ceph tell mds.kavehome-mgto-pro-fs01 heap stats
2018-07-19 00:43:46.142560 7f5a7a7fc700 0 client.1318388 ms_handle_reset on 10.22.0.168:6800/1129848128
2018-07-19 00:43:46.181133 7f5a7b7fe700 0 client.1318391 ms_handle_reset on 10.22.0.168:6800/1129848128
mds.kavehome-mgto-pro-fs01 tcmalloc heap stats:------------------------------------------------
MALLOC: 9982980144 ( 9520.5 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 172148208 ( 164.2 MiB) Bytes in central cache freelist
MALLOC: + 19031168 ( 18.1 MiB) Bytes in transfer cache freelist
MALLOC: + 23987552 ( 22.9 MiB) Bytes in thread cache freelists
MALLOC: + 20869280 ( 19.9 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 10219016352 ( 9745.6 MiB) Actual memory used (physical + swap)
MALLOC: + 3913687040 ( 3732.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 14132703392 (13478.0 MiB) Virtual address space used
MALLOC:
MALLOC: 63875 Spans in use
MALLOC: 16 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
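As a side note, the numbers in a tcmalloc heap stats dump like the one above are self-consistent: the application bytes plus the freelists and metadata add up to the "Actual memory used" line, and adding the unmapped bytes gives the virtual address space figure. A quick sketch (not part of any Ceph tooling) that parses the dump and checks both identities:

```python
import re

HEAP_STATS = """\
MALLOC:      9982980144 ( 9520.5 MiB) Bytes in use by application
MALLOC: +             0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +     172148208 (  164.2 MiB) Bytes in central cache freelist
MALLOC: +      19031168 (   18.1 MiB) Bytes in transfer cache freelist
MALLOC: +      23987552 (   22.9 MiB) Bytes in thread cache freelists
MALLOC: +      20869280 (   19.9 MiB) Bytes in malloc metadata
MALLOC:   = 10219016352 ( 9745.6 MiB) Actual memory used (physical + swap)
MALLOC: +    3913687040 ( 3732.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   = 14132703392 (13478.0 MiB) Virtual address space used
"""

def parse(stats):
    # Map each stat description to its byte count.
    fields = {}
    for line in stats.splitlines():
        m = re.match(r"MALLOC:\s*[+=]?\s*(\d+)\s*\([^)]*\)\s*(.+)", line)
        if m:
            fields[m.group(2).strip()] = int(m.group(1))
    return fields

f = parse(HEAP_STATS)
in_use = (f["Bytes in use by application"]
          + f["Bytes in page heap freelist"]
          + f["Bytes in central cache freelist"]
          + f["Bytes in transfer cache freelist"]
          + f["Bytes in thread cache freelists"]
          + f["Bytes in malloc metadata"])
# Physical usage is everything above; adding unmapped bytes gives virtual.
assert in_use == f["Actual memory used (physical + swap)"]
assert in_use + f["Bytes released to OS (aka unmapped)"] == f["Virtual address space used"]
print(f"tracked physical usage: {in_use / 2**20:.1f} MiB")  # → tracked physical usage: 9745.6 MiB
```

What stands out in this particular dump is that ~9.5 GiB is "in use by application" rather than sitting in allocator freelists, so forcing a release (`ceph tell mds.<name> heap release`) would likely not have helped much here.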
Here's the config diff:
--------------------------------------------------------------------------------------------------------------------
{
"diff": {
"current": {
"admin_socket": "/var/run/ceph/ceph-mds.kavehome-mgto-pro-fs01.asok",
"auth_client_required": "cephx",
"bluestore_cache_size_hdd": "80530636",
"bluestore_cache_size_ssd": "80530636",
"err_to_stderr": "true",
"fsid": "f015f888-6e0c-4203-aea8-ef0f69ef7bd8",
"internal_safe_to_start_threads": "true",
"keyring": "/var/lib/ceph/mds/ceph-kavehome-mgto-pro-fs01/keyring",
"log_file": "/var/log/ceph/ceph-mds.kavehome-mgto-pro-fs01.log",
"log_max_recent": "10000",
"log_to_stderr": "false",
"mds_cache_memory_limit": "53687091",
"mds_data": "/var/lib/ceph/mds/ceph-kavehome-mgto-pro-fs01",
"mgr_data": "/var/lib/ceph/mgr/ceph-kavehome-mgto-pro-fs01",
"mon_cluster_log_file": "default=/var/log/ceph/ceph.$channel.log cluster=/var/log/ceph/ceph.log",
"mon_data": "/var/lib/ceph/mon/ceph-kavehome-mgto-pro-fs01",
"mon_debug_dump_location": "/var/log/ceph/ceph-mds.kavehome-mgto-pro-fs01.tdump",
"mon_host": "10.22.0.168,10.22.0.140,10.22.0.127",
"mon_initial_members": "kavehome-mgto-pro-fs01, kavehome-mgto-pro-fs02, kavehome-mgto-pro-fs03",
"osd_data": "/var/lib/ceph/osd/ceph-kavehome-mgto-pro-fs01",
"osd_journal": "/var/lib/ceph/osd/ceph-kavehome-mgto-pro-fs01/journal",
"public_addr": "10.22.0.168:0/0",
"public_network": "10.22.0.0/24",
"rgw_data": "/var/lib/ceph/radosgw/ceph-kavehome-mgto-pro-fs01",
"setgroup": "ceph",
"setuser": "ceph"
},
"defaults": {
"admin_socket": "",
"auth_client_required": "cephx, none",
"bluestore_cache_size_hdd": "1073741824",
"bluestore_cache_size_ssd": "3221225472",
"err_to_stderr": "false",
"fsid": "00000000-0000-0000-0000-000000000000",
"internal_safe_to_start_threads": "false",
"keyring": "/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,",
"log_file": "",
"log_max_recent": "500",
"log_to_stderr": "true",
"mds_cache_memory_limit": "1073741824",
"mds_data": "/var/lib/ceph/mds/$cluster-$id",
"mgr_data": "/var/lib/ceph/mgr/$cluster-$id",
"mon_cluster_log_file": "default=/var/log/ceph/$cluster.$channel.log cluster=/var/log/ceph/$cluster.log",
"mon_data": "/var/lib/ceph/mon/$cluster-$id",
"mon_debug_dump_location": "/var/log/ceph/$cluster-$name.tdump",
"mon_host": "",
"mon_initial_members": "",
"osd_data": "/var/lib/ceph/osd/$cluster-$id",
"osd_journal": "/var/lib/ceph/osd/$cluster-$id/journal",
"public_addr": "-",
"public_network": "",
"rgw_data": "/var/lib/ceph/radosgw/$cluster-$id",
"setgroup": "",
"setuser": ""
}
},
"unknown": []
}
----------------------------------------------------------------------------------------------------------
Perf Dump
---------------------------------------------------------------------------------------------------------
{
"AsyncMessenger::Worker-0": {
"msgr_recv_messages": 1350895,
"msgr_send_messages": 1593759,
"msgr_recv_bytes": 301786293,
"msgr_send_bytes": 341807191,
"msgr_created_connections": 148,
"msgr_active_connections": 45,
"msgr_running_total_time": 119.217157290,
"msgr_running_send_time": 39.714493374,
"msgr_running_recv_time": 127.455260807,
"msgr_running_fast_dispatch_time": 0.117634930
},
"AsyncMessenger::Worker-1": {
"msgr_recv_messages": 2996114,
"msgr_send_messages": 3113274,
"msgr_recv_bytes": 804875332,
"msgr_send_bytes": 1231962873,
"msgr_created_connections": 151,
"msgr_active_connections": 48,
"msgr_running_total_time": 248.962533700,
"msgr_running_send_time": 83.497214869,
"msgr_running_recv_time": 547.534653813,
"msgr_running_fast_dispatch_time": 0.125151678
},
"AsyncMessenger::Worker-2": {
"msgr_recv_messages": 1793419,
"msgr_send_messages": 2117240,
"msgr_recv_bytes": 1425674729,
"msgr_send_bytes": 871324466,
"msgr_created_connections": 325,
"msgr_active_connections": 54,
"msgr_running_total_time": 160.001753142,
"msgr_running_send_time": 49.679463024,
"msgr_running_recv_time": 205.535692064,
"msgr_running_fast_dispatch_time": 4.350479591
},
"finisher-PurgeQueue": {
"queue_len": 0,
"complete_latency": {
"avgcount": 755,
"sum": 0.022316252,
"avgtime": 0.000029557
}
},
"mds": {
"request": 4942944,
"reply": 489638,
"reply_latency": {
"avgcount": 489638,
"sum": 771.955019623,
"avgtime": 0.001576583
},
"forward": 4453296,
"dir_fetch": 101036,
"dir_commit": 3,
"dir_split": 0,
"dir_merge": 0,
"inode_max": 2147483647,
"inodes": 505,
"inodes_top": 96,
"inodes_bottom": 398,
"inodes_pin_tail": 11,
"inodes_pinned": 367,
"inodes_expired": 1556356,
"inodes_with_caps": 325,
"caps": 1192,
"subtrees": 16,
"traverse": 4956673,
"traverse_hit": 496867,
"traverse_forward": 4450841,
"traverse_discover": 166,
"traverse_dir_fetch": 1657,
"traverse_remote_ino": 0,
"traverse_lock": 19,
"load_cent": 494278118,
"q": 0,
"exported": 1187,
"exported_inodes": 664127,
"imported": 947,
"imported_inodes": 76628
},
"mds_cache": {
"num_strays": 0,
"num_strays_delayed": 0,
"num_strays_enqueuing": 0,
"strays_created": 124,
"strays_enqueued": 124,
"strays_reintegrated": 0,
"strays_migrated": 0,
"num_recovering_processing": 0,
"num_recovering_enqueued": 0,
"num_recovering_prioritized": 0,
"recovery_started": 0,
"recovery_completed": 0,
"ireq_enqueue_scrub": 0,
"ireq_exportdir": 1189,
"ireq_flush": 0,
"ireq_fragmentdir": 0,
"ireq_fragstats": 0,
"ireq_inodestats": 0
},
"mds_log": {
"evadd": 125666,
"evex": 116984,
"evtrm": 116984,
"ev": 117582,
"evexg": 0,
"evexd": 933,
"segadd": 138,
"segex": 138,
"segtrm": 138,
"seg": 129,
"segexg": 0,
"segexd": 1,
"expos": 25715287703,
"wrpos": 25862332030,
"rdpos": 25663431097,
"jlat": {
"avgcount": 23473,
"sum": 98.111299299,
"avgtime": 0.004179751
},
"replayed": 108900
},
"mds_mem": {
"ino": 507,
"ino+": 1579334,
"ino-": 1578827,
"dir": 312,
"dir+": 101932,
"dir-": 101620,
"dn": 529,
"dn+": 1580751,
"dn-": 1580222,
"cap": 1192,
"cap+": 1825843,
"cap-": 1824651,
"rss": 258840,
"heap": 313880,
"buf": 0
},
"mds_server": {
"dispatch_client_request": 5081829,
"dispatch_server_request": 540,
"handle_client_request": 4942944,
"handle_client_session": 233505,
"handle_slave_request": 846,
"req_create": 128,
"req_getattr": 38805,
"req_getfilelock": 0,
"req_link": 0,
"req_lookup": 242216,
"req_lookuphash": 0,
"req_lookupino": 0,
"req_lookupname": 2,
"req_lookupparent": 0,
"req_lookupsnap": 0,
"req_lssnap": 0,
"req_mkdir": 0,
"req_mknod": 0,
"req_mksnap": 0,
"req_open": 2155,
"req_readdir": 206315,
"req_rename": 21,
"req_renamesnap": 0,
"req_rmdir": 0,
"req_rmsnap": 0,
"req_rmxattr": 0,
"req_setattr": 2,
"req_setdirlayout": 0,
"req_setfilelock": 0,
"req_setlayout": 0,
"req_setxattr": 0,
"req_symlink": 0,
"req_unlink": 122
},
"mds_sessions": {
"session_count": 10,
"session_add": 128,
"session_remove": 118
},
"objecter": {
"op_active": 0,
"op_laggy": 0,
"op_send": 136767,
"op_send_bytes": 202196534,
"op_resend": 0,
"op_reply": 136767,
"op": 136767,
"op_r": 101193,
"op_w": 35574,
"op_rmw": 0,
"op_pg": 0,
"osdop_stat": 5,
"osdop_create": 0,
"osdop_read": 150,
"osdop_write": 23587,
"osdop_writefull": 11750,
"osdop_writesame": 0,
"osdop_append": 0,
"osdop_zero": 2,
"osdop_truncate": 0,
"osdop_delete": 228,
"osdop_mapext": 0,
"osdop_sparse_read": 0,
"osdop_clonerange": 0,
"osdop_getxattr": 100784,
"osdop_setxattr": 0,
"osdop_cmpxattr": 0,
"osdop_rmxattr": 0,
"osdop_resetxattrs": 0,
"osdop_tmap_up": 0,
"osdop_tmap_put": 0,
"osdop_tmap_get": 0,
"osdop_call": 0,
"osdop_watch": 0,
"osdop_notify": 0,
"osdop_src_cmpxattr": 0,
"osdop_pgls": 0,
"osdop_pgls_filter": 0,
"osdop_other": 3,
"linger_active": 0,
"linger_send": 0,
"linger_resend": 0,
"linger_ping": 0,
"poolop_active": 0,
"poolop_send": 0,
"poolop_resend": 0,
"poolstat_active": 0,
"poolstat_send": 0,
"poolstat_resend": 0,
"statfs_active": 0,
"statfs_send": 0,
"statfs_resend": 0,
"command_active": 0,
"command_send": 0,
"command_resend": 0,
"map_epoch": 468,
"map_full": 0,
"map_inc": 39,
"osd_sessions": 3,
"osd_session_open": 479,
"osd_session_close": 476,
"osd_laggy": 0,
"omap_wr": 7,
"omap_rd": 202074,
"omap_del": 1
},
"purge_queue": {
"pq_executing_ops": 0,
"pq_executing": 0,
"pq_executed": 124
},
"throttle-msgr_dispatch_throttler-mds": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 6140428,
"get_sum": 2077944682,
"get_or_fail_fail": 0,
"get_or_fail_success": 6140428,
"take": 0,
"take_sum": 0,
"put": 6140428,
"put_sum": 2077944682,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-objecter_bytes": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 136767,
"take_sum": 339484250,
"put": 136523,
"put_sum": 339484250,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-objecter_ops": {
"val": 0,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 136767,
"take_sum": 136767,
"put": 136767,
"put_sum": 136767,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-write_buf_throttle": {
"val": 0,
"max": 3758096384,
"get_started": 0,
"get": 124,
"get_sum": 11532,
"get_or_fail_fail": 0,
"get_or_fail_success": 124,
"take": 0,
"take_sum": 0,
"put": 109,
"put_sum": 11532,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-write_buf_throttle-0x55faf5ba4220": {
"val": 0,
"max": 3758096384,
"get_started": 0,
"get": 125666,
"get_sum": 198900816,
"get_or_fail_fail": 0,
"get_or_fail_success": 125666,
"take": 0,
"take_sum": 0,
"put": 23473,
"put_sum": 198900816,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
}
}
----------------------------------------------------------------------------------------------
dump_mempools
----------------------------------------------------------------------------------------------
{
"bloom_filter": {
"items": 120,
"bytes": 120
},
"bluestore_alloc": {
"items": 0,
"bytes": 0
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 0,
"bytes": 0
},
"bluestore_cache_other": {
"items": 0,
"bytes": 0
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 0,
"bytes": 0
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 0,
"bytes": 0
},
"buffer_anon": {
"items": 96401,
"bytes": 16010198
},
"buffer_meta": {
"items": 1,
"bytes": 88
},
"osd": {
"items": 0,
"bytes": 0
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 0,
"bytes": 0
},
"osdmap": {
"items": 80,
"bytes": 3296
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 17604,
"bytes": 2330840
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 114206,
"bytes": 18344542
}
}
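For what it's worth, the non-zero pools in the dump_mempools output above sum exactly to the reported total, and mds_co (the pool that tracks the MDS cache) accounts for only about 2.2 MiB, far below the multi-GiB RSS seen earlier, which is consistent with the growth happening outside the tracked cache. A small check, with the byte counts copied from the dump:

```python
# Non-zero byte counts taken from the dump_mempools output above.
mempools = {
    "bloom_filter": 120,
    "buffer_anon": 16010198,
    "buffer_meta": 88,
    "osdmap": 3296,
    "mds_co": 2330840,
}
reported_total = 18344542  # "total" -> "bytes" from the dump

# The per-pool counts should add up to the reported total.
assert sum(mempools.values()) == reported_total

print(f"mds_co is {mempools['mds_co'] / 2**20:.1f} MiB of a "
      f"{reported_total / 2**20:.1f} MiB tracked total")
# → mds_co is 2.2 MiB of a 17.5 MiB tracked total
```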
-------------------------------------------------------------------------------------------------------------------
Sorry for my English!
Greetings!!
On 23 Jul 2018 at 20:08, "Patrick Donnelly" <pdonnell@xxxxxxxxxx> wrote:
On Mon, Jul 23, 2018 at 5:48 AM, Daniel Carrasco <d.carrasco@xxxxxxxxx> wrote:
> Hi, thanks for your response.
>
> Clients are about 6, and 4 of them are the most of time on standby. Only two
> are active servers that are serving the webpage. Also we've a varnish on
> front, so are not getting all the load (below 30% in PHP is not much).
> About the MDS cache, now I've the mds_cache_memory_limit at 8Mb.

What! Please post `ceph daemon mds.<name> config diff`, `... perf
dump`, and `... dump_mempools` from the server the active MDS is on.

> I've tested
> also 512Mb, but the CPU usage is the same and the MDS RAM usage grows up to
> 15GB (on a 16Gb server it starts to swap and all fails). With 8Mb, at least
> the memory usage is stable on less than 6Gb (now is using about 1GB of RAM).

We've seen reports of possible memory leaks before and the potential
fixes for those were in 12.2.6. How fast does your MDS reach 15GB?
Your MDS cache size should be configured to 1-8GB (depending on your
preference) so it's disturbing to see you set it so low.

--
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
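For anyone following along: Patrick's suggested 1-8 GB cache range would look like the fragment below in ceph.conf (the value is in bytes; 4 GiB is shown only as an illustrative middle-ground choice, not a recommendation from this thread).

```ini
[mds]
# mds_cache_memory_limit is in bytes; 4 GiB = 4294967296.
# Note the MDS RSS will exceed this limit somewhat, since it only
# bounds the cache, not total allocator/heap overhead.
mds_cache_memory_limit = 4294967296
```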