On Fri, May 25, 2018 at 4:28 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> I found a memory leak. Could you please try
> https://github.com/ceph/ceph/pull/22240
> The leak only affects multiple active mds, so I think it's unrelated to your issue.
>
> On Fri, May 25, 2018 at 1:49 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>> Here is the result:
>>
>>
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal
>> {
>>     "message": "",
>>     "return_code": 0
>> }
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 10000
>> {
>>     "success": "mds_cache_size = '10000' (not observed, change may require restart) "
>> }
>>
>> wait ...
>>
>>
>> root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap stats
>> 2018-05-25 07:44:02.185911 7f4cad7fa700 0 client.50748489 ms_handle_reset on 10.5.0.88:6804/994206868
>> 2018-05-25 07:44:02.196160 7f4cae7fc700 0 client.50792764 ms_handle_reset on 10.5.0.88:6804/994206868
>> mds.ceph4-2.odiso.net tcmalloc heap stats:------------------------------------------------
>> MALLOC:    13175782328 (12565.4 MiB) Bytes in use by application
>> MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
>> MALLOC: +   1774628488 ( 1692.4 MiB) Bytes in central cache freelist
>> MALLOC: +     34274608 (   32.7 MiB) Bytes in transfer cache freelist
>> MALLOC: +     57260176 (   54.6 MiB) Bytes in thread cache freelists
>> MALLOC: +    120582336 (  115.0 MiB) Bytes in malloc metadata
>> MALLOC:   ------------
>> MALLOC: =  15162527936 (14460.1 MiB) Actual memory used (physical + swap)
>> MALLOC: +   4974067712 ( 4743.6 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   ------------
>> MALLOC: =  20136595648 (19203.8 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:        1852388 Spans in use
>> MALLOC:             18 Thread heaps in use
>> MALLOC:           8192 Tcmalloc page size
>> ------------------------------------------------
>> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
>> Bytes released to the OS take up virtual address space but no physical memory.
>>
>>
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 0
>> {
>>     "success": "mds_cache_size = '0' (not observed, change may require restart) "
>> }
>>
>> ----- Original Message -----
>> From: "Zheng Yan" <ukernel@xxxxxxxxx>
>> To: "aderumier" <aderumier@xxxxxxxxx>
>> Sent: Friday 25 May 2018 05:56:31
>> Subject: Re: ceph mds memory usage 20GB : is it normal ?
>>
>> On Thu, May 24, 2018 at 11:34 PM, Alexandre DERUMIER
>> <aderumier@xxxxxxxxx> wrote:
>>>>> Still can't find any clue. Does the cephfs have an idle period? If it
>>>>> has, could you decrease the mds's cache size and check what happens. For
>>>>> example, run the following commands during the idle period.
>>>
>>>>> ceph daemon mds.xx flush journal
>>>>> ceph daemon mds.xx config set mds_cache_size 10000;
>>>>> "wait a minute"
>>>>> ceph tell mds.xx heap stats
>>>>> ceph daemon mds.xx config set mds_cache_size 0
>>>
>>> ok thanks. I'll try tonight.
>>>
>>> I already have mds_cache_memory_limit = 5368709120;
>>>
>>> do I need to remove it first before setting mds_cache_size to 10000?
>>
>> no
>>>
>>>
>>> ----- Original Message -----
>>> From: "Zheng Yan" <ukernel@xxxxxxxxx>
>>> To: "aderumier" <aderumier@xxxxxxxxx>
>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>> Sent: Thursday 24 May 2018 16:27:21
>>> Subject: Re: ceph mds memory usage 20GB : is it normal ?
>>>
>>> On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>> Thanks!
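
As a side note on the flush / shrink / heap-stats sequence above: it is easy to wrap in a small script so the timing is repeatable. A minimal sketch, assuming the daemon name is passed as an argument and a 60-second wait is enough for the MDS to trim its cache (adjust both to your cluster):

    #!/bin/sh
    # Cache-shrink test from this thread: flush the journal, force the cache
    # down to 10000 inodes, wait for trimming, snapshot the tcmalloc stats,
    # then restore the unlimited default (mds_cache_memory_limit still applies).
    MDS="$1"                                  # e.g. ceph4-2.odiso.net (assumed argument)
    ceph daemon "mds.$MDS" flush journal
    ceph daemon "mds.$MDS" config set mds_cache_size 10000
    sleep 60                                  # "wait a minute" while the MDS trims
    ceph tell "mds.$MDS" heap stats
    ceph daemon "mds.$MDS" config set mds_cache_size 0

If the stats still show a large freelist afterwards, 'ceph tell mds.<name> heap release' asks tcmalloc to perform the ReleaseFreeMemory() mentioned in the output and hand that memory back to the OS.
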
>>>>
>>>>
>>>> here is the profile.pdf
>>>>
>>>> 10-15 min of profiling; I can't do it longer because my clients were lagging.
>>>>
>>>> but I think it should be enough to observe the rss memory increase.
>>>>
>>>>
>>>
>>> Still can't find any clue. Does the cephfs have an idle period? If it
>>> has, could you decrease the mds's cache size and check what happens. For
>>> example, run the following commands during the idle period.
>>>
>>> ceph daemon mds.xx flush journal
>>> ceph daemon mds.xx config set mds_cache_size 10000;
>>> "wait a minute"
>>> ceph tell mds.xx heap stats
>>> ceph daemon mds.xx config set mds_cache_size 0
>>>
>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Zheng Yan" <ukernel@xxxxxxxxx>
>>>> To: "aderumier" <aderumier@xxxxxxxxx>
>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>>> Sent: Thursday 24 May 2018 11:34:20
>>>> Subject: Re: ceph mds memory usage 20GB : is it normal ?
>>>>
>>>> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>>> Hi, some new stats: mds memory is now 16G.
>>>>>
>>>>> I have almost the same number of items and bytes in cache as some weeks ago, when the mds was using 8G. (ceph 12.2.5)
>>>>>
>>>>>
>>>>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | jq -c '.mds_co'; done
>>>>> 16905052
>>>>> {"items":43350988,"bytes":5257428143}
>>>>> 16905052
>>>>> {"items":43428329,"bytes":5283850173}
>>>>> 16905052
>>>>> {"items":43209167,"bytes":5208578149}
>>>>> 16905052
>>>>> {"items":43177631,"bytes":5198833577}
>>>>> 16905052
>>>>> {"items":43312734,"bytes":5252649462}
>>>>> 16905052
>>>>> {"items":43355753,"bytes":5277197972}
>>>>> 16905052
>>>>> {"items":43700693,"bytes":5303376141}
>>>>> 16905052
>>>>> {"items":43115809,"bytes":5156628138}
>>>>> ^C
>>>>>
>>>>>
>>>>> root@ceph4-2:~# ceph status
>>>>>   cluster:
>>>>>     id:     e22b8e83-3036-4fe5-8fd5-5ce9d539beca
>>>>>     health: HEALTH_OK
>>>>>
>>>>>   services:
>>>>>     mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3
>>>>>     mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net, ceph4-3.odiso.net
>>>>>     mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby
>>>>>     osd: 18 osds: 18 up, 18 in
>>>>>     rgw: 3 daemons active
>>>>>
>>>>>   data:
>>>>>     pools:   11 pools, 1992 pgs
>>>>>     objects: 75677k objects, 6045 GB
>>>>>     usage:   20579 GB used, 6246 GB / 26825 GB avail
>>>>>     pgs:     1992 active+clean
>>>>>
>>>>>   io:
>>>>>     client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr
>>>>>
>>>>>
>>>>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status
>>>>> {
>>>>>     "pool": {
>>>>>         "items": 44523608,
>>>>>         "bytes": 5326049009
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump
>>>>> {
>>>>>     "AsyncMessenger::Worker-0": {
>>>>>         "msgr_recv_messages": 798876013,
>>>>>         "msgr_send_messages": 825999506,
>>>>>         "msgr_recv_bytes": 7003223097381,
>>>>>         "msgr_send_bytes": 691501283744,
>>>>>         "msgr_created_connections": 148,
>>>>>         "msgr_active_connections": 146,
>>>>>         "msgr_running_total_time": 39914.832387470,
>>>>>         "msgr_running_send_time": 13744.704199430,
>>>>>         "msgr_running_recv_time": 32342.160588451,
>>>>>         "msgr_running_fast_dispatch_time": 5996.336446782
>>>>>     },
>>>>>     "AsyncMessenger::Worker-1": {
>>>>>         "msgr_recv_messages": 429668771,
>>>>>         "msgr_send_messages": 414760220,
>>>>>         "msgr_recv_bytes": 5003149410825,
>>>>>         "msgr_send_bytes": 396281427789,
>>>>>         "msgr_created_connections": 132,
>>>>>         "msgr_active_connections": 132,
"msgr_running_total_time": 23644.410515392, >>>>> "msgr_running_send_time": 7669.068710688, >>>>> "msgr_running_recv_time": 19751.610043696, >>>>> "msgr_running_fast_dispatch_time": 4331.023453385 >>>>> }, >>>>> "AsyncMessenger::Worker-2": { >>>>> "msgr_recv_messages": 1312910919, >>>>> "msgr_send_messages": 1260040403, >>>>> "msgr_recv_bytes": 5330386980976, >>>>> "msgr_send_bytes": 3341965016878, >>>>> "msgr_created_connections": 143, >>>>> "msgr_active_connections": 138, >>>>> "msgr_running_total_time": 61696.635450100, >>>>> "msgr_running_send_time": 23491.027014598, >>>>> "msgr_running_recv_time": 53858.409319734, >>>>> "msgr_running_fast_dispatch_time": 4312.451966809 >>>>> }, >>>>> "finisher-PurgeQueue": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 1889416, >>>>> "sum": 29224.227703697, >>>>> "avgtime": 0.015467333 >>>>> } >>>>> }, >>>>> "mds": { >>>>> "request": 1822420924, >>>>> "reply": 1822420886, >>>>> "reply_latency": { >>>>> "avgcount": 1822420886, >>>>> "sum": 5258467.616943274, >>>>> "avgtime": 0.002885429 >>>>> }, >>>>> "forward": 0, >>>>> "dir_fetch": 116035485, >>>>> "dir_commit": 1865012, >>>>> "dir_split": 17, >>>>> "dir_merge": 24, >>>>> "inode_max": 2147483647, >>>>> "inodes": 1600438, >>>>> "inodes_top": 210492, >>>>> "inodes_bottom": 100560, >>>>> "inodes_pin_tail": 1289386, >>>>> "inodes_pinned": 1299735, >>>>> "inodes_expired": 22223476046, >>>>> "inodes_with_caps": 1299137, >>>>> "caps": 2211546, >>>>> "subtrees": 2, >>>>> "traverse": 1953482456, >>>>> "traverse_hit": 1127647211, >>>>> "traverse_forward": 0, >>>>> "traverse_discover": 0, >>>>> "traverse_dir_fetch": 105833969, >>>>> "traverse_remote_ino": 31686, >>>>> "traverse_lock": 4344, >>>>> "load_cent": 182244014474, >>>>> "q": 104, >>>>> "exported": 0, >>>>> "exported_inodes": 0, >>>>> "imported": 0, >>>>> "imported_inodes": 0 >>>>> }, >>>>> "mds_cache": { >>>>> "num_strays": 14980, >>>>> "num_strays_delayed": 7, >>>>> "num_strays_enqueuing": 0, >>>>> "strays_created": 1672815, >>>>> "strays_enqueued": 1659514, >>>>> "strays_reintegrated": 666, >>>>> "strays_migrated": 0, >>>>> "num_recovering_processing": 0, >>>>> "num_recovering_enqueued": 0, >>>>> "num_recovering_prioritized": 0, >>>>> "recovery_started": 2, >>>>> "recovery_completed": 2, >>>>> "ireq_enqueue_scrub": 0, >>>>> "ireq_exportdir": 0, >>>>> "ireq_flush": 0, >>>>> "ireq_fragmentdir": 41, >>>>> "ireq_fragstats": 0, >>>>> "ireq_inodestats": 0 >>>>> }, >>>>> "mds_log": { >>>>> "evadd": 357717092, >>>>> "evex": 357717106, >>>>> "evtrm": 357716741, >>>>> "ev": 105198, >>>>> "evexg": 0, >>>>> "evexd": 365, >>>>> "segadd": 437124, >>>>> "segex": 437124, >>>>> "segtrm": 437123, >>>>> "seg": 130, >>>>> "segexg": 0, >>>>> "segexd": 1, >>>>> "expos": 6916004026339, >>>>> "wrpos": 6916179441942, >>>>> "rdpos": 6319502327537, >>>>> "jlat": { >>>>> "avgcount": 59071693, >>>>> "sum": 120823.311894779, >>>>> "avgtime": 0.002045367 >>>>> }, >>>>> "replayed": 104847 >>>>> }, >>>>> "mds_mem": { >>>>> "ino": 1599422, >>>>> "ino+": 22152405695, >>>>> "ino-": 22150806273, >>>>> "dir": 256943, >>>>> "dir+": 18460298, >>>>> "dir-": 18203355, >>>>> "dn": 1600689, >>>>> "dn+": 22227888283, >>>>> "dn-": 22226287594, >>>>> "cap": 2211546, >>>>> "cap+": 1674287476, >>>>> "cap-": 1672075930, >>>>> "rss": 16905052, >>>>> "heap": 313916, >>>>> "buf": 0 >>>>> }, >>>>> "mds_server": { >>>>> "dispatch_client_request": 1964131912, >>>>> "dispatch_server_request": 0, >>>>> "handle_client_request": 1822420924, >>>>> "handle_client_session": 15557609, >>>>> 
"handle_slave_request": 0, >>>>> "req_create": 4116952, >>>>> "req_getattr": 18696543, >>>>> "req_getfilelock": 0, >>>>> "req_link": 6625, >>>>> "req_lookup": 1425824734, >>>>> "req_lookuphash": 0, >>>>> "req_lookupino": 0, >>>>> "req_lookupname": 8703, >>>>> "req_lookupparent": 0, >>>>> "req_lookupsnap": 0, >>>>> "req_lssnap": 0, >>>>> "req_mkdir": 371878, >>>>> "req_mknod": 0, >>>>> "req_mksnap": 0, >>>>> "req_open": 351119806, >>>>> "req_readdir": 17103599, >>>>> "req_rename": 2437529, >>>>> "req_renamesnap": 0, >>>>> "req_rmdir": 78789, >>>>> "req_rmsnap": 0, >>>>> "req_rmxattr": 0, >>>>> "req_setattr": 4547650, >>>>> "req_setdirlayout": 0, >>>>> "req_setfilelock": 633219, >>>>> "req_setlayout": 0, >>>>> "req_setxattr": 2, >>>>> "req_symlink": 2520, >>>>> "req_unlink": 1589288 >>>>> }, >>>>> "mds_sessions": { >>>>> "session_count": 321, >>>>> "session_add": 383, >>>>> "session_remove": 62 >>>>> }, >>>>> "objecter": { >>>>> "op_active": 0, >>>>> "op_laggy": 0, >>>>> "op_send": 197932443, >>>>> "op_send_bytes": 605992324653, >>>>> "op_resend": 22, >>>>> "op_reply": 197932421, >>>>> "op": 197932421, >>>>> "op_r": 116256030, >>>>> "op_w": 81676391, >>>>> "op_rmw": 0, >>>>> "op_pg": 0, >>>>> "osdop_stat": 1518341, >>>>> "osdop_create": 4314348, >>>>> "osdop_read": 79810, >>>>> "osdop_write": 59151421, >>>>> "osdop_writefull": 237358, >>>>> "osdop_writesame": 0, >>>>> "osdop_append": 0, >>>>> "osdop_zero": 2, >>>>> "osdop_truncate": 9, >>>>> "osdop_delete": 2320714, >>>>> "osdop_mapext": 0, >>>>> "osdop_sparse_read": 0, >>>>> "osdop_clonerange": 0, >>>>> "osdop_getxattr": 29426577, >>>>> "osdop_setxattr": 8628696, >>>>> "osdop_cmpxattr": 0, >>>>> "osdop_rmxattr": 0, >>>>> "osdop_resetxattrs": 0, >>>>> "osdop_tmap_up": 0, >>>>> "osdop_tmap_put": 0, >>>>> "osdop_tmap_get": 0, >>>>> "osdop_call": 0, >>>>> "osdop_watch": 0, >>>>> "osdop_notify": 0, >>>>> "osdop_src_cmpxattr": 0, >>>>> "osdop_pgls": 0, >>>>> "osdop_pgls_filter": 0, >>>>> "osdop_other": 13551599, >>>>> "linger_active": 0, >>>>> "linger_send": 0, >>>>> "linger_resend": 0, >>>>> "linger_ping": 0, >>>>> "poolop_active": 0, >>>>> "poolop_send": 0, >>>>> "poolop_resend": 0, >>>>> "poolstat_active": 0, >>>>> "poolstat_send": 0, >>>>> "poolstat_resend": 0, >>>>> "statfs_active": 0, >>>>> "statfs_send": 0, >>>>> "statfs_resend": 0, >>>>> "command_active": 0, >>>>> "command_send": 0, >>>>> "command_resend": 0, >>>>> "map_epoch": 3907, >>>>> "map_full": 0, >>>>> "map_inc": 601, >>>>> "osd_sessions": 18, >>>>> "osd_session_open": 20, >>>>> "osd_session_close": 2, >>>>> "osd_laggy": 0, >>>>> "omap_wr": 3595801, >>>>> "omap_rd": 232070972, >>>>> "omap_del": 272598 >>>>> }, >>>>> "purge_queue": { >>>>> "pq_executing_ops": 0, >>>>> "pq_executing": 0, >>>>> "pq_executed": 1659514 >>>>> }, >>>>> "throttle-msgr_dispatch_throttler-mds": { >>>>> "val": 0, >>>>> "max": 104857600, >>>>> "get_started": 0, >>>>> "get": 2541455703, >>>>> "get_sum": 17148691767160, >>>>> "get_or_fail_fail": 0, >>>>> "get_or_fail_success": 2541455703, >>>>> "take": 0, >>>>> "take_sum": 0, >>>>> "put": 2541455703, >>>>> "put_sum": 17148691767160, >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "throttle-objecter_bytes": { >>>>> "val": 0, >>>>> "max": 104857600, >>>>> "get_started": 0, >>>>> "get": 0, >>>>> "get_sum": 0, >>>>> "get_or_fail_fail": 0, >>>>> "get_or_fail_success": 0, >>>>> "take": 197932421, >>>>> "take_sum": 606323353310, >>>>> "put": 182060027, >>>>> "put_sum": 606323353310, >>>>> "wait": { 
>>>>>             "avgcount": 0,
>>>>>             "sum": 0.000000000,
>>>>>             "avgtime": 0.000000000
>>>>>         }
>>>>>     },
>>>>>     "throttle-objecter_ops": {
>>>>>         "val": 0,
>>>>>         "max": 1024,
>>>>>         "get_started": 0,
>>>>>         "get": 0,
>>>>>         "get_sum": 0,
>>>>>         "get_or_fail_fail": 0,
>>>>>         "get_or_fail_success": 0,
>>>>>         "take": 197932421,
>>>>>         "take_sum": 197932421,
>>>>>         "put": 197932421,
>>>>>         "put_sum": 197932421,
>>>>>         "wait": {
>>>>>             "avgcount": 0,
>>>>>             "sum": 0.000000000,
>>>>>             "avgtime": 0.000000000
>>>>>         }
>>>>>     },
>>>>>     "throttle-write_buf_throttle": {
>>>>>         "val": 0,
>>>>>         "max": 3758096384,
>>>>>         "get_started": 0,
>>>>>         "get": 1659514,
>>>>>         "get_sum": 154334946,
>>>>>         "get_or_fail_fail": 0,
>>>>>         "get_or_fail_success": 1659514,
>>>>>         "take": 0,
>>>>>         "take_sum": 0,
>>>>>         "put": 79728,
>>>>>         "put_sum": 154334946,
>>>>>         "wait": {
>>>>>             "avgcount": 0,
>>>>>             "sum": 0.000000000,
>>>>>             "avgtime": 0.000000000
>>>>>         }
>>>>>     },
>>>>>     "throttle-write_buf_throttle-0x55decea8e140": {
>>>>>         "val": 255839,
>>>>>         "max": 3758096384,
>>>>>         "get_started": 0,
>>>>>         "get": 357717092,
>>>>>         "get_sum": 596677113363,
>>>>>         "get_or_fail_fail": 0,
>>>>>         "get_or_fail_success": 357717092,
>>>>>         "take": 0,
>>>>>         "take_sum": 0,
>>>>>         "put": 59071693,
>>>>>         "put_sum": 596676857524,
>>>>>         "wait": {
>>>>>             "avgcount": 0,
>>>>>             "sum": 0.000000000,
>>>>>             "avgtime": 0.000000000
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>
>>>> Maybe there is a memory leak. What is the output of 'ceph tell mds.xx heap
>>>> stats'? If the RSS size keeps increasing, please run the heap profiler for
>>>> a period of time.
>>>>
>>>>
>>>> ceph tell mds.xx heap start_profiler
>>>> "wait some time"
>>>> ceph tell mds.xx heap dump
>>>> ceph tell mds.xx heap stop_profiler
>>>> pprof --pdf <location of ceph-mds binary> /var/log/ceph/mds.xxx.profile.* > profile.pdf
>>>>
>>>> send the profile.pdf to us
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Webert de Souza Lima" <webert.boss@xxxxxxxxx>
>>>>> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>>>> Sent: Monday 14 May 2018 15:14:35
>>>>> Subject: Re: ceph mds memory usage 20GB : is it normal ?
>>>>>
>>>>> On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>>>
>>>>> The documentation (luminous) says:
>>>>>
>>>>>> mds cache size
>>>>>>
>>>>>> Description: The number of inodes to cache. A value of 0 indicates an unlimited number. It is recommended to use mds_cache_memory_limit to limit the amount of memory the MDS cache uses.
>>>>>> Type: 32-bit Integer
>>>>>> Default: 0
>>>>>
>>>>> and my mds_cache_memory_limit is currently at 5GB.
>>>>>
>>>>> yeah, I only suggested that because the high memory usage seemed to trouble you and it might be a bug, so it's more of a workaround.
>>>>>
>>>>> Regards,
>>>>> Webert Lima
>>>>> DevOps Engineer at MAV Tecnologia
>>>>> Belo Horizonte - Brasil
>>>>> IRC NICK - WebertRLZ
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
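
For anyone reproducing Zheng's heap-profiling procedure above, the whole session can also be scripted. A minimal sketch, assuming the daemon name is passed as an argument, a ~10-minute profiling window, the ceph-mds binary at /usr/bin/ceph-mds, and profile dumps landing in /var/log/ceph (all of these are local assumptions to adjust; pprof ships with gperftools/google-perftools):

    #!/bin/sh
    # Heap-profiling session as described in the thread: start the profiler,
    # let the workload run, dump and stop, then render the dumps with pprof.
    MDS="$1"                                    # e.g. ceph4-2.odiso.net (assumed argument)
    ceph tell "mds.$MDS" heap start_profiler
    sleep 600                                   # ~10 min; longer profiling lagged clients above
    ceph tell "mds.$MDS" heap dump
    ceph tell "mds.$MDS" heap stop_profiler
    pprof --pdf /usr/bin/ceph-mds "/var/log/ceph/mds.$MDS.profile."* > profile.pdf

The dump location follows the thread's /var/log/ceph/mds.xxx.profile.* pattern; if the dumps are not there, check your cluster's log dir setting before running pprof.
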