Re: Insane CPU utilization in ceph.fuse


 



Thanks!

It's true that I've seen continuous memory growth, but I hadn't considered a memory leak. I don't remember exactly how many hours it took to fill the memory, but I estimate it was about 14h.

With the new configuration, memory seems to grow slowly and stops when it reaches 5-6 GB. Sometimes it looks like the daemon flushes its memory, drops back below 1 GB, and then slowly grows to 5-6 GB again.

Just today, I don't know why or how (I haven't changed anything on the Ceph cluster), the memory dropped to less than 1 GB and is still there 8 hours later. The only thing I did was deploy a git repository with some changes.

Some of my nodes are still on version 12.2.5, because when I detected this problem I didn't know whether it was caused by the latest version, so I stopped the update. The node that is the active MDS is on the latest version (12.2.7), and I've scheduled the update of the remaining nodes for Thursday.

Here is a graph of the memory usage over the last few days with that configuration:
https://imgur.com/a/uSsvBi4

I don't have info from when the problem was at its worst (512 MB MDS memory limit and 15-16 GB of usage), because memory usage wasn't being logged then. I only have heap stats that were dumped while the daemon was in the process of filling up memory:

# ceph tell mds.kavehome-mgto-pro-fs01 heap stats
2018-07-19 00:43:46.142560 7f5a7a7fc700  0 client.1318388 ms_handle_reset on 10.22.0.168:6800/1129848128
2018-07-19 00:43:46.181133 7f5a7b7fe700  0 client.1318391 ms_handle_reset on 10.22.0.168:6800/1129848128
mds.kavehome-mgto-pro-fs01 tcmalloc heap stats:------------------------------------------------
MALLOC:     9982980144 ( 9520.5 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    172148208 (  164.2 MiB) Bytes in central cache freelist
MALLOC: +     19031168 (   18.1 MiB) Bytes in transfer cache freelist
MALLOC: +     23987552 (   22.9 MiB) Bytes in thread cache freelists
MALLOC: +     20869280 (   19.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  10219016352 ( 9745.6 MiB) Actual memory used (physical + swap)
MALLOC: +   3913687040 ( 3732.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  14132703392 (13478.0 MiB) Virtual address space used
MALLOC:
MALLOC:          63875              Spans in use
MALLOC:             16              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
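As a sanity check, the MiB figures in the tcmalloc dump can be reproduced from the raw byte counts (a quick sketch; the constants are copied verbatim from the output above):

```python
# Recompute the human-readable MiB figures from the raw tcmalloc byte
# counts in the heap stats above.
MiB = 1024 ** 2

in_use = 9982980144         # "Bytes in use by application"
released = 3913687040       # "Bytes released to OS (aka unmapped)"
virtual_used = 14132703392  # "Virtual address space used"

print(f"in use:   {in_use / MiB:8.1f} MiB")        # ~9520.5 MiB
print(f"released: {released / MiB:8.1f} MiB")      # ~3732.4 MiB
print(f"virtual:  {virtual_used / MiB:8.1f} MiB")  # ~13478.0 MiB
```

The ~9.5 GiB "in use by application" here is the number that matters for the leak discussion: it is live allocations, not freelist memory that a `heap release` would return to the OS.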



Here's the Diff:
--------------------------------------------------------------------------------------------------------------------
{
    "diff": {
        "current": {
            "admin_socket": "/var/run/ceph/ceph-mds.kavehome-mgto-pro-fs01.asok",
            "auth_client_required": "cephx",
            "bluestore_cache_size_hdd": "80530636",
            "bluestore_cache_size_ssd": "80530636",
            "err_to_stderr": "true",
            "fsid": "f015f888-6e0c-4203-aea8-ef0f69ef7bd8",
            "internal_safe_to_start_threads": "true",
            "keyring": "/var/lib/ceph/mds/ceph-kavehome-mgto-pro-fs01/keyring",
            "log_file": "/var/log/ceph/ceph-mds.kavehome-mgto-pro-fs01.log",
            "log_max_recent": "10000",
            "log_to_stderr": "false",
            "mds_cache_memory_limit": "53687091",
            "mds_data": "/var/lib/ceph/mds/ceph-kavehome-mgto-pro-fs01",
            "mgr_data": "/var/lib/ceph/mgr/ceph-kavehome-mgto-pro-fs01",
            "mon_cluster_log_file": "default=/var/log/ceph/ceph.$channel.log cluster=/var/log/ceph/ceph.log",
            "mon_data": "/var/lib/ceph/mon/ceph-kavehome-mgto-pro-fs01",
            "mon_debug_dump_location": "/var/log/ceph/ceph-mds.kavehome-mgto-pro-fs01.tdump",
            "mon_host": "10.22.0.168,10.22.0.140,10.22.0.127",
            "mon_initial_members": "kavehome-mgto-pro-fs01, kavehome-mgto-pro-fs02, kavehome-mgto-pro-fs03",
            "osd_data": "/var/lib/ceph/osd/ceph-kavehome-mgto-pro-fs01",
            "osd_journal": "/var/lib/ceph/osd/ceph-kavehome-mgto-pro-fs01/journal",
            "public_addr": "10.22.0.168:0/0",
            "public_network": "10.22.0.0/24",
            "rgw_data": "/var/lib/ceph/radosgw/ceph-kavehome-mgto-pro-fs01",
            "setgroup": "ceph",
            "setuser": "ceph"
        },
        "defaults": {
            "admin_socket": "",
            "auth_client_required": "cephx, none",
            "bluestore_cache_size_hdd": "1073741824",
            "bluestore_cache_size_ssd": "3221225472",
            "err_to_stderr": "false",
            "fsid": "00000000-0000-0000-0000-000000000000",
            "internal_safe_to_start_threads": "false",
            "keyring": "/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,",
            "log_file": "",
            "log_max_recent": "500",
            "log_to_stderr": "true",
            "mds_cache_memory_limit": "1073741824",
            "mds_data": "/var/lib/ceph/mds/$cluster-$id",
            "mgr_data": "/var/lib/ceph/mgr/$cluster-$id",
            "mon_cluster_log_file": "default=/var/log/ceph/$cluster.$channel.log cluster=/var/log/ceph/$cluster.log",
            "mon_data": "/var/lib/ceph/mon/$cluster-$id",
            "mon_debug_dump_location": "/var/log/ceph/$cluster-$name.tdump",
            "mon_host": "",
            "mon_initial_members": "",
            "osd_data": "/var/lib/ceph/osd/$cluster-$id",
            "osd_journal": "/var/lib/ceph/osd/$cluster-$id/journal",
            "public_addr": "-",
            "public_network": "",
            "rgw_data": "/var/lib/ceph/radosgw/$cluster-$id",
            "setgroup": "",
            "setuser": ""
        }
    },
    "unknown": []
}
----------------------------------------------------------------------------------------------------------
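For reference, the `mds_cache_memory_limit` values in the diff convert to human units as follows (a quick calculation using the byte values shown above):

```python
# Convert the cache-limit values from the config diff into human units.
current = 53687091    # mds_cache_memory_limit (current)
default = 1073741824  # mds_cache_memory_limit (default)

print(f"current: {current / 2**20:.1f} MiB")  # ~51.2 MiB
print(f"default: {default / 2**30:.1f} GiB")  # 1.0 GiB
print(f"ratio:   {current / default:.3f}")    # ~0.050, i.e. 5% of the default
```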



Perf Dump
---------------------------------------------------------------------------------------------------------
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 1350895,
        "msgr_send_messages": 1593759,
        "msgr_recv_bytes": 301786293,
        "msgr_send_bytes": 341807191,
        "msgr_created_connections": 148,
        "msgr_active_connections": 45,
        "msgr_running_total_time": 119.217157290,
        "msgr_running_send_time": 39.714493374,
        "msgr_running_recv_time": 127.455260807,
        "msgr_running_fast_dispatch_time": 0.117634930
    },
    "AsyncMessenger::Worker-1": {
        "msgr_recv_messages": 2996114,
        "msgr_send_messages": 3113274,
        "msgr_recv_bytes": 804875332,
        "msgr_send_bytes": 1231962873,
        "msgr_created_connections": 151,
        "msgr_active_connections": 48,
        "msgr_running_total_time": 248.962533700,
        "msgr_running_send_time": 83.497214869,
        "msgr_running_recv_time": 547.534653813,
        "msgr_running_fast_dispatch_time": 0.125151678
    },
    "AsyncMessenger::Worker-2": {
        "msgr_recv_messages": 1793419,
        "msgr_send_messages": 2117240,
        "msgr_recv_bytes": 1425674729,
        "msgr_send_bytes": 871324466,
        "msgr_created_connections": 325,
        "msgr_active_connections": 54,
        "msgr_running_total_time": 160.001753142,
        "msgr_running_send_time": 49.679463024,
        "msgr_running_recv_time": 205.535692064,
        "msgr_running_fast_dispatch_time": 4.350479591
    },
    "finisher-PurgeQueue": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 755,
            "sum": 0.022316252,
            "avgtime": 0.000029557
        }
    },
    "mds": {
        "request": 4942944,
        "reply": 489638,
        "reply_latency": {
            "avgcount": 489638,
            "sum": 771.955019623,
            "avgtime": 0.001576583
        },
        "forward": 4453296,
        "dir_fetch": 101036,
        "dir_commit": 3,
        "dir_split": 0,
        "dir_merge": 0,
        "inode_max": 2147483647,
        "inodes": 505,
        "inodes_top": 96,
        "inodes_bottom": 398,
        "inodes_pin_tail": 11,
        "inodes_pinned": 367,
        "inodes_expired": 1556356,
        "inodes_with_caps": 325,
        "caps": 1192,
        "subtrees": 16,
        "traverse": 4956673,
        "traverse_hit": 496867,
        "traverse_forward": 4450841,
        "traverse_discover": 166,
        "traverse_dir_fetch": 1657,
        "traverse_remote_ino": 0,
        "traverse_lock": 19,
        "load_cent": 494278118,
        "q": 0,
        "exported": 1187,
        "exported_inodes": 664127,
        "imported": 947,
        "imported_inodes": 76628
    },
    "mds_cache": {
        "num_strays": 0,
        "num_strays_delayed": 0,
        "num_strays_enqueuing": 0,
        "strays_created": 124,
        "strays_enqueued": 124,
        "strays_reintegrated": 0,
        "strays_migrated": 0,
        "num_recovering_processing": 0,
        "num_recovering_enqueued": 0,
        "num_recovering_prioritized": 0,
        "recovery_started": 0,
        "recovery_completed": 0,
        "ireq_enqueue_scrub": 0,
        "ireq_exportdir": 1189,
        "ireq_flush": 0,
        "ireq_fragmentdir": 0,
        "ireq_fragstats": 0,
        "ireq_inodestats": 0
    },
    "mds_log": {
        "evadd": 125666,
        "evex": 116984,
        "evtrm": 116984,
        "ev": 117582,
        "evexg": 0,
        "evexd": 933,
        "segadd": 138,
        "segex": 138,
        "segtrm": 138,
        "seg": 129,
        "segexg": 0,
        "segexd": 1,
        "expos": 25715287703,
        "wrpos": 25862332030,
        "rdpos": 25663431097,
        "jlat": {
            "avgcount": 23473,
            "sum": 98.111299299,
            "avgtime": 0.004179751
        },
        "replayed": 108900
    },
    "mds_mem": {
        "ino": 507,
        "ino+": 1579334,
        "ino-": 1578827,
        "dir": 312,
        "dir+": 101932,
        "dir-": 101620,
        "dn": 529,
        "dn+": 1580751,
        "dn-": 1580222,
        "cap": 1192,
        "cap+": 1825843,
        "cap-": 1824651,
        "rss": 258840,
        "heap": 313880,
        "buf": 0
    },
    "mds_server": {
        "dispatch_client_request": 5081829,
        "dispatch_server_request": 540,
        "handle_client_request": 4942944,
        "handle_client_session": 233505,
        "handle_slave_request": 846,
        "req_create": 128,
        "req_getattr": 38805,
        "req_getfilelock": 0,
        "req_link": 0,
        "req_lookup": 242216,
        "req_lookuphash": 0,
        "req_lookupino": 0,
        "req_lookupname": 2,
        "req_lookupparent": 0,
        "req_lookupsnap": 0,
        "req_lssnap": 0,
        "req_mkdir": 0,
        "req_mknod": 0,
        "req_mksnap": 0,
        "req_open": 2155,
        "req_readdir": 206315,
        "req_rename": 21,
        "req_renamesnap": 0,
        "req_rmdir": 0,
        "req_rmsnap": 0,
        "req_rmxattr": 0,
        "req_setattr": 2,
        "req_setdirlayout": 0,
        "req_setfilelock": 0,
        "req_setlayout": 0,
        "req_setxattr": 0,
        "req_symlink": 0,
        "req_unlink": 122
    },
    "mds_sessions": {
        "session_count": 10,
        "session_add": 128,
        "session_remove": 118
    },
    "objecter": {
        "op_active": 0,
        "op_laggy": 0,
        "op_send": 136767,
        "op_send_bytes": 202196534,
        "op_resend": 0,
        "op_reply": 136767,
        "op": 136767,
        "op_r": 101193,
        "op_w": 35574,
        "op_rmw": 0,
        "op_pg": 0,
        "osdop_stat": 5,
        "osdop_create": 0,
        "osdop_read": 150,
        "osdop_write": 23587,
        "osdop_writefull": 11750,
        "osdop_writesame": 0,
        "osdop_append": 0,
        "osdop_zero": 2,
        "osdop_truncate": 0,
        "osdop_delete": 228,
        "osdop_mapext": 0,
        "osdop_sparse_read": 0,
        "osdop_clonerange": 0,
        "osdop_getxattr": 100784,
        "osdop_setxattr": 0,
        "osdop_cmpxattr": 0,
        "osdop_rmxattr": 0,
        "osdop_resetxattrs": 0,
        "osdop_tmap_up": 0,
        "osdop_tmap_put": 0,
        "osdop_tmap_get": 0,
        "osdop_call": 0,
        "osdop_watch": 0,
        "osdop_notify": 0,
        "osdop_src_cmpxattr": 0,
        "osdop_pgls": 0,
        "osdop_pgls_filter": 0,
        "osdop_other": 3,
        "linger_active": 0,
        "linger_send": 0,
        "linger_resend": 0,
        "linger_ping": 0,
        "poolop_active": 0,
        "poolop_send": 0,
        "poolop_resend": 0,
        "poolstat_active": 0,
        "poolstat_send": 0,
        "poolstat_resend": 0,
        "statfs_active": 0,
        "statfs_send": 0,
        "statfs_resend": 0,
        "command_active": 0,
        "command_send": 0,
        "command_resend": 0,
        "map_epoch": 468,
        "map_full": 0,
        "map_inc": 39,
        "osd_sessions": 3,
        "osd_session_open": 479,
        "osd_session_close": 476,
        "osd_laggy": 0,
        "omap_wr": 7,
        "omap_rd": 202074,
        "omap_del": 1
    },
    "purge_queue": {
        "pq_executing_ops": 0,
        "pq_executing": 0,
        "pq_executed": 124
    },
    "throttle-msgr_dispatch_throttler-mds": {
        "val": 0,
        "max": 104857600,
        "get_started": 0,
        "get": 6140428,
        "get_sum": 2077944682,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 6140428,
        "take": 0,
        "take_sum": 0,
        "put": 6140428,
        "put_sum": 2077944682,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "throttle-objecter_bytes": {
        "val": 0,
        "max": 104857600,
        "get_started": 0,
        "get": 0,
        "get_sum": 0,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 0,
        "take": 136767,
        "take_sum": 339484250,
        "put": 136523,
        "put_sum": 339484250,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "throttle-objecter_ops": {
        "val": 0,
        "max": 1024,
        "get_started": 0,
        "get": 0,
        "get_sum": 0,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 0,
        "take": 136767,
        "take_sum": 136767,
        "put": 136767,
        "put_sum": 136767,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "throttle-write_buf_throttle": {
        "val": 0,
        "max": 3758096384,
        "get_started": 0,
        "get": 124,
        "get_sum": 11532,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 124,
        "take": 0,
        "take_sum": 0,
        "put": 109,
        "put_sum": 11532,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "throttle-write_buf_throttle-0x55faf5ba4220": {
        "val": 0,
        "max": 3758096384,
        "get_started": 0,
        "get": 125666,
        "get_sum": 198900816,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 125666,
        "take": 0,
        "take_sum": 0,
        "put": 23473,
        "put_sum": 198900816,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}
----------------------------------------------------------------------------------------------
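One thing that stands out in the `mds` counters above: the vast majority of client requests are being forwarded to another rank rather than answered by this MDS (a rough calculation from the dump, noting that forward + reply ≈ request):

```python
# Rough ratios from the "mds" section of the perf dump above.
request = 4942944
reply = 489638
forward = 4453296

print(f"forwarded: {forward / request:.1%}")  # ~90.1% of requests
print(f"replied:   {reply / request:.1%}")    # ~9.9% of requests

# Average reply latency, from reply_latency sum / avgcount:
print(f"avg reply latency: {771.955019623 / 489638 * 1000:.2f} ms")  # ~1.58 ms
```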



dump_mempools
----------------------------------------------------------------------------------------------
{
    "bloom_filter": {
        "items": 120,
        "bytes": 120
    },
    "bluestore_alloc": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_cache_data": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_cache_onode": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_cache_other": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_fsck": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_txc": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_writing_deferred": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_writing": {
        "items": 0,
        "bytes": 0
    },
    "bluefs": {
        "items": 0,
        "bytes": 0
    },
    "buffer_anon": {
        "items": 96401,
        "bytes": 16010198
    },
    "buffer_meta": {
        "items": 1,
        "bytes": 88
    },
    "osd": {
        "items": 0,
        "bytes": 0
    },
    "osd_mapbl": {
        "items": 0,
        "bytes": 0
    },
    "osd_pglog": {
        "items": 0,
        "bytes": 0
    },
    "osdmap": {
        "items": 80,
        "bytes": 3296
    },
    "osdmap_mapping": {
        "items": 0,
        "bytes": 0
    },
    "pgmap": {
        "items": 0,
        "bytes": 0
    },
    "mds_co": {
        "items": 17604,
        "bytes": 2330840
    },
    "unittest_1": {
        "items": 0,
        "bytes": 0
    },
    "unittest_2": {
        "items": 0,
        "bytes": 0
    },
    "total": {
        "items": 114206,
        "bytes": 18344542
    }
}
-------------------------------------------------------------------------------------------------------------------
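For comparison, the mempool accounting above sums to only about 17.5 MiB, orders of magnitude less than what tcmalloc reports in use, so almost all of the growth is outside the tracked pools (a quick check, reusing totals from the two dumps above):

```python
# Compare the dump_mempools "total" above with the tcmalloc in-use
# figure from the earlier heap stats.
mempool_bytes = 18344542      # dump_mempools "total" -> "bytes"
tcmalloc_in_use = 9982980144  # "Bytes in use by application"

print(f"mempools:  {mempool_bytes / 2**20:.1f} MiB")      # ~17.5 MiB
print(f"tcmalloc:  {tcmalloc_in_use / 2**20:.1f} MiB")    # ~9520.5 MiB
print(f"untracked: {(tcmalloc_in_use - mempool_bytes) / 2**30:.2f} GiB")  # ~9.28 GiB
```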


Sorry for my English!


Greetings!!



On 23 Jul 2018 at 20:08, "Patrick Donnelly" <pdonnell@xxxxxxxxxx> wrote:
On Mon, Jul 23, 2018 at 5:48 AM, Daniel Carrasco <d.carrasco@xxxxxxxxx> wrote:
> Hi, thanks for your response.
>
> There are about 6 clients, and 4 of them are on standby most of the time.
> Only two are active servers serving the webpage. We also have Varnish in
> front, so they aren't getting all the load (below 30% in PHP is not much).
> About the MDS cache: I now have mds_cache_memory_limit at 8Mb.

What! Please post `ceph daemon mds.<name> config diff`,  `... perf
dump`, and `... dump_mempools `  from the server the active MDS is on.


> I've also tested
> 512Mb, but the CPU usage is the same and the MDS RAM usage grows to
> 15GB (on a 16Gb server it starts to swap and everything fails). With 8Mb, at
> least the memory usage stays below 6Gb (right now it's using about 1GB of RAM).

We've seen reports of possible memory leaks before and the potential
fixes for those were in 12.2.6. How fast does your MDS reach 15GB?
Your MDS cache size should be configured to 1-8GB (depending on your
preference) so it's disturbing to see you set it so low.


--
Patrick Donnelly

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
