I'm not sure this is a cache issue; to me it feels like a memory leak. The MDS process is now at 129GB resident (I haven't had a window to upgrade yet) against a configured 80GB cache.
[root@mds0 ceph-admin]# ceph daemon mds.mds0 cache status
{
    "pool": {
        "items": 166753076,
        "bytes": 71766944952
    }
}
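For context, here's a minimal sketch of how I'm cross-checking the numbers above (assuming the daemon exposes the limit as mds_cache_memory_limit, which is what I have set to 80GB):

# configured cache memory limit, in bytes, as the running daemon sees it
ceph daemon mds.mds0 config get mds_cache_memory_limit

# resident set size (KiB) of the ceph-mds process, to compare against the cache pool bytes
ps -o rss= -C ceph-mds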
I ran a 10-minute heap profile:
[root@mds0 ceph-admin]# ceph tell mds.mds0 heap start_profiler
2018-05-25 08:15:04.428519 7f3f657fa700 0 client.127046191 ms_handle_reset on 10.124.103.50:6800/2248223690
2018-05-25 08:15:04.447528 7f3f667fc700 0 client.127055541 ms_handle_reset on 10.124.103.50:6800/2248223690
mds.mds0 started profiler
[root@mds0 ceph-admin]# ceph tell mds.mds0 heap dump
2018-05-25 08:25:14.265450 7f1774ff9700 0 client.127057266 ms_handle_reset on 10.124.103.50:6800/2248223690
2018-05-25 08:25:14.356292 7f1775ffb700 0 client.127057269 ms_handle_reset on 10.124.103.50:6800/2248223690
mds.mds0 dumping heap profile now.
------------------------------------------------
MALLOC: 123658130320 (117929.6 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 6969713096 ( 6646.8 MiB) Bytes in central cache freelist
MALLOC: + 26700832 ( 25.5 MiB) Bytes in transfer cache freelist
MALLOC: + 54460040 ( 51.9 MiB) Bytes in thread cache freelists
MALLOC: + 531034272 ( 506.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 131240038560 (125160.3 MiB) Actual memory used (physical + swap)
MALLOC: + 7426875392 ( 7082.8 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 138666913952 (132243.1 MiB) Virtual address space used
MALLOC:
MALLOC: 7434952 Spans in use
MALLOC: 20 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
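If the gap were just tcmalloc sitting on freed pages, I'd expect the release command to hand it back. A quick sketch of what I'd try next (heap release/stats are the stock "ceph tell ... heap" subcommands, though I haven't verified they make a dent here):

# ask tcmalloc to release freelist memory back to the OS (ReleaseFreeMemory via madvise)
ceph tell mds.mds0 heap release

# re-check the MALLOC summary afterwards
ceph tell mds.mds0 heap stats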
[root@mds0 ceph-admin]# ceph tell mds.mds0 heap stop_profiler
2018-05-25 08:25:26.394877 7fbe48ff9700 0 client.127047898 ms_handle_reset on 10.124.103.50:6800/2248223690
2018-05-25 08:25:26.736909 7fbe49ffb700 0 client.127035608 ms_handle_reset on 10.124.103.50:6800/2248223690
mds.mds0 stopped profiler
[root@mds0 ceph-admin]# pprof --pdf /bin/ceph-mds /var/log/ceph/mds.mds0.profile.000* > profile.pdf
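In case the attached PDF is awkward to read, the same profile dumps can also be summarized as text (same files as above; --text is a standard pprof output mode):

# top allocation sites as a flat text listing instead of a call-graph PDF
pprof --text /bin/ceph-mds /var/log/ceph/mds.mds0.profile.000* | head -30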
On Thu, May 10, 2018 at 2:11 PM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> On Thu, May 10, 2018 at 12:00 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
>> ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32
>> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph
>>
>>
>> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
>> {
>>     "pool": {
>>         "items": 173261056,
>>         "bytes": 76504108600
>>     }
>> }
>>
>> So, 80GB is my configured limit for the cache and it appears the mds is
>> following that limit. But, the mds process is using over 100GB RAM in my
>> 128GB host. I thought I was playing it safe by configuring at 80. What other
>> things consume a lot of RAM for this process?
>>
>> Let me know if I need to create a new thread.
>
> The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade ASAP.
>
> [1] https://tracker.ceph.com/issues/22972
> --
> Patrick Donnelly
Attachment: profile.pdf (Adobe PDF document)