On Sun, Apr 12, 2020 at 9:33 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi John,
>
> Did you make any progress on investigating this?
>
> Today I also saw huge buffer_anon usage, relative to the other pools, on
> our two active MDSs running 14.2.8:
>
>     "mempool": {
>         "by_pool": {
>             "bloom_filter": {
>                 "items": 2322,
>                 "bytes": 2322
>             },
>             ...
>             "buffer_anon": {
>                 "items": 4947214,
>                 "bytes": 19785847411
>             },
>             ...
>             "osdmap": {
>                 "items": 4036,
>                 "bytes": 89488
>             },
>             ...
>             "mds_co": {
>                 "items": 9248718,
>                 "bytes": 157725128
>             },
>             ...
>         },
>         "total": {
>             "items": 14202290,
>             "bytes": 19943664349
>         }
>     }
>
> That MDS has `mds cache memory limit = 15353442304`, and there was no
> health warning about the MDS memory usage exceeding the limit.
> (I only noticed because some other crons on the MDS hosts were going OOM.)
>
> Patrick: is there any known memory leak in the Nautilus MDS?

I restarted one MDS with ms_type = simple, and that MDS maintained a normal
amount of buffer_anon for several hours, while the other active MDS (with
the async ms type) saw its buffer_anon grow by ~10 GB overnight. So it seems
there are still memory leaks with ms_type = async in 14.2.8.

OTOH, the whole cluster is kinda broken now due to
https://tracker.ceph.com/issues/45080, which may be related to
ms_type=simple... I'm still debugging.

Cheers, Dan

> Any tips to debug this further?
>
> Cheers, Dan
>
> On Wed, Mar 4, 2020 at 8:38 PM John Madden <jmadden.com@xxxxxxxxx> wrote:
> >
> > Though it appears potentially(?) better, I'm still having issues with
> > this on 14.2.8. Kick off the ~20 threads sequentially reading ~1M files
> > and buffer_anon still grows, apparently without bound.
> >
> > mds.1 tcmalloc heap stats:------------------------------------------------
> > MALLOC:    53710413656 (51222.2 MiB) Bytes in use by application
> > MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> > MALLOC: +    334028128 (  318.6 MiB) Bytes in central cache freelist
> > MALLOC: +     11210608 (   10.7 MiB) Bytes in transfer cache freelist
> > MALLOC: +     11105240 (   10.6 MiB) Bytes in thread cache freelists
> > MALLOC: +     77525152 (   73.9 MiB) Bytes in malloc metadata
> > MALLOC:   ------------
> > MALLOC: =  54144282784 (51636.0 MiB) Actual memory used (physical + swap)
> > MALLOC: +     49963008 (   47.6 MiB) Bytes released to OS (aka unmapped)
> > MALLOC:   ------------
> > MALLOC: =  54194245792 (51683.7 MiB) Virtual address space used
> > MALLOC:
> > MALLOC:         262021 Spans in use
> > MALLOC:             18 Thread heaps in use
> > MALLOC:           8192 Tcmalloc page size
> > ------------------------------------------------
> >
> > The byte count appears to grow even as the item count drops, though the
> > trend is for both to increase over the life of the workload. Samples of
> > ceph daemon mds.1 dump_mempools | jq .mempool.by_pool.buffer_anon:
> >
> > {
> >   "items": 28045,
> >   "bytes": 24197601109
> > }
> > {
> >   "items": 27132,
> >   "bytes": 24262495865
> > }
> > {
> >   "items": 27105,
> >   "bytes": 24262537939
> > }
> > {
> >   "items": 33309,
> >   "bytes": 29754507505
> > }
> > {
> >   "items": 36160,
> >   "bytes": 31803033733
> > }
> > {
> >   "items": 56772,
> >   "bytes": 51062350351
> > }
> >
> > Is there further data/debug I can retrieve to help track this down?
> >
> >
> > On Wed, Feb 19, 2020 at 4:38 PM John Madden <jmadden.com@xxxxxxxxx> wrote:
> > >
> > > Ah, no, I hadn't seen that. Patiently awaiting .8 then. Thanks!
> > >
> > > On Mon, Feb 17, 2020 at 8:52 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Feb 10, 2020 at 8:31 PM John Madden <jmadden.com@xxxxxxxxx> wrote:
> > > > >
> > > > > Upgraded to 14.2.7, doesn't appear to have affected the behavior. As requested:
> > > >
> > > > In case it wasn't clear -- the fix that Patrick mentioned was
> > > > postponed to 14.2.8.
> > > >
> > > > -- dan
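
(An aside, in case it helps anyone who wants to watch this on their own
cluster: below is a minimal polling sketch over the MDS admin socket. The
daemon name "mds.1", the 60-second interval, and running it on the MDS host
itself are assumptions for illustration, not something from this thread --
adjust to your setup. It just samples buffer_anon and mds_co from
dump_mempools and prints them next to mds_cache_memory_limit so the growth
is easy to graph.)

#!/usr/bin/env bash
# Hypothetical monitoring sketch: poll mempool usage on one MDS daemon.
# Assumes it runs on the host that owns the mds.1 admin socket and that
# jq is installed. Not a supported tool, just a convenience loop.
MDS=mds.1
# Configured cache limit on the daemon, in bytes.
LIMIT=$(ceph daemon "$MDS" config get mds_cache_memory_limit | jq -r .mds_cache_memory_limit)
while true; do
    POOLS=$(ceph daemon "$MDS" dump_mempools)
    ANON=$(echo "$POOLS" | jq .mempool.by_pool.buffer_anon.bytes)
    CO=$(echo "$POOLS" | jq .mempool.by_pool.mds_co.bytes)
    echo "$(date -u +%FT%TZ) buffer_anon=$ANON mds_co=$CO mds_cache_memory_limit=$LIMIT"
    sleep 60
done

Piping the output to a file gives a simple timeline of buffer_anon relative
to the configured limit, which makes it easier to see whether the pool keeps
climbing past the cache limit over the life of a workload.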