On Tue, Apr 14, 2020 at 11:45 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> On Tue, Apr 14, 2020 at 9:41 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Apr 14, 2020 at 2:50 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, Apr 12, 2020 at 9:33 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi John,
> > > >
> > > > Did you make any progress on investigating this?
> > > >
> > > > Today I also saw huge relative buffer_anon usage on our 2 active mds's
> > > > running 14.2.8:
> > > >
> > > >     "mempool": {
> > > >         "by_pool": {
> > > >             "bloom_filter": {
> > > >                 "items": 2322,
> > > >                 "bytes": 2322
> > > >             },
> > > >             ...
> > > >             "buffer_anon": {
> > > >                 "items": 4947214,
> > > >                 "bytes": 19785847411
> > > >             },
> > > >             ...
> > > >             "osdmap": {
> > > >                 "items": 4036,
> > > >                 "bytes": 89488
> > > >             },
> > > >             ...
> > > >             "mds_co": {
> > > >                 "items": 9248718,
> > > >                 "bytes": 157725128
> > > >             },
> > > >             ...
> > > >         },
> > > >         "total": {
> > > >             "items": 14202290,
> > > >             "bytes": 19943664349
> > > >         }
> > > >     }
> > > >
> > > > That mds has `mds cache memory limit = 15353442304` and there was no
> > > > health warning about the mds memory usage exceeding the limit.
> > > > (I only noticed because some other crons on the mds's were going oom).
> > > >
> > > > Patrick: is there any known memory leak in nautilus mds's ?
> > >
> > > I restarted one MDS with ms_type = simple and that MDS maintained a
> > > normal amount of buffer_anon for several hours, while the other active
> > > MDS (with async ms type) saw its buffer_anon grow by some ~10GB
> > > overnight.
> > > So, it seems there are still memory leaks with ms_type = async in 14.2.8.
> > >
> > > OTOH, the whole cluster is kinda broken now due to
> > > https://tracker.ceph.com/issues/45080, which may be related to the
> > > ms_type=simple .. I'm still debugging.
> >
> > Indeed, the combination of msgr v2 and `ms type = simple` on a
> > ceph-mds leads to deadlocked mds ops as soon as any osd restarts.
> > Looks like we have to find the root cause of the memory leak rather
> > than working around it with ms type = simple.
> >
> > Dan
> >
>
> I opened https://tracker.ceph.com/issues/45090. It can explain the
> buffer_anon memory use.
>

please try https://github.com/ceph/ceph/pull/34571 if you can compile
ceph from source.

Regards
Yan, Zheng

> Regards
> Yan, Zheng
>
> >
> > >
> > > Cheers, Dan
> > >
> > > > Any tips to debug this further?
> > > >
> > > > Cheers, Dan
> > > >
> > > > On Wed, Mar 4, 2020 at 8:38 PM John Madden <jmadden.com@xxxxxxxxx> wrote:
> > > > >
> > > > > Though it appears potentially(?) better, I'm still having issues with
> > > > > this on 14.2.8. Kick off the ~20 threads sequentially reading ~1M
> > > > > files and buffer_anon still grows apparently without bound.
> > > > >
> > > > > mds.1 tcmalloc heap stats:------------------------------------------------
> > > > > MALLOC:    53710413656 (51222.2 MiB) Bytes in use by application
> > > > > MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> > > > > MALLOC: +    334028128 (  318.6 MiB) Bytes in central cache freelist
> > > > > MALLOC: +     11210608 (   10.7 MiB) Bytes in transfer cache freelist
> > > > > MALLOC: +     11105240 (   10.6 MiB) Bytes in thread cache freelists
> > > > > MALLOC: +     77525152 (   73.9 MiB) Bytes in malloc metadata
> > > > > MALLOC:   ------------
> > > > > MALLOC: =  54144282784 (51636.0 MiB) Actual memory used (physical + swap)
> > > > > MALLOC: +     49963008 (   47.6 MiB) Bytes released to OS (aka unmapped)
> > > > > MALLOC:   ------------
> > > > > MALLOC: =  54194245792 (51683.7 MiB) Virtual address space used
> > > > > MALLOC:
> > > > > MALLOC:         262021              Spans in use
> > > > > MALLOC:             18              Thread heaps in use
> > > > > MALLOC:           8192              Tcmalloc page size
> > > > > ------------------------------------------------
> > > > >
> > > > > The byte count appears to grow even as the item count drops, though
> > > > > the trend is for both to increase over the life of the workload:
> > > > > ceph daemon mds.1 dump_mempools | jq .mempool.by_pool.buffer_anon:
> > > > >
> > > > > {
> > > > >   "items": 28045,
> > > > >   "bytes": 24197601109
> > > > > }
> > > > > {
> > > > >   "items": 27132,
> > > > >   "bytes": 24262495865
> > > > > }
> > > > > {
> > > > >   "items": 27105,
> > > > >   "bytes": 24262537939
> > > > > }
> > > > > {
> > > > >   "items": 33309,
> > > > >   "bytes": 29754507505
> > > > > }
> > > > > {
> > > > >   "items": 36160,
> > > > >   "bytes": 31803033733
> > > > > }
> > > > > {
> > > > >   "items": 56772,
> > > > >   "bytes": 51062350351
> > > > > }
> > > > >
> > > > > Is there further data/debug I can retrieve to help track this down?
> > > > >
> > > > >
> > > > > On Wed, Feb 19, 2020 at 4:38 PM John Madden <jmadden.com@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Ah, no, I hadn't seen that. Patiently awaiting .8 then. Thanks!
> > > > > >
> > > > > > On Mon, Feb 17, 2020 at 8:52 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Mon, Feb 10, 2020 at 8:31 PM John Madden <jmadden.com@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Upgraded to 14.2.7, doesn't appear to have affected the behavior. As requested:
> > > > > > >
> > > > > > > In case it wasn't clear -- the fix that Patrick mentioned was
> > > > > > > postponed to 14.2.8.
> > > > > > >
> > > > > > > -- dan
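
The tcmalloc summary quoted above can be pulled on demand from a running
daemon. A minimal sketch, assuming a ceph-mds built against tcmalloc and the
daemon name mds.1 used in this thread:

    # Print the allocator summary (the same kind of output pasted above).
    ceph tell mds.1 heap stats

    # Ask tcmalloc to return free pages to the OS. If "Bytes in use by
    # application" barely moves afterwards, the memory is still referenced
    # by the daemon (consistent with buffer_anon growth) rather than just
    # sitting in allocator freelists.
    ceph tell mds.1 heap release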
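To watch buffer_anon over the life of a workload, a simple polling loop can
be left running on the MDS host. This is only a sketch: it assumes local
access to the mds.1 admin socket, that jq is installed, and an arbitrary
60-second sample interval.

    # Log one timestamped buffer_anon sample per minute.
    while true; do
        printf '%s ' "$(date -u +%FT%TZ)"
        ceph daemon mds.1 dump_mempools \
          | jq -r '.mempool.by_pool.buffer_anon | "items=\(.items) bytes=\(.bytes)"'
        sleep 60
    done

Comparing those samples against the `mds cache memory limit` and the heap
stats above makes it easier to tell normal cache growth apart from the kind
of unbounded buffer_anon growth reported in this thread.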