Hi Andras,

On Thu, Jan 18, 2018 at 3:38 AM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> Some other symptoms of the problem: when the MDS has been running for a few
> days, it starts looking really busy.  At this time, listing directories
> becomes really slow.  An "ls -l" on a directory with about 250 entries takes
> about 2.5 seconds.  All the metadata is on OSDs with NVMe backing stores.
> Interestingly enough the memory usage seems pretty low (compared to the
> allowed cache limit).
>
>     PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
> 1604408 ceph  20   0 3710304 2.387g 18360 S 100.0  0.9 757:06.92 /usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup ceph
>
> Once I bounce it (fail it over), the CPU usage goes down to the 10-25%
> range.  The same ls -l after the bounce takes about 0.5 seconds.  I
> remounted the filesystem before each test to ensure there isn't anything
> cached.
>
>     PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
>  111100 ceph  20   0 6537052 5.864g 18500 S  17.6  2.3   9:23.55 /usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup ceph
>
> Also, I have a crawler that crawls the file system periodically.  Normally
> the full crawl runs for about 24 hours, but with the slowing down MDS, now
> it has been running for more than 2 days and isn't close to finishing.
>
> The MDS related settings we are running with are:
>
> mds_cache_memory_limit = 17179869184
> mds_cache_reservation = 0.10

Debug logs from the MDS at that time would be helpful with `debug mds
= 20` and `debug ms = 1`. Feel free to create a tracker ticket and use
ceph-post-file [1] to share logs.

[1] http://docs.ceph.com/docs/hammer/man/8/ceph-post-file/

-- 
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
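
For reference, a minimal sketch of how the two debug settings requested above might look in ceph.conf. Placing them in the `[mds]` section is an assumption here, not something stated in the thread; a per-daemon section or a runtime change through the admin socket may suit the cluster better.

```ini
# Sketch only: assumes the settings belong in the [mds] section of
# ceph.conf on the host running the busy MDS.
[mds]
# Verbose MDS subsystem logging, as requested in the reply.
debug mds = 20
# Messenger logging, as requested in the reply.
debug ms = 1
```

Logging at level 20 is very verbose, so it is probably worth enabling it only while the MDS is in the slow state and reverting it once the logs have been collected.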