On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
<mykola.dvornik@xxxxxxxxx> wrote:
> After upgrading to 10.2.3 we frequently see messages like

From which version did you upgrade?

> 'rm: cannot remove '...': No space left on device
>
> The folders we are trying to delete contain approx. 50K files 193 KB each.

My guess would be that you are hitting the new
mds_bal_fragment_size_max check. This limits the number of entries
that the MDS will create in a single directory fragment, to avoid
overwhelming the OSD with oversized objects. It is 100000 by default.
This limit also applies to "stray" directories where unlinked files
are put while they wait to be purged, so you could get into this state
while doing lots of deletions. There are ten stray directories that
get a roughly even share of files, so if you have more than about one
million files waiting to be purged, you could see this condition.

The "Client failing to respond to cache pressure" messages may play a
part here -- if you have misbehaving clients then they may cause the
MDS to delay purging stray files, leading to a backlog. If your
clients are by any chance older kernel clients, you should upgrade
them. You can also unmount/remount them to clear this state, although
it will reoccur until the clients are updated (or until the bug is
fixed, if you're already running the latest clients).

The high-level counters for strays are part of the default output of
"ceph daemonperf mds.<id>" when run on the MDS server (the "stry" and
"purg" columns). You can look at these to watch how fast the MDS is
clearing out strays. If your backlog exists only because the MDS is
not purging fast enough, then you can look at tuning
mds_max_purge_files and mds_max_purge_ops to adjust the throttles on
purging. Those settings can be adjusted without restarting the MDS
using the "injectargs" command
(http://docs.ceph.com/docs/master/rados/operations/control/#mds-subsystem).

Let us know how you get on.
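
As a rough illustration of the arithmetic above, here is a small Python
sketch. It is not MDS code. The ten stray directories and the 100000
default come from this message; the function names and the assumption
of a perfectly even spread across the stray directories are
hypothetical and only for illustration.

```python
# Back-of-the-envelope estimate of the stray backlog limit described above.
# From the message: ten stray directories, and mds_bal_fragment_size_max
# defaults to 100000 entries per directory fragment.
NUM_STRAY_DIRS = 10
MDS_BAL_FRAGMENT_SIZE_MAX = 100_000


def stray_capacity(num_dirs=NUM_STRAY_DIRS, frag_max=MDS_BAL_FRAGMENT_SIZE_MAX):
    """Approximate count of unlinked files that can wait for purging.

    Assumes (hypothetically) a perfectly even spread over the stray
    directories; the real share is only roughly even, so one directory
    may reach its limit somewhat earlier.
    """
    return num_dirs * frag_max


def backlog_may_hit_limit(pending_strays):
    """Hypothetical helper: True once the backlog reaches the rough limit."""
    return pending_strays >= stray_capacity()


if __name__ == "__main__":
    print("approximate stray capacity:", stray_capacity())
    print("50K pending strays at limit?", backlog_may_hit_limit(50_000))
```

By this estimate, deleting a single 50K-file folder would not reach the
limit on its own; a backlog of strays that are not being purged would
have to build up first.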
John

> The cluster state and storage available are both OK:
>
>     cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
>      health HEALTH_WARN
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>      monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
>             election epoch 11, quorum 0 000-s-ragnarok
>       fsmap e62643: 1/1/1 up {0=000-s-ragnarok=up:active}
>      osdmap e20203: 16 osds: 16 up, 16 in
>             flags sortbitwise
>       pgmap v15284654: 1088 pgs, 2 pools, 11263 GB data, 40801 kobjects
>             23048 GB used, 6745 GB / 29793 GB avail
>                 1085 active+clean
>                    2 active+clean+scrubbing
>                    1 active+clean+scrubbing+deep
>
> Has anybody experienced this issue so far?
>
> Regards,
> --
> Mykola
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com