Re: CephFS: No space left on device

John Spray <jspray@xxxxxxxxxx> · Sun, 2 Oct 2016 21:27:22 +0100

On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
<mykola.dvornik@xxxxxxxxx> wrote:
> After upgrading to 10.2.3 we frequently see messages like

From which version did you upgrade?

> 'rm: cannot remove '...': No space left on device
>
> The folders we are trying to delete contain approx. 50K files, 193 KB each.

My guess would be that you are hitting the new
mds_bal_fragment_size_max check.  This limits the number of entries
that the MDS will create in a single directory fragment, to avoid
overwhelming the OSD with oversized objects.  It is 100000 by default.
This limit also applies to "stray" directories where unlinked files
are put while they wait to be purged, so you could get into this state
while doing lots of deletions.  There are ten stray directories that
get a roughly even share of files, so if you have more than about one
million files waiting to be purged, you could see this condition.
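
The "about one million" figure above is just the per-fragment limit multiplied by the number of stray directories. A minimal sketch of that arithmetic follows, assuming each stray directory is a single fragment and that strays are spread perfectly evenly (in practice the share is only roughly even, so one directory can fill up earlier). The function names are hypothetical and used only for illustration.

```python
# Values quoted in the message above; both are assumptions about your cluster
# and should be checked against its actual configuration.
MDS_BAL_FRAGMENT_SIZE_MAX = 100_000  # default max entries per directory fragment
NUM_STRAY_DIRS = 10                  # stray directories sharing unlinked files


def stray_capacity(fragment_size_max=MDS_BAL_FRAGMENT_SIZE_MAX,
                   num_stray_dirs=NUM_STRAY_DIRS):
    """Approximate number of strays the MDS can hold before hitting the limit."""
    return fragment_size_max * num_stray_dirs


def stray_headroom(num_strays, **kwargs):
    """Strays that can still be added before the approximate capacity is reached."""
    return stray_capacity(**kwargs) - num_strays


if __name__ == "__main__":
    print(stray_capacity())         # 1000000
    print(stray_headroom(950_000))  # 50000
```

Because the distribution is only roughly even, treat the result as an upper bound rather than an exact threshold.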

The "Client failing to respond to cache pressure" messages may play a
part here -- if you have misbehaving clients then they may cause the
MDS to delay purging stray files, leading to a backlog.  If your
clients are by any chance older kernel clients, you should upgrade
them.  You can also unmount/remount them to clear this state, although
it will recur until the clients are updated (or until the bug is
fixed, if you're already running the latest clients).

The high level counters for strays are part of the default output of
"ceph daemonperf mds.<id>" when run on the MDS server (the "stry" and
"purg" columns).  You can look at these to watch how fast the MDS is
clearing out strays.  If your backlog is just because it's not doing
it fast enough, then you can look at tuning mds_max_purge_files and
mds_max_purge_ops to adjust the throttles on purging.  Those settings
can be adjusted without restarting the MDS using the "injectargs"
command (http://docs.ceph.com/docs/master/rados/operations/control/#mds-subsystem).
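
As a rough illustration of the monitoring and tuning steps above, the sketch below does two things: it pulls a stray count out of the JSON printed by "ceph daemon mds.<id> perf dump", and it builds (without running) a "ceph tell mds.<id> injectargs" command line. The counter location ("mds_cache" / "num_strays") is an assumption about the perf dump layout and should be checked against your version. The helper names and the numeric values are hypothetical and are not tuning recommendations.

```python
import json


def read_num_strays(perf_dump_json):
    """Return the stray count from perf dump JSON text, or None if absent.

    Assumes the counter lives at mds_cache.num_strays; verify on your release.
    """
    data = json.loads(perf_dump_json)
    return data.get("mds_cache", {}).get("num_strays")


def injectargs_command(mds_id, **settings):
    """Build the argv for 'ceph tell mds.<id> injectargs' without running it."""
    args = " ".join(
        "--{} {}".format(name, value) for name, value in sorted(settings.items())
    )
    return ["ceph", "tell", "mds.{}".format(mds_id), "injectargs", args]


if __name__ == "__main__":
    # Hypothetical sample of perf dump output, trimmed to the one counter used.
    sample = '{"mds_cache": {"num_strays": 123456}}'
    print(read_num_strays(sample))

    # Illustrative values only; choose throttles that suit your cluster.
    cmd = injectargs_command("0", mds_max_purge_files=128, mds_max_purge_ops=16384)
    print(cmd)
```

The command is returned as a list so that it can be inspected first and then passed to subprocess.run() on the MDS host if it looks right.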

Let us know how you get on.

John

> The cluster state and storage available are both OK:
>
>     cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
>      health HEALTH_WARN
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache pressure
>      monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
>             election epoch 11, quorum 0 000-s-ragnarok
>       fsmap e62643: 1/1/1 up {0=000-s-ragnarok=up:active}
>      osdmap e20203: 16 osds: 16 up, 16 in
>             flags sortbitwise
>       pgmap v15284654: 1088 pgs, 2 pools, 11263 GB data, 40801 kobjects
>             23048 GB used, 6745 GB / 29793 GB avail
>                 1085 active+clean
>                    2 active+clean+scrubbing
>                    1 active+clean+scrubbing+deep
>
>
> Has anybody experienced this issue so far?
>
> Regards,
> --
>  Mykola
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>