Fatigue for XFS

andrey@xxxxxxx (Andrey Korolyov) · Mon, 5 May 2014 23:49:05 +0400

Hello,

We are currently exploring an issue which may be related to Ceph
itself or to XFS - any help is much appreciated.

First, the picture: on a relatively old cluster with two years of
uptime, ten months after the fs was recreated on every OSD, one of the
daemons started to flap approximately once per day, and has been doing
so for a couple of weeks, with no external reason
(bandwidth/IOPS/host issues). It looks almost the same every time -
the OSD suddenly stops serving requests for a short period, gets
kicked out by peer reports, then returns in a couple of seconds. Of
course, a small but sensitive number of requests are delayed by 15-30
seconds twice, which is bad for us. The only thing which correlates
with this kick is a peak of I/O, not too large, not even saturating
the underlying disk, but the only one in the cluster and clearly
visible. Also, there are at least two occurrences *without* a
correlated iowait peak.
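
A minimal sketch of how the flap/I/O correlation described above could
be checked, assuming the flap times and the I/O peak times have already
been pulled out of the OSD logs and the monitoring data by hand. The
function name, the 60-second window and the sample timestamps are all
made up for illustration; none of them come from the cluster in
question.

```python
from datetime import datetime, timedelta


def split_flaps_by_io_peak(flap_times, peak_times, window=timedelta(seconds=60)):
    """Split flap timestamps into (correlated, uncorrelated) lists.

    A flap counts as correlated if any I/O peak lies within `window`
    of it, before or after.
    """
    correlated, uncorrelated = [], []
    for flap in flap_times:
        near_peak = any(abs(flap - peak) <= window for peak in peak_times)
        (correlated if near_peak else uncorrelated).append(flap)
    return correlated, uncorrelated


if __name__ == "__main__":
    # Hypothetical sample data, not real measurements.
    flaps = [
        datetime(2014, 5, 1, 3, 12, 40),
        datetime(2014, 5, 2, 4, 1, 5),
        datetime(2014, 5, 3, 2, 30, 0),
    ]
    peaks = [
        datetime(2014, 5, 1, 3, 12, 10),
        datetime(2014, 5, 3, 2, 30, 20),
    ]
    with_peak, without_peak = split_flaps_by_io_peak(flaps, peaks)
    print("flaps with a nearby I/O peak:   ", len(with_peak))
    print("flaps without a nearby I/O peak:", len(without_peak))
```

The flaps that end up in the second list are the interesting ones: they
would correspond to the occasions without a correlated iowait peak.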

I have two theories: either we're touching some sector on the disk
which is about to be marked as dead but is not yet shown in SMART
statistics, or (as I believe) this is some kind of XFS fatigue. The
latter seems more likely in this case, since a near-bad sector should
be touched more frequently and, in my experience, the related impact
could be expected to leave traces in dmesg/SMART. I would like to ask
if anyone has had a similar experience before or can suggest a way to
poke the existing file system. If no suggestions appear, I'll probably
reformat the disk and, if the problem remains after refill, replace
it, but I think less destructive actions can be tried first.

XFS is running on a 3.10 kernel with almost default creation and mount
options; the Ceph version is the latest cuttlefish (this rack should
be upgraded, I know).