On Wed, Oct 16, 2019 at 8:37 PM Kyrylo Shatskyy <kyrylo.shatskyy@xxxxxxxx> wrote:
>
> Here is the log which is supposed to have this issue:

Negative -- stepping back here. I brought this issue up for these runs:

http://pulpito.ceph.com/yuriw-2019-10-11_12:58:22-rbd-wip-yuri6-testing-2019-10-10-2057-mimic-distro-basic-smithi/
http://pulpito.ceph.com/yuriw-2019-10-11_19:41:35-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi/

In these runs, you can see the OSDs crashing because they run out of space on the OSD block devices. The log you are looking at is from a test where the OSDs, running at debug log level 20, most likely filled up "/var/log" -- but they didn't fill up their block devices.

In the runs above, when the OSDs crash, we start to see logs like:

2019-10-13T09:31:17.097 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:17.095566 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL)
2019-10-13T09:31:22.117 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:22.115362 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 1 full osd(s) (OSD_FULL)
2019-10-13T09:31:27.632 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:27.630483 7f189f9b9700 -1 log_channel(cluster) log [ERR] : Health check failed: mon b is very low on available space (MON_DISK_CRIT)
2019-10-13T09:31:28.672 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:28.670676 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL)
2019-10-13T09:31:32.980 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:32.978369 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: mons a,b,c are very low on available space (MON_DISK_CRIT)
2019-10-13T09:31:34.061 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:34.053935 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 2 full osd(s) (OSD_FULL)
2019-10-13T09:31:40.153 INFO:tasks.ceph.osd.0.smithi167.stderr:2019-10-13 09:31:40.151323 7fd400753700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full
2019-10-13T09:31:40.328 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:40.326090 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2019-10-13T09:31:40.494 INFO:tasks.ceph.osd.3.smithi167.stderr:2019-10-13 09:31:40.492169 7fda29d9c700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full
2019-10-13T09:31:40.900 INFO:tasks.ceph.osd.1.smithi167.stderr:2019-10-13 09:31:40.898979 7f10785bb700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:41.279 INFO:tasks.ceph.osd.4.smithi174.stderr:2019-10-13 09:31:41.277991 7f8651337700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:41.525 INFO:tasks.ceph.osd.2.smithi167.stderr:2019-10-13 09:31:41.523075 7f1df905a700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:42.493 INFO:tasks.ceph.osd.7.smithi174.stderr:2019-10-13 09:31:42.491132 7febd19f8700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:43.281 INFO:tasks.ceph.osd.5.smithi174.stderr:2019-10-13 09:31:43.279310 7f9c80f21700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:44.399 INFO:tasks.ceph.osd.6.smithi174.stderr:2019-10-13 09:31:44.398004 7f4c2bf61700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
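If you want to scan other runs for the same signature, counting the OSD_FULL and failsafe events in the archived teuthology.log is usually enough -- the run/job path below is just a placeholder, substitute a real one:

$ curl -s http://qa-proxy.ceph.com/teuthology/<run>/<job>/teuthology.log | grep -cE 'OSD_FULL|full status failsafe'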
And why do the OSDs get full? Teuthology is running with 41a13ec included ...

2019-10-13T07:26:11.215 INFO:root:teuthology version: 1.0.0-471e0e3

... as verified by ...

$ git log --oneline 471e0e30 | grep 41a13ec || echo ABSENT
41a13eca misc: use remote.sh instead of remote.run

While teuthology is actually formatting the NVMe devices, it never mounts them, since logs like the following are missing ...

2019-10-09T20:53:48.276 INFO:teuthology.orchestra.run.smithi136:> sudo mount -t xfs -o noatime /dev/vg_nvme/lv_4 /var/lib/ceph/osd/ceph-0

... and therefore the OSDs are filling up the 1TiB root partition.
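A quick way to confirm this on a live test node -- just a sketch, assuming osd.0's default data directory path -- is to ask which filesystem backs the OSD data directory. If the mount step never ran, both of these report the root device rather than an NVMe LV:

$ findmnt -T /var/lib/ceph/osd/ceph-0
$ df -h /var/lib/ceph/osd/ceph-0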
> http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log
>
> $ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log | grep -c "No space left"
> 90
>
> Commit 41a13ec was merged after this job had already executed:
> 0456e3e 2019-10-10 19:01 +0200 kshtsk M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh
> 41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run
>
> That is further proof that it is not the cause of the failure.
> So my question: why do we still consider it the only reason? Do we have any other ideas?
>
> Kyrylo Shatskyy
> --
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5
> 90409 Nuremberg
> Germany
>
> On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy@xxxxxxx> wrote:
>
> So Yuri,
>
> is it reproducible only for luminous, or have you seen this on master or any other branches?
>
> It's on all branches -- as of at least last week.
>
> Kyrylo Shatskyy
> --
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5
> 90409 Nuremberg
> Germany
>
> On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste@xxxxxxxxxx> wrote:
>
> Kyr,
>
> Here is how I did it:
>
> RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi
> CEPH_QA_MAIL="ceph-qa@xxxxxxx"; MACHINE_NAME=smithi;
> CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous
> teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 -R fail,dead,running,waiting
>
> To test the fix, add "-t wip-wwn-fix".
>
> On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy@xxxxxxx> wrote:
>
> So I ran a job on smithi against teuthology code which is supposed to have the "No space left on device" issue:
>
> http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-master-distro-basic-smithi/4416887/teuthology.log
>
> And it passed; it does not have this issue. Which exact suite reproduces the issue?
>
> Kyrylo Shatskyy
> --
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5
> 90409 Nuremberg
> Germany
>
> On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy@xxxxxxx> wrote:
>
> I hope Nathan's fix can do the trick; however, it does not cover the log referenced in the description of https://tracker.ceph.com/issues/42313, because the teuthology worker does not include that fix, which is supposed to be the cause of the "No space left on device" issue.
>
> I'm not quite sure what you mean here. I think one of these addresses your statement?
> 1) We were creating very small OSDs on the root device since the partitions weren't being mounted, and so these jobs actually filled them up as a consequence of that.
> 2) Most of the teuthology repo is pulled fresh from master on every run. The workers themselves require restarting to get updates, but that's pretty rare. (See https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82)
>
> Can someone give a one-job teuthology-suite command that reproduces the issue 100% of the time?
>
> Kyrylo Shatskyy
> --
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5
> 90409 Nuremberg
> Germany
>
> On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler@xxxxxxxx> wrote:
>
> On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote:
>
> On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa@xxxxxxxxxx> wrote:
>
> Yuri just reminded me that he's seeing this problem on the mimic branch.
> Does that mean this PR just needs to be backported to all branches?
> https://github.com/ceph/ceph/pull/30792
>
> I'd be surprised if that one (changing iteritems() to items()) could cause this; it's not a fix for any known bugs, just ongoing py3 work.
>
> When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313, it sounds like that teuthology commit accidentally fixed a bug, which triggered another bug that we're not sure how to resolve -- but perhaps I'm misunderstanding?
>
> I think I understand what's going on. Here's an interim fix:
> https://github.com/ceph/teuthology/pull/1334
>
> Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness (see https://tracker.ceph.com/issues/14855).
>
> Nathan
>
> --
> Jason

--
Jason
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx