We are back in business, thank you all!

On Thu, Oct 17, 2019 at 6:28 AM David Galloway <dgallowa@xxxxxxxxxx> wrote:
>
> On 10/16/19 9:07 PM, Jason Dillaman wrote:
> > On Wed, Oct 16, 2019 at 8:37 PM Kyrylo Shatskyy
> > <kyrylo.shatskyy@xxxxxxxx> wrote:
> >>
> >> Here is the log which is supposed to have this issue:
> >
> > Negative -- stepping back here. I brought this issue up for these runs:
> >
> > http://pulpito.ceph.com/yuriw-2019-10-11_12:58:22-rbd-wip-yuri6-testing-2019-10-10-2057-mimic-distro-basic-smithi/
> > http://pulpito.ceph.com/yuriw-2019-10-11_19:41:35-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi/
> >
> > In these runs, you can see the OSDs crashing because they run out of
> > space on the OSD block devices. The log you are looking at is a test
> > where the OSDs at debug log level 20 most likely actually filled up
> > "/var/log" -- but they didn't fill up their block devices.
> >
> > In the runs above, when the OSDs crash, we start to see logs like:
> >
> > 2019-10-13T09:31:17.097 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:17.095566 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL)
> > 2019-10-13T09:31:22.117 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:22.115362 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 1 full osd(s) (OSD_FULL)
> > 2019-10-13T09:31:27.632 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:27.630483 7f189f9b9700 -1 log_channel(cluster) log [ERR] : Health check failed: mon b is very low on available space (MON_DISK_CRIT)
> > 2019-10-13T09:31:28.672 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:28.670676 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL)
> > 2019-10-13T09:31:32.980 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:32.978369 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: mons a,b,c are very low on available space (MON_DISK_CRIT)
> > 2019-10-13T09:31:34.061 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:34.053935 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 2 full osd(s) (OSD_FULL)
> > 2019-10-13T09:31:40.153 INFO:tasks.ceph.osd.0.smithi167.stderr:2019-10-13 09:31:40.151323 7fd400753700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full
> > 2019-10-13T09:31:40.328 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:40.326090 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
> > 2019-10-13T09:31:40.494 INFO:tasks.ceph.osd.3.smithi167.stderr:2019-10-13 09:31:40.492169 7fda29d9c700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full
> > 2019-10-13T09:31:40.900 INFO:tasks.ceph.osd.1.smithi167.stderr:2019-10-13 09:31:40.898979 7f10785bb700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
> > 2019-10-13T09:31:41.279 INFO:tasks.ceph.osd.4.smithi174.stderr:2019-10-13 09:31:41.277991 7f8651337700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
> > 2019-10-13T09:31:41.525 INFO:tasks.ceph.osd.2.smithi167.stderr:2019-10-13 09:31:41.523075 7f1df905a700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
> > 2019-10-13T09:31:42.493 INFO:tasks.ceph.osd.7.smithi174.stderr:2019-10-13 09:31:42.491132 7febd19f8700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
> > 2019-10-13T09:31:43.281 INFO:tasks.ceph.osd.5.smithi174.stderr:2019-10-13 09:31:43.279310 7f9c80f21700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
> > 2019-10-13T09:31:44.399 INFO:tasks.ceph.osd.6.smithi174.stderr:2019-10-13 09:31:44.398004 7f4c2bf61700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
> >
> > And why do the OSDs get full? Teuthology is running w/ 41a13ec included ...
> >
> > 2019-10-13T07:26:11.215 INFO:root:teuthology version: 1.0.0-471e0e3
> >
> > ... as verified by ...
> >
> > $ git log --oneline 471e0e30 | grep 41a13ec || echo ABSENT
> > 41a13eca misc: use remote.sh instead of remote.run
> >
> > While teuthology is actually formatting the NVMe devices, it never
> > actually mounts them, since logs like the following are missing ...
> >
> > 2019-10-09T20:53:48.276 INFO:teuthology.orchestra.run.smithi136:> sudo mount -t xfs -o noatime /dev/vg_nvme/lv_4 /var/lib/ceph/osd/ceph-0
> >
> > ... and therefore the OSDs are filling up the 1TiB root partition.
>
> Almost perfectly described. lv_5 gets mounted at /var/lib/ceph by
> ceph-cm-ansible, and the other 4 LVs on each NVMe are *supposed* to get
> mounted at a /var/lib/ceph/osd/$id directory. So the jobs were filling
> up the 15GB lv_5 partition if enough data was written, because the OSDs
> were raw directories on lv_5 instead of mountpoints on the other LVs.
>
> Thank you much for chiming in and helping clarify the issue(s).
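A minimal sketch of the check David describes -- the /var/lib/ceph/osd/ceph-$id paths come from the missing mount line above, but the OSD id range per node is an assumption, not taken from an actual run:

  # Assumed id range for one smithi node; adjust per host.
  # mountpoint(1) exits non-zero when the directory is not a mountpoint,
  # i.e. the OSD data is landing on the 15GB lv_5 filesystem instead of
  # its own LV.
  for id in 0 1 2 3; do
      d=/var/lib/ceph/osd/ceph-$id
      mountpoint -q "$d" && echo "$d: mountpoint" || echo "$d: raw dir on lv_5"
  done
  df -h /var/lib/ceph    # shows how full lv_5 actually is

On a healthy node, each ceph-$id directory should show up as a mountpoint backed by one of the vg_nvme LVs.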
>
> >
> >> http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log
> >>
> >> $ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log | grep -c "No space left"
> >> 90
> >>
> >> 41a13ec was merged only after this job had already executed:
> >> 0456e3e 2019-10-10 19:01 +0200 kshtsk M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh
> >> 41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run
> >>
> >> This is further proof that it is not the cause of the failure.
> >> So my question: why do we still consider it the only reason? Do we have any other ideas?
> >>
> >> Kyrylo Shatskyy
> >> --
> >> SUSE Software Solutions Germany GmbH
> >> Maxfeldstr. 5
> >> 90409 Nuremberg
> >> Germany
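Kyrylo's single-job check can be extended to a whole run; a minimal sketch, using the run name from his example, and assuming the qa-proxy run directory serves a plain browsable listing (an assumption, not something verified in this thread):

  run=yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi
  base=http://qa-proxy.ceph.com/teuthology/$run
  # Pull the job ids out of the (assumed) directory listing, then count
  # "No space left" occurrences in each job's teuthology.log.
  for job in $(curl -s "$base/" | grep -oE 'href="[0-9]+/"' | tr -dc '0-9\n'); do
      n=$(curl -s "$base/$job/teuthology.log" | grep -c "No space left")
      [ "$n" -gt 0 ] && echo "$job: $n matches"
  done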
> >>
> >> On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy@xxxxxxx> wrote:
> >>
> >> So Yuri,
> >>
> >> is it reproducible only on luminous, or have you seen this on master or any other branches?
> >>
> >> It's on all branches -- as of at least last week.
> >>
> >> On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste@xxxxxxxxxx> wrote:
> >>
> >> Kyr
> >>
> >> Here is how I did it:
> >>
> >> RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi
> >> CEPH_QA_MAIL="ceph-qa@xxxxxxx"; MACHINE_NAME=smithi;
> >> CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous
> >> teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN
> >> --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo
> >> https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70
> >> -R fail,dead,running,waiting
> >>
> >> To test the fix, add "-t wip-wwn-fix".
> >>
> >> On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy@xxxxxxx> wrote:
> >>
> >> So I ran a job on smithi against the teuthology code which is supposed to have the "No space left on device" issue:
> >>
> >> http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-master-distro-basic-smithi/4416887/teuthology.log
> >>
> >> It passed and does not have this issue. Which exact suite reproduces the issue?
> >>
> >> On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy@xxxxxxx> wrote:
> >>
> >> I hope Nathan's fix can do the trick; however, it does not cover the log referenced in the description of https://tracker.ceph.com/issues/42313, because the teuthology worker does not include that fix, which is supposed to be the cause of the "No space left on device" issue.
> >>
> >> I'm not quite sure what you mean here. I think one of these addresses
> >> your statement?
> >> 1) we were creating very small OSDs on the root device since the
> >> partitions weren't being mounted, and so these jobs actually filled
> >> them up as a consequence of that.
> >> 2) most of the teuthology repo is pulled fresh from master on every
> >> run. The workers themselves require restarting to get updates, but
> >> that's pretty rare. (See
> >> https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82)
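Jason's verification earlier in the thread can be scripted per run; a minimal sketch, where the sed pattern is an assumption derived from the "teuthology version: 1.0.0-471e0e3" line he quoted, and the git step assumes you are inside a teuthology checkout:

  log=http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log
  # Extract the short hash from the "teuthology version: X.Y.Z-<sha>" line.
  sha=$(curl -s "$log" | sed -n 's/.*teuthology version: [0-9.]*-\([0-9a-f]*\).*/\1/p' | head -1)
  echo "run used teuthology $sha"
  # From inside a teuthology checkout: was 41a13ec included in that build?
  git log --oneline "$sha" | grep 41a13ec || echo ABSENT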
> >>
> >> Can someone give a one-job teuthology-suite command that reproduces the issue 100% of the time?
> >>
> >> On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler@xxxxxxxx> wrote:
> >>
> >> On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote:
> >>
> >> On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa@xxxxxxxxxx> wrote:
> >>
> >> Yuri just reminded me that he's seeing this problem on the mimic branch.
> >>
> >> Does that mean this PR just needs to be backported to all branches?
> >>
> >> https://github.com/ceph/ceph/pull/30792
> >>
> >> I'd be surprised if that one (changing iteritems() to items()) could
> >> cause this, and it's not a fix for any known bugs, just ongoing py3
> >> work.
> >>
> >> When I said "that commit" I was referring to
> >> https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b,
> >> which is in the teuthology repo and thus hits every test run. Looking
> >> at the comments across https://github.com/ceph/teuthology/pull/1332
> >> and https://tracker.ceph.com/issues/42313, it sounds like that
> >> teuthology commit accidentally fixed a bug which triggered another bug
> >> that we're not sure how to resolve, but perhaps I'm misunderstanding?
> >>
> >> I think I understand what's going on. Here's an interim fix:
> >> https://github.com/ceph/teuthology/pull/1334
> >>
> >> Assuming this PR really does fix the issue, the "real" fix will be to drop
> >> get_wwn_id_map altogether, since it has long outlived its usefulness (see
> >> https://tracker.ceph.com/issues/14855).
> >>
> >> Nathan
> >>
> >> --
> >> Jason

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx