Re: "No space left on device" errors

Here is the log that is supposed to show this issue:


$ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log | grep -c "No space left"
90

Commit 41a13ec was merged after this job had already executed:
0456e3e 2019-10-10 19:01 +0200 kshtsk              M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh
41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy        │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run
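
For anyone who wants to double-check the timing independently of the graph above, here is a quick sketch (assuming a local clone of ceph/teuthology) that compares the commit dates with the run timestamp embedded in the run name (yuriw-2019-10-09_15:42:09):

$ git -C teuthology log -1 --date=iso --format='%h %cd %s' 41a13ec
$ git -C teuthology log -1 --date=iso --format='%h %cd %s' 0456e3e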

This is further proof that it is not the cause of the failure.
So my question is: why do we still consider it the only possible cause? Do we have any other ideas?

Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy@xxxxxxx> wrote:

So Yuri,

Is it reproducible only on luminous, or have you seen it on master or any other branches?

It's on all branches -- as of at least last week.


Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste@xxxxxxxxxx> wrote:

Kyr

Here is how I did it:

RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi
CEPH_QA_MAIL="ceph-qa@xxxxxxx"
MACHINE_NAME=smithi
CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous

teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN \
    --suite-repo https://github.com/ceph/ceph-ci.git \
    --ceph-repo https://github.com/ceph/ceph-ci.git \
    --suite-branch $CEPH_BRANCH -p 70 \
    -R fail,dead,running,waiting

To test the fix, add "-t wip-wwn-fix" to the command above.
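
Once the rerun completes, grepping one of its job logs for the error string should show whether the fix helped; this is just a sketch, and JOB_ID is a placeholder for a real job id from the rerun:

RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi
JOB_ID=<job id from the rerun>
curl -s "http://qa-proxy.ceph.com/teuthology/$RERUN/$JOB_ID/teuthology.log" | grep -c "No space left"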

On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy@xxxxxxx> wrote:

So I ran a job on smithi against the teuthology code that is supposed to cause "No space left on device":

http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-master-distro-basic-smithi/4416887/teuthology.log

And it passed; it does not show this issue. Which exact suite reproduces the issue?

Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy@xxxxxxx> wrote:


I hope Nathan's fix will do the trick; however, it does not cover the log referenced in the description of https://tracker.ceph.com/issues/42313, because the teuthology worker does not include the fix that is supposed to be the cause of the "No space left on device" issue.


I'm not quite sure what you mean here. I think one of these addresses
your statement?
1) We were creating very small OSDs on the root device, since the
partitions weren't being mounted, and so these jobs actually filled
them up as a consequence of that (a quick check for this is sketched below).
2) Most of the teuthology repo is pulled fresh from master on every
run. The workers themselves require restarting to get updates, but
that's pretty rare. (See
https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82)
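
A quick way to check point (1) on a test node is to compare the block devices against what is actually mounted; this is only a diagnostic sketch, device names vary per machine, and /var/lib/ceph as the OSD data path is an assumption:

# are the scratch partitions mounted anywhere?
lsblk -o NAME,SIZE,MOUNTPOINT
# how full is the root device (and the OSD data path, if it exists)?
df -h / /var/lib/ceph 2>/dev/null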



Can someone give a one-job teuthology-suite command that reproduces the issue 100% of the time?

Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler@xxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote:

On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa@xxxxxxxxxx> wrote:


Yuri just reminded me that he's seeing this problem on the mimic branch.

Does that mean this PR just needs to be backported to all branches?

https://github.com/ceph/ceph/pull/30792


I'd be surprised if that one (changing iteritems() to items()) could
cause this, and it's not a fix for any known bugs, just ongoing py3
work.

When I said "that commit" I was referring to
https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b,
which is in the teuthology repo and thus hits every test run. Looking
at the comments across https://github.com/ceph/teuthology/pull/1332
and https://tracker.ceph.com/issues/42313 it sounds like that
teuthology commit accidentally fixed a bug which triggered another bug
that we're not sure how to resolve, but perhaps I'm misunderstanding?
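
One way to tell whether a given worker's teuthology checkout already contains that commit (a sketch; the checkout path is an assumption) is an ancestry check:

git -C /path/to/teuthology merge-base --is-ancestor 41a13eca480e38cfeeba7a180b4516b90598c39b HEAD \
    && echo "commit is in this checkout" \
    || echo "commit is not in this checkout"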


I think I understand what's going on. Here's an interim fix:
https://github.com/ceph/teuthology/pull/1334

Assuming this PR really does fix the issue, the "real" fix will be to drop
get_wwn_id_map altogether, since it has long outlived its usefulness (see
https://tracker.ceph.com/issues/14855).
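
For context (and assuming, from its name, that get_wwn_id_map resolves scratch device names to the stable wwn-* symlinks), the mapping it computes can be inspected directly on any test node with:

ls -l /dev/disk/by-id/ | grep wwn-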

Nathan




-- 
Jason

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
