Re: "No space left on device" errors

Here is the log that is supposed to show this issue:


$ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5-testing-2019-10-08-2016-luminous-distro-basic-smithi/4371741/teuthology.log | grep -c "No space left"
90

Commit 41a13ec was merged after this job had already executed:
0456e3e 2019-10-10 19:01 +0200 kshtsk              M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh
41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy        │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run
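
For anyone who wants to double-check the timing independently of the graph above, here is a quick sketch (assuming a local clone of ceph/teuthology) that compares the commit dates with the run timestamp embedded in the run name (yuriw-2019-10-09_15:42:09):

$ git -C teuthology log -1 --date=iso --format='%h %cd %s' 41a13ec
$ git -C teuthology log -1 --date=iso --format='%h %cd %s' 0456e3e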

This is further proof that it is not the cause of the failure.
So my question is: why do we still consider it the only possible cause? Do we have any other ideas?

Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy@xxxxxxx> wrote:

So Yuri,

Is it reproducible only on luminous, or have you seen it on master or any other branches?

It's on all branches -- as of at least last week.


Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste@xxxxxxxxxx> wrote:

Kyr

Here is how I did it:

RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi
CEPH_QA_MAIL="ceph-qa@xxxxxxx"
MACHINE_NAME=smithi
CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous

teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN \
    --suite-repo https://github.com/ceph/ceph-ci.git \
    --ceph-repo https://github.com/ceph/ceph-ci.git \
    --suite-branch $CEPH_BRANCH -p 70 \
    -R fail,dead,running,waiting

To test the fix, add "-t wip-wwn-fix" to the command above.
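
Once the rerun completes, grepping one of its job logs for the error string should show whether the fix helped; this is just a sketch, and JOB_ID is a placeholder for a real job id from the rerun:

RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi
JOB_ID=<job id from the rerun>
curl -s "http://qa-proxy.ceph.com/teuthology/$RERUN/$JOB_ID/teuthology.log" | grep -c "No space left"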

On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy@xxxxxxx> wrote:

So I ran a job on smithi against the teuthology code that is supposed to cause "No space left on device":

http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-master-distro-basic-smithi/4416887/teuthology.log

And it passed; it does not show this issue. Which exact suite reproduces the issue?

Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy@xxxxxxx> wrote:


I hope Nathan's fix will do the trick; however, it does not cover the log referenced in the description of https://tracker.ceph.com/issues/42313, because the teuthology worker does not include the fix that is supposed to be the cause of the "No space left on device" issue.


I'm not quite sure what you mean here. I think one of these addresses
your statement?
1) We were creating very small OSDs on the root device, since the
partitions weren't being mounted, and so these jobs actually filled
them up as a consequence of that (a quick check for this is sketched below).
2) Most of the teuthology repo is pulled fresh from master on every
run. The workers themselves require restarting to get updates, but
that's pretty rare. (See
https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82)
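
A quick way to check point (1) on a test node is to compare the block devices against what is actually mounted; this is only a diagnostic sketch, device names vary per machine, and /var/lib/ceph as the OSD data path is an assumption:

# are the scratch partitions mounted anywhere?
lsblk -o NAME,SIZE,MOUNTPOINT
# how full is the root device (and the OSD data path, if it exists)?
df -h / /var/lib/ceph 2>/dev/null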



Can someone give a one-job teuthology-suite command that reproduces the issue 100% of the time?

Kyrylo Shatskyy
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nuremberg
Germany


On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler@xxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote:

On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa@xxxxxxxxxx> wrote:


Yuri just reminded me that he's seeing this problem on the mimic branch.

Does that mean this PR just needs to be backported to all branches?

https://github.com/ceph/ceph/pull/30792


I'd be surprised if that one (changing iteritems() to items()) could
cause this, and it's not a fix for any known bugs, just ongoing py3
work.

When I said "that commit" I was referring to
https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b,
which is in the teuthology repo and thus hits every test run. Looking
at the comments across https://github.com/ceph/teuthology/pull/1332
and https://tracker.ceph.com/issues/42313 it sounds like that
teuthology commit accidentally fixed a bug which triggered another bug
that we're not sure how to resolve, but perhaps I'm misunderstanding?
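
One way to tell whether a given worker's teuthology checkout already contains that commit (a sketch; the checkout path is an assumption) is an ancestry check:

git -C /path/to/teuthology merge-base --is-ancestor 41a13eca480e38cfeeba7a180b4516b90598c39b HEAD \
    && echo "commit is in this checkout" \
    || echo "commit is not in this checkout"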


I think I understand what's going on. Here's an interim fix:
https://github.com/ceph/teuthology/pull/1334

Assuming this PR really does fix the issue, the "real" fix will be to drop
get_wwn_id_map altogether, since it has long outlived its usefulness (see
https://tracker.ceph.com/issues/14855).
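
For context (and assuming, from its name, that get_wwn_id_map resolves scratch device names to the stable wwn-* symlinks), the mapping it computes can be inspected directly on any test node with:

ls -l /dev/disk/by-id/ | grep wwn-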

Nathan




-- 
Jason

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
