infinite osd operation RETRY loop

I was going to write this e-mail asking about any [unwritten] rules
for sharing bufferlist instances across librados API invocations after
seeing some non-intuitive behavior, but some simple changes to the
test make it seem more and more like a bug. I've replicated this on
the latest master, and a commit back closer to when mimic branched
(circa May).

The test below calls the `hello.say_hello` objclass method. For the
invocations here this is effectively an echo test: it directly writes
the input back into the output. It's called twice, and the `outbl`
that receives the response from `ioctx::operate` is shared across the
calls.

  ceph::bufferlist outbl;

  {
    char bytes[24];
    ceph::bufferlist bl;
    bl.append(bytes, sizeof(bytes));
    librados::ObjectReadOperation op;
    op.exec("hello", "say_hello", bl);
    ret = ioctx.operate("obj", &op, &outbl);
    ASSERT_EQ(ret, 0);
  }

  ASSERT_TRUE(outbl.length() > 0);

  {
    char bytes[36];
    ceph::bufferlist bl;
    bl.append(bytes, sizeof(bytes));
    librados::ObjectReadOperation op;
    op.exec("hello", "say_hello", bl);
    ret = ioctx.operate("obj", &op, &outbl);
    ASSERT_EQ(ret, 0);
  }

When I run this test, the second invocation of the objclass method
never returns on the client: the OSD and client retry the operation an
unbounded number of times. Here is a log entry from the OSD with
RETRY=189:

2018-12-23 17:52:23.646 7fa673edd700 15 osd.0 pg_epoch: 7 pg[1.1( v
7'1 (0'0,7'1] local-lis/les=6/7 n=1 ec=6/6 lis/c 6/6 les/c/f 7/7/0
6/6/6) [0] r=0 lpr=6 crt=7'1 lcod 0'0 mlcod 0'0 active+clean]
log_op_stats osd_op(client.4125.0:4 1.1 1:847a3d55:::obj:head [call
hello.say_hello] snapc 0=[] RETRY=189
ondisk+retry+read+known_if_redirected e7) v8 inb 0 outb 44 lat
0.000871
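
For anyone trying to spot the same loop in their own OSD log, here is a
rough sketch of a helper that pulls the RETRY counter out of a line
like the one above. The function name `parse_retry_count` is made up
for illustration and is not part of Ceph; it only assumes the
`RETRY=<number>` token seen in this log.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Ceph API): returns the number following
// "RETRY=" in a log line, or -1 if the token is absent or has no digits.
long parse_retry_count(const std::string& line) {
  const std::string key = "RETRY=";
  const std::size_t pos = line.find(key);
  if (pos == std::string::npos) {
    return -1;  // no RETRY token on this line
  }
  std::size_t i = pos + key.size();
  if (i >= line.size() || !std::isdigit(static_cast<unsigned char>(line[i]))) {
    return -1;  // "RETRY=" present but not followed by a digit
  }
  long value = 0;
  for (; i < line.size() && std::isdigit(static_cast<unsigned char>(line[i])); ++i) {
    value = value * 10 + (line[i] - '0');
  }
  return value;
}
```

A counter that keeps climbing for the same `client.N.M:T` operation id
across successive lines is the symptom described here.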

And the full log is here:
https://paste.fedoraproject.org/paste/zYkKNU1iAB7Mt3thpjKM7w

If one clears outbl before the second invocation, that is:

  ASSERT_TRUE(outbl.length() > 0);
  outbl.clear();

Then the entire test succeeds. That seemed like non-intuitive
behavior, but maybe it's just not documented that `ioctx::operate`
assumes an empty bufferlist. However, the following change _does not_
trigger the error:

  ASSERT_TRUE(outbl.length() > 0);
  outbl.clear();
  outbl.append("foo", 3);

This suggests that a non-empty bufferlist is OK (at least it doesn't
cause the infinite RETRY loop), and that there is some lingering state
somewhere.
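
Until the cause is understood, the clear-before-reuse workaround could
be wrapped up roughly as below. This is only a sketch: the name
`operate_with_cleared_output` is invented here, and the template
parameters stand in for `librados::IoCtx`,
`librados::ObjectReadOperation` and `ceph::bufferlist` so the snippet
compiles without the Ceph headers. It papers over the symptom and says
nothing about where the lingering state lives.

```cpp
#include <string>

// Hypothetical wrapper (not part of librados): empties the caller's
// output buffer before each call so nothing left over from an earlier
// reply is handed back to operate().
template <typename IoCtx, typename ReadOp, typename BufferList>
int operate_with_cleared_output(IoCtx& ioctx, const std::string& oid,
                                ReadOp* op, BufferList* outbl) {
  outbl->clear();                        // drop contents of any previous reply
  return ioctx.operate(oid, op, outbl);  // same call shape as in the test
}
```

In the test above, each `ioctx.operate("obj", &op, &outbl)` call would
become `operate_with_cleared_output(ioctx, "obj", &op, &outbl)`.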

- Noah


