Re: Throttle.cc: 194: FAILED assert(c >= 0) on snap rm or create in Ceph 0.87

To try to answer your questions: I just ran a for loop to list the
snapshots on every volume, and that finishes successfully. There are a
few other snapshots on volumes in the pool, but the usage pattern is
generally that a snapshot is created to provide a point-in-time copy
of the volume, the snapshot is copied elsewhere, and once the final
copy of the backup has been written, the snapshot is removed. So there
are about 6,000 volumes and not many snapshots, but snapshots get
created and deleted regularly.
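
For reference, that check was just a shell loop over the volumes; the
equivalent walk through the librbd C++ API would look roughly like the
sketch below (the pool name "rbd" is only a placeholder, and error
handling is minimal):

// Sketch: list the snapshots on every volume in a pool via librbd.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);               // default client identity
  cluster.conf_read_file(NULL);     // default ceph.conf search path
  if (cluster.connect() < 0)
    return 1;

  librados::IoCtx io_ctx;
  if (cluster.ioctx_create("rbd", io_ctx) < 0)   // placeholder pool name
    return 1;

  librbd::RBD rbd;
  std::vector<std::string> names;
  if (rbd.list(io_ctx, names) < 0)  // all volumes in the pool
    return 1;

  for (const auto &name : names) {
    librbd::Image image;
    if (rbd.open(io_ctx, image, name.c_str()) < 0)
      continue;
    std::vector<librbd::snap_info_t> snaps;
    image.snap_list(snaps);         // list this volume's snapshots
    std::cout << name << ": " << snaps.size() << " snapshot(s)" << std::endl;
    // the Image destructor closes the image at the end of each iteration
  }
  return 0;
}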

In general, the pool seems to be functioning. I can't say with
confidence that there aren't other unexplained errors or assertion
failures, similar to this one, happening with specific volumes or
snapshots, but I'm not aware of any other cases right now.

On Mon, Oct 29, 2018 at 5:26 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> [ Re-send to make it through the vger filters; sorry! ]
>
> Hmm, yeah. The ticket and failure mode make me wonder if something
> has gotten so strange with this image that the notify's bufferlist
> actually exceeded a reasonable size, but I don't really see a
> mechanism for that.
> What snapshots exist on the pool? Can you successfully examine it in
> other ways with the rbd image manipulation tools?
>
> On Fri, Oct 26, 2018 at 3:48 PM Simon Ruggier <simon@xxxxxxxxxxx> wrote:
> >
> > First of all, thanks for your reply.
> >
> > Yeah, this is happening within the process executing the rbd command.
> > Sorry I didn't include the backtrace in my original email; I
> > completely forgot after putting together the rest of it.
> >
> > I set "debug objecter = 20" in the local ceph config file on the
> > system I ran these commands on, then ran rbd snap create, snap ls, and
> > snap rm, so you could look at debug output from any of those
> > three. I saved the entire session, anonymized all names in the output,
> > and compressed it. See attached. If you need any other information,
> > let me know and I'll collect it when I'm able to.
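> >
> > For completeness, the same session done through the librbd C++ API rather
> > than the rbd CLI would look roughly like the sketch below; the pool, image,
> > and snapshot names are placeholders, and "debug objecter" is raised on the
> > client before connecting, which is what produces the attached log:
> >
> > // Rough librbd equivalent of the CLI reproduction (names are placeholders).
> > #include <rados/librados.hpp>
> > #include <rbd/librbd.hpp>
> > #include <vector>
> >
> > int main() {
> >   librados::Rados cluster;
> >   cluster.init(NULL);
> >   cluster.conf_read_file(NULL);
> >   cluster.conf_set("debug objecter", "20");  // same setting as in ceph.conf
> >   if (cluster.connect() < 0)
> >     return 1;
> >
> >   librados::IoCtx io_ctx;
> >   if (cluster.ioctx_create("rbd", io_ctx) < 0)      // placeholder pool
> >     return 1;
> >
> >   librbd::RBD rbd;
> >   librbd::Image image;
> >   if (rbd.open(io_ctx, image, "affected-volume") < 0)  // placeholder image
> >     return 1;
> >
> >   // On the affected image the reported assert fires inside snap_create()
> >   // (and snap_remove()), aborting the process even though the snapshot
> >   // change itself goes through; that is why each step was run as a
> >   // separate rbd invocation in practice.
> >   image.snap_create("backup-snap");          // rbd snap create
> >   std::vector<librbd::snap_info_t> snaps;
> >   image.snap_list(snaps);                    // rbd snap ls
> >   image.snap_remove("backup-snap");          // rbd snap rm
> >   return 0;
> > }
> >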
> > On Fri, Oct 26, 2018 at 5:19 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > This is happening on the client side? Can you provide the full
> > > backtrace and a log with "debug objecter = 20" turned on?
> > >
> > > On Sun, Oct 21, 2018 at 11:25 AM Simon Ruggier <simon@xxxxxxxxxxx> wrote:
> > > >
> > > > Hi, I'm writing about a problem I'm seeing in a Ceph 0.87 cluster
> > > > where rbd snap create, rm, etc. succeed, but then abort with a
> > > > non-zero return code because the notify call at the very end of the
> > > > function (https://github.com/ceph/ceph/blob/v0.87/src/librbd/internal.cc#L468)
> > > > hits an assertion failure (Throttle.cc: 194: FAILED assert(c >= 0)).
> > > >
> > > > I did a bit of digging, and found that c is calculated in
> > > > calc_op_budget (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.cc#L2453-L2471),
> > > > which is called in Objecter::_take_op_budget
> > > > (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.h#L1597-L1608),
> > > > but could hypothetically be called again in Objecter::_throttle_op
> > > > (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.cc#L2473-L2491),
> > > > if the first calculation returned 0. From diving into the rd.notify
> > > > call in IoCtxImpl::notify
> > > > (https://github.com/ceph/ceph/blob/v0.87/src/librados/IoCtxImpl.cc#L1117),
> > > > I can see that the call adds an op of type CEPH_OSD_OP_NOTIFY
> > > > (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.h#L865),
> > > > which is defined at
> > > > https://github.com/ceph/ceph/blob/v0.87/src/include/rados.h#L185. From
> > > > that, we know that it's the code path at
> > > > https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.cc#L2463-L2464
> > > > that will be taken while calculating the budget, but from there I
> > > > can't tell where or why there would be extents set on a notify
> > > > operation. I'm not familiar with the Ceph codebase, so this is the
> > > > point where I figured I should ask for advice from someone who
> > > > actually understands this code.
> > > >
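> > > > To make the shape of the problem concrete, here is a small standalone toy
> > > > (not the Ceph code; every name in it is made up, and it assumes the budget
> > > > is tracked as a plain int) that mirrors the logic above: 64-bit extent
> > > > lengths are summed into an int budget, which is then handed to a
> > > > throttle-style assert(c >= 0). If a notify op somehow carries a huge or
> > > > garbage extent length, the narrowed budget can come out negative and the
> > > > assert fires:
> > > >
> > > > // Illustrative toy only: it mirrors the budget logic described above,
> > > > // not the actual Objecter/Throttle code.
> > > > #include <cassert>
> > > > #include <cstdint>
> > > > #include <iostream>
> > > > #include <vector>
> > > >
> > > > struct FakeOp {
> > > >   bool read_path;          // notify is budgeted via the read/extent branch
> > > >   uint64_t extent_length;  // 64-bit length, as carried in the op
> > > > };
> > > >
> > > > // calc_op_budget-style sum: 64-bit lengths accumulated into an int.
> > > > static int calc_budget(const std::vector<FakeOp> &ops) {
> > > >   int budget = 0;
> > > >   for (const auto &op : ops)
> > > >     if (op.read_path)
> > > >       budget += static_cast<int>(op.extent_length);  // silent narrowing
> > > >   return budget;
> > > > }
> > > >
> > > > // Throttle-style entry check; the guard the rbd client trips.
> > > > static void throttle_take(int64_t c) {
> > > >   assert(c >= 0);  // cf. "Throttle.cc: 194: FAILED assert(c >= 0)"
> > > > }
> > > >
> > > > int main() {
> > > >   // A notify "shouldn't" carry an extent at all, but if one shows up
> > > >   // with a huge length, the int budget can wrap negative.
> > > >   std::vector<FakeOp> ops{{true, 0xF0000000ULL}};
> > > >   int budget = calc_budget(ops);
> > > >   std::cout << "budget = " << budget << std::endl;  // negative here
> > > >   throttle_take(budget);                            // aborts, like rbd does
> > > >   return 0;
> > > > }
> > > >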
> > > > I also noticed the possibly related issue #9592
> > > > (http://tracker.ceph.com/issues/9592), but I'm not sure it's the
> > > > same issue; the reproduction process looks quite different.
> > > >
> > > > I'm not expecting any bug fixes for such an old version of Ceph, but
> > > > I'd appreciate help just understanding what's different about this
> > > > particular volume and how to clean it up by hand. In the unlikely
> > > > event that this is also a problem in the current development version
> > > > of Ceph, perhaps this can be considered a bug report.


