[ Re-send to make it through the vger filters; sorry! ]

Hmm, yeah. The ticket and failure mode make me wonder if something has
gotten so strange with this image that the notify's bufferlist actually
exceeded a reasonable size, but I don't really see a mechanism for that.
What snapshots exist on the pool? Can you successfully examine it in
other ways with the rbd image manipulation tools?

On Fri, Oct 26, 2018 at 3:48 PM Simon Ruggier <simon@xxxxxxxxxxx> wrote:
>
> First of all, thanks for your reply.
>
> Yeah, this is happening within the process executing the rbd command.
> Sorry I didn't include the backtrace in my original email, I
> completely forgot after putting together the rest of it.
>
> I set "debug objecter = 20" in the local ceph config file on the
> system I ran these commands on, then ran rbd snap create, snap ls, and
> snap rm, so you could look at debug output from any of those
> three. I saved the entire session, anonymized all names in the output,
> and compressed it. See attached. If you need any other information,
> let me know and I'll collect it when I'm able to.
> On Fri, Oct 26, 2018 at 5:19 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > This is happening on the client side? Can you provide the full
> > backtrace and a log with "debug objecter = 20" turned on?
> >
> > On Sun, Oct 21, 2018 at 11:25 AM Simon Ruggier <simon@xxxxxxxxxxx> wrote:
> > >
> > > Hi, I'm writing about a problem I'm seeing in a Ceph 0.87 cluster
> > > where rbd snap create, rm, etc. are succeeding, but aborting with a
> > > non-zero return code because the notify call at the very end of the
> > > function (https://github.com/ceph/ceph/blob/v0.87/src/librbd/internal.cc#L468)
> > > is hitting an assertion failure (Throttle.cc: 194: FAILED assert(c >= 0)).
> > >
> > > I did a bit of digging, and found that c is calculated in
> > > calc_op_budget (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.cc#L2453-L2471),
> > > which is called in Objecter::_take_op_budget
> > > (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.h#L1597-L1608),
> > > but could hypothetically be called again in Objecter::_throttle_op
> > > (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.cc#L2473-L2491),
> > > if the first calculation returned 0. From diving into the rd.notify
> > > call in IoCtxImpl.notify
> > > (https://github.com/ceph/ceph/blob/v0.87/src/librados/IoCtxImpl.cc#L1117),
> > > I can see that the call adds an op of type CEPH_OSD_OP_NOTIFY
> > > (https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.h#L865),
> > > which is defined at
> > > https://github.com/ceph/ceph/blob/v0.87/src/include/rados.h#L185. From
> > > that, we know that it's the code path at
> > > https://github.com/ceph/ceph/blob/v0.87/src/osdc/Objecter.cc#L2463-L2464
> > > that will be taken while calculating the budget, but from there I
> > > can't tell where or why there would be extents set on a notify
> > > operation. I'm not familiar with the Ceph codebase, so that's the
> > > point where I figured I should ask for some advice about this from
> > > someone who actually understands this stuff.
> > >
> > > I also noticed the possibly related issue #9592
> > > (http://tracker.ceph.com/issues/9592), but I'm not totally sure if
> > > it's the same issue, it looks like a pretty different reproduction
> > > process.
> > >
> > > I'm not expecting any bugfixes for such an old version of Ceph, but
> > > I'd appreciate help just understanding what's different with this
> > > particular volume and how to clean it up by hand, and in the unlikely
> > > event that this is a problem in the current development version of
> > > Ceph, perhaps this can be considered a bug report.
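
As an illustration of the question in the quoted message (why a notify op
would appear to carry an extent), here is a minimal sketch of one possible
mechanism. It is a hypothetical, simplified model and not Ceph's code:
FakeOp, make_notify, extent_length, and calc_budget are invented names, and
the 16-byte shared payload layout is an assumption. It only shows that if
per-op fields share storage, a budget routine reading the "extent length" of
a notify-style op is really reading another field, and that a large 64-bit
value narrowed into an int budget can come out negative, which is the kind
of value an assert(c >= 0) would reject. Whether this matches v0.87 would
need checking against the sources linked above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an OSD op: one 16-byte payload that different
// op types interpret differently. NOT Ceph's real layout.
struct FakeOp {
  bool is_data_read;          // assumed "read + data" classification of the opcode
  unsigned char payload[16];  // shared storage for the per-type fields
};

// Fill the payload the way a notify-style op might: {cookie, version}.
FakeOp make_notify(uint64_t cookie, uint64_t version, bool is_data_read) {
  FakeOp op{};
  op.is_data_read = is_data_read;
  std::memcpy(op.payload, &cookie, sizeof cookie);
  std::memcpy(op.payload + 8, &version, sizeof version);
  return op;
}

// Read the same payload as an extent {offset, length} and return length.
// For a notify-style op this returns the version field instead.
uint64_t extent_length(const FakeOp& op) {
  uint64_t length = 0;
  std::memcpy(&length, op.payload + 8, sizeof length);
  return length;
}

// Sum the apparent extent lengths of data-read ops into an int budget.
int calc_budget(const std::vector<FakeOp>& ops) {
  int budget = 0;
  for (const FakeOp& op : ops) {
    if (!op.is_data_read) continue;
    const int64_t len = static_cast<int64_t>(extent_length(op));
    if (len > 0) {
      // Narrowing from 64 bits to int: values of 2^31 or more wrap on
      // common platforms (guaranteed modular since C++20).
      budget = static_cast<int>(budget + len);
    }
  }
  return budget;
}
```

In this model a small version value gives a small positive budget, while a
version of 2^31 or more yields a negative one; an op not classified as a
data read contributes nothing.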