Re: rbd_open_by_id crash when connection timeout

On 12/7/2019 3:56 PM, yangjun@xxxxxxxxxxxxxxxxxxxx wrote:
Hi Dongsheng,
Regarding https://tracker.ceph.com/issues/43178: can I open a PR to fix it like you said? Thanks.


Sure, feel free to open a PR to fix it.


------------------------------------------------------------------------
yangjun@xxxxxxxxxxxxxxxxxxxx

    *From:* Jason Dillaman <jdillama@xxxxxxxxxx>
    *Date:* 2019-12-06 22:58
    *To:* Dongsheng Yang <dongsheng.yang@xxxxxxxxxxxx>
    *CC:* yangjun@xxxxxxxxxxxxxxxxxxxx; ceph-users <ceph-users@xxxxxxx>
    *Subject:* Re: rbd_open_by_id crash when connection timeout
    On Fri, Dec 6, 2019 at 9:51 AM Dongsheng Yang
    <dongsheng.yang@xxxxxxxxxxxx> wrote:
    >
    >
    > > On 12/6/2019 9:46 PM, Jason Dillaman wrote:
    > > On Fri, Dec 6, 2019 at 12:12 AM Dongsheng Yang
    > > <dongsheng.yang@xxxxxxxxxxxx> wrote:
    > >>
    > >>
    > >> On 12/06/2019 12:50 PM, yangjun@xxxxxxxxxxxxxxxxxxxx wrote:
    > >>
    > >> Hi Jason, Dongsheng,
    > >> I found a problem using rbd_open_by_id when the connection times
    > >> out (errno = 110, ceph version 12.2.8; rbd_open_by_id is unchanged
    > >> in the master branch).
    > >> int r = ictx->state->open(false);
    > >> if (r < 0) {      //  r = -110
    > >>     delete ictx;   // crash, the stack is shown below:
    > >> } else {
    > >>     *image = (rbd_image_t)ictx;
    > >> }
    > >> The stack is:
    > >> (gdb) bt
    > >> Program terminated with signal 11, Segmentation fault.
    > >> #0  Mutex::Lock (this=this@entry=0x8,
    no_lockdep=no_lockdep@entry=false)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/src/common/Mutex.cc:92
    > >> #1  0x0000ffff765eb814 in Locker (m=..., this=<synthetic
    pointer>)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/src/common/Mutex.h:115
    > >> #2  PerfCountersCollection::remove (this=0x0,
    l=0xffff85bf0588 <main_arena+88>)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/src/common/perf_counters.cc:62
    > >> #3  0x0000ffff757f26c4 in librbd::ImageCtx::perf_stop
    (this=0x2065d2f0)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/src/librbd/ImageCtx.cc:426
    > >> #4  0x0000ffff757f6e48 in librbd::ImageCtx::~ImageCtx
    (this=0x2065d2f0, __in_chrg=<optimized out>)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/src/librbd/ImageCtx.cc:239
    > >> #5  0x0000ffff757e8e70 in rbd_open_by_id (p=p@entry=0x2065e780,
    > >>      id=id@entry=0xffff85928bb4 "20376aa706c0",
    image=image@entry=0xffff75b9b0b8,
    > >>      snap_name=snap_name@entry=0x0) at
    /var/lib/jenkins/workspace/ceph_L/ceph/src/librbd/librbd.cc:2692
    > >> #6  0x0000ffff75aed524 in __pyx_pf_3rbd_5Image___init__ (
    > >>      __pyx_v_read_only=0xffff85f02bb8 <_Py_ZeroStruct>,
    __pyx_v_snapshot=0xffff85f15938 <_Py_NoneStruct>,
    > >>      __pyx_v_by_id=0xffff85f02ba0 <_Py_TrueStruct>,
    __pyx_v_name=0xffff85928b90,
    > >>      __pyx_v_ioctx=0xffff7f0283d0, __pyx_v_self=0xffff75b9b0a8)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/build/src/pybind/rbd/pyrex/rbd.c:12662
    > >> #7  __pyx_pw_3rbd_5Image_1__init__
    (__pyx_v_self=0xffff75b9b0a8, __pyx_args=<optimized out>,
    > >>      __pyx_kwds=<optimized out>)
    > >>      at
    /var/lib/jenkins/workspace/ceph_L/ceph/build/src/pybind/rbd/pyrex/rbd.c:12378
    > >> #8  0x0000ffff85e02924 in type_call () from
    /lib64/libpython2.7.so.1.0
    > >> #9  0x0000ffff85dac254 in PyObject_Call () from
    /lib64/libpython2.7.so.1.0
    > > I suspect you will hit tons of errors, corruption, and leaked RADOS
    > > objects when you start injecting RADOS timeouts. It's not something
    > > librbd really supports or tests against -- so you need to expect the
    > > unexpected. Definitely feel free to open a PR to fix it, though --
    > > otherwise it would be a really *low* priority to fix.
    > >
    > >> Meanwhile, I have some questions about rbd_open: if state->open
    > >> fails (return value r < 0), is there a risk of a memory leak of
    > >> ictx?
    > >> extern "C" int rbd_open(rados_ioctx_t p, const char *name,
    > >>                         rbd_image_t *image, const char *snap_name)
    > >> {
    > >>    librados::IoCtx io_ctx;
    > >>    librados::IoCtx::from_rados_ioctx_t(p, io_ctx);
    > >>    TracepointProvider::initialize<tracepoint_traits>(get_cct(io_ctx));
    > >>    librbd::ImageCtx *ictx = new librbd::ImageCtx(name, "", snap_name,
    > >>                                                  io_ctx, false);
    > >>    tracepoint(librbd, open_image_enter, ictx, ictx->name.c_str(),
    > >>               ictx->id.c_str(), ictx->snap_name.c_str(),
    > >>               ictx->read_only);
    > >>
    > >>    int r = ictx->state->open(false);
    > >>    if (r >= 0) {     // if r < 0, is there a risk of a memory leak?
    > >>
    > >>
    > >> Good catch. I think there is, and the same at rbd_open_read_only.
    > >> It seems we need a patch to fix them.
    > > No, that's not a memory leak. The blocking version of
    > > "ImageState::open" will free the "ImageCtx" memory upon failure.
    >
    >
    > My bad, I did not look at it carefully before replying to this mail.
    >
    > I saw that we delete ictx in rbd_open_by_id but not in rbd_open, so I
    > thought there was a memory leak in rbd_open. But as you mentioned, I
    > was wrong: it is released in state->open(). So we don't need to add a
    > delete in rbd_open(); on the contrary, we need to remove the "delete
    > ictx" to avoid a double-free problem, as below:
    >
    >
    > diff --git a/src/librbd/librbd.cc b/src/librbd/librbd.cc
    > index 6c303c9..b518fad 100644
    > --- a/src/librbd/librbd.cc
    > +++ b/src/librbd/librbd.cc
    > @@ -3924,9 +3924,7 @@ extern "C" int rbd_open_by_id(rados_ioctx_t p, const char *id,
    >                ictx->id.c_str(), ictx->snap_name.c_str(), ictx->read_only);
    >
    >     int r = ictx->state->open(0);
    > -  if (r < 0) {
    > -    delete ictx;
    > -  } else {
    > +  if (r >= 0) {
    >       *image = (rbd_image_t)ictx;
    >     }
    >     tracepoint(librbd, open_image_exit, r);
    > @@ -3999,9 +3997,7 @@ extern "C" int rbd_open_by_id_read_only(rados_ioctx_t p, const char *id,
    >                ictx->id.c_str(), ictx->snap_name.c_str(), ictx->read_only);
    >
    >     int r = ictx->state->open(0);
    > -  if (r < 0) {
    > -    delete ictx;
    > -  } else {
    > +  if (r >= 0) {
    >       *image = (rbd_image_t)ictx;
    >     }
    >
    >     tracepoint(librbd, open_image_exit, r);
    >
    >
    > A reproducer, test.c, is attached.
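    (For reference, a rough standalone sketch of such a reproducer -- it is
    not the attached test.c. The pool name and timeout values are
    placeholders, the image id is the one from the stack trace above, and
    it assumes the cluster becomes unreachable, or the timeouts trip,
    while the image is being opened.)

    /* build: cc test.c -lrados -lrbd */
    #include <stdio.h>
    #include <rados/librados.h>
    #include <rbd/librbd.h>

    int main(void)
    {
      rados_t cluster;
      rados_ioctx_t io_ctx;
      rbd_image_t image;
      int r;

      r = rados_create(&cluster, NULL);
      if (r < 0)
        return 1;
      rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
      /* make librados return -ETIMEDOUT (-110) instead of blocking forever;
       * the values here are placeholders */
      rados_conf_set(cluster, "client_mount_timeout", "5");
      rados_conf_set(cluster, "rados_mon_op_timeout", "5");
      rados_conf_set(cluster, "rados_osd_op_timeout", "5");

      r = rados_connect(cluster);
      if (r < 0)
        goto out_cluster;
      r = rados_ioctx_create(cluster, "rbd", &io_ctx);  /* placeholder pool */
      if (r < 0)
        goto out_cluster;

      /* if the open times out (r = -110), the unpatched rbd_open_by_id
       * deletes ictx a second time and crashes here */
      r = rbd_open_by_id(io_ctx, "20376aa706c0", &image, NULL);
      printf("rbd_open_by_id returned %d\n", r);
      if (r == 0)
        rbd_close(image);

      rados_ioctx_destroy(io_ctx);
    out_cluster:
      rados_shutdown(cluster);
      return r < 0 ? 1 : 0;
    }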
    Opened https://tracker.ceph.com/issues/43178
    > Thanx
    >
    > Yang
    >
    > >
    > >> Thanx
    > >> Yang
    > >>
    > >>      *image = (rbd_image_t)ictx;
    > >>    }
    > >>    tracepoint(librbd, open_image_exit, r);
    > >>    return r;
    > >> }
    > >>
    > >> Can you give me some advice? Many thanks.
    > >>
    > >>
    > >> ________________________________
    > >> yangjun@xxxxxxxxxxxxxxxxxxxx
    > >>
    > >>
    > >
    Thanks,
-- Jason

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



