Ilya Dryomov <idryomov@xxxxxxxxx> writes: > On Wed, Apr 20, 2023 at 3:22 PM Luís Henriques <lhenriques@xxxxxxx> wrote: >> >> Xiubo Li <xiubli@xxxxxxxxxx> writes: >> >> > On 4/18/23 22:20, Luís Henriques wrote: >> >> xiubli@xxxxxxxxxx writes: >> >> >> >>> From: Xiubo Li <xiubli@xxxxxxxxxx> >> >>> >> >>> When trimming the caps and just after the 'session->s_cap_lock' is >> >>> released in ceph_iterate_session_caps() the cap maybe removed by >> >>> another thread, and when using the stale cap memory in the callbacks >> >>> it will trigger use-after-free crash. >> >>> >> >>> We need to check the existence of the cap just after the 'ci->i_ceph_lock' >> >>> being acquired. And do nothing if it's already removed. >> >> Your patch seems to be OK, but I'll be honest: the locking is *so* complex >> >> that I can say for sure it really solves any problem :-( >> >> >> >> ceph_put_cap() uses mdsc->caps_list_lock to protect the list, but I can't >> >> be sure that holding ci->i_ceph_lock will protect a race in the case >> >> you're trying to solve. >> > >> > The 'mdsc->caps_list_lock' will protect the members in mdsc: >> > >> > /* >> > * Cap reservations >> > * >> > * Maintain a global pool of preallocated struct ceph_caps, referenced >> > * by struct ceph_caps_reservations. This ensures that we preallocate >> > * memory needed to successfully process an MDS response. (If an MDS >> > * sends us cap information and we fail to process it, we will have >> > * problems due to the client and MDS being out of sync.) >> > * >> > * Reservations are 'owned' by a ceph_cap_reservation context. >> > */ >> > spinlock_t caps_list_lock; >> > struct list_head caps_list; /* unused (reserved or >> > unreserved) */ >> > struct list_head cap_wait_list; >> > int caps_total_count; /* total caps allocated */ >> > int caps_use_count; /* in use */ >> > int caps_use_max; /* max used caps */ >> > int caps_reserve_count; /* unused, reserved */ >> > int caps_avail_count; /* unused, unreserved */ >> > int caps_min_count; /* keep at least this many >> > >> > Not protecting the cap list in session or inode. >> > >> > >> > And the racy is between the session's cap list and inode's cap rbtree and both >> > are holding the same 'cap' reference. >> > >> > So in 'ceph_iterate_session_caps()' after getting the 'cap' and releasing the >> > 'session->s_cap_lock', just before passing the 'cap' to _cb() another thread >> > could continue and release the 'cap'. Then the 'cap' should be stale now and >> > after being passed to _cb() the 'cap' when dereferencing it will crash the >> > kernel. >> > >> > And if the 'cap' is stale, it shouldn't exist in the inode's cap rbtree. Please >> > note the lock order will be: >> > >> > 1, spin_lock(&ci->i_ceph_lock) >> > >> > 2, spin_lock(&session->s_cap_lock) >> > >> > >> > Before: >> > >> > ThreadA: ThreadB: >> > >> > __ceph_remove_caps() --> >> > >> > spin_lock(&ci->i_ceph_lock) >> > >> > ceph_remove_cap() --> ceph_iterate_session_caps() --> >> > >> > __ceph_remove_cap() --> spin_lock(&session->s_cap_lock); >> > >> > cap = list_entry(p, struct ceph_cap, session_caps); >> > >> > spin_unlock(&session->s_cap_lock); >> > >> > spin_lock(&session->s_cap_lock); >> > >> > // remove it from the session's cap list >> > >> > list_del_init(&cap->session_caps); >> > >> > spin_unlock(&session->s_cap_lock); >> > >> > ceph_put_cap() >> > >> > trim_caps_cb('cap') --> // the _cb() could be deferred after ThreadA finished >> > 'ceph_put_cap()' >> > >> > spin_unlock(&ci->i_ceph_lock) dreference cap->xxx will trigger crash >> > >> > >> > >> > With this patch: >> > >> > ThreadA: ThreadB: >> > >> > __ceph_remove_caps() --> >> > >> > spin_lock(&ci->i_ceph_lock) >> > >> > ceph_remove_cap() --> ceph_iterate_session_caps() --> >> > >> > __ceph_remove_cap() --> spin_lock(&session->s_cap_lock); >> > >> > cap = list_entry(p, struct ceph_cap, session_caps); >> > >> > ci_node = &cap->ci_node; >> > >> > spin_unlock(&session->s_cap_lock); >> > >> > spin_lock(&session->s_cap_lock); >> > >> > // remove it from the session's cap list >> > >> > list_del_init(&cap->session_caps); >> > >> > spin_unlock(&session->s_cap_lock); >> > >> > ceph_put_cap() >> > >> > trim_caps_cb('ci_node') --> >> > >> > spin_unlock(&ci->i_ceph_lock) >> > >> > spin_lock(&ci->i_ceph_lock) >> > >> > cap = rb_entry(ci_node, struct ceph_cap, ci_node); // This is buggy in this >> > version, we should use the 'mds' instead and I will fix it. >> > >> > if (!cap) { release the spin lock and return directly } >> > >> > spin_unlock(&ci->i_ceph_lock) >> >> Thanks a lot for taking the time to explain all of this. Much >> appreciated. It all seems to make sense, and, again, I don't have any >> real objection to your patch. It's just that I still find the whole >> locking to be too complex, and every change that is made to it looks like >> walking on a mine field :-) >> >> > While we should switch to use the 'mds' of the cap instead of the 'ci_node', >> > which is buggy. I will fix it in next version. >> >> Yeah, I've took a quick look at v4 and it looks like it fixes this. > > Hi Luís, > > Do you mind if I put this down as a Reviewed-by? ;) Sure, feel free to add my Reviewed-by: Luís Henriques <lhenriques@xxxxxxx> (Sorry, forgot to send that explicitly.) Cheers, -- Luís