On 4/18/23 22:20, Luís Henriques wrote:
xiubli@xxxxxxxxxx writes:
From: Xiubo Li <xiubli@xxxxxxxxxx>
When trimming the caps and just after the 'session->s_cap_lock' is
released in ceph_iterate_session_caps() the cap maybe removed by
another thread, and when using the stale cap memory in the callbacks
it will trigger use-after-free crash.
We need to check the existence of the cap just after the 'ci->i_ceph_lock'
being acquired. And do nothing if it's already removed.
Your patch seems to be OK, but I'll be honest: the locking is *so* complex
that I can say for sure it really solves any problem :-(
ceph_put_cap() uses mdsc->caps_list_lock to protect the list, but I can't
be sure that holding ci->i_ceph_lock will protect a race in the case
you're trying to solve.
The 'mdsc->caps_list_lock' will protect the members in mdsc:
/*
* Cap reservations
*
* Maintain a global pool of preallocated struct ceph_caps,
referenced
* by struct ceph_caps_reservations. This ensures that we
preallocate
* memory needed to successfully process an MDS response. (If
an MDS
* sends us cap information and we fail to process it, we will have
* problems due to the client and MDS being out of sync.)
*
* Reservations are 'owned' by a ceph_cap_reservation context.
*/
spinlock_t caps_list_lock;
struct list_head caps_list; /* unused (reserved or
unreserved) */
struct list_head cap_wait_list;
int caps_total_count; /* total caps allocated */
int caps_use_count; /* in use */
int caps_use_max; /* max used caps */
int caps_reserve_count; /* unused, reserved */
int caps_avail_count; /* unused, unreserved */
int caps_min_count; /* keep at least this many
Not protecting the cap list in session or inode.
And the racy is between the session's cap list and inode's cap rbtree
and both are holding the same 'cap' reference.
So in 'ceph_iterate_session_caps()' after getting the 'cap' and
releasing the 'session->s_cap_lock', just before passing the 'cap' to
_cb() another thread could continue and release the 'cap'. Then the
'cap' should be stale now and after being passed to _cb() the 'cap' when
dereferencing it will crash the kernel.
And if the 'cap' is stale, it shouldn't exist in the inode's cap rbtree.
Please note the lock order will be:
1, spin_lock(&ci->i_ceph_lock)
2, spin_lock(&session->s_cap_lock)
Before:
ThreadA: ThreadB:
__ceph_remove_caps() -->
spin_lock(&ci->i_ceph_lock)
ceph_remove_cap() --> ceph_iterate_session_caps() -->
__ceph_remove_cap() --> spin_lock(&session->s_cap_lock);
cap = list_entry(p, struct ceph_cap, session_caps);
spin_unlock(&session->s_cap_lock);
spin_lock(&session->s_cap_lock);
// remove it from the session's cap list
list_del_init(&cap->session_caps);
spin_unlock(&session->s_cap_lock);
ceph_put_cap()
trim_caps_cb('cap') --> // the _cb() could be deferred after ThreadA
finished 'ceph_put_cap()'
spin_unlock(&ci->i_ceph_lock) dreference cap->xxx will trigger crash
With this patch:
ThreadA: ThreadB:
__ceph_remove_caps() -->
spin_lock(&ci->i_ceph_lock)
ceph_remove_cap() --> ceph_iterate_session_caps() -->
__ceph_remove_cap() --> spin_lock(&session->s_cap_lock);
cap = list_entry(p, struct ceph_cap, session_caps);
ci_node = &cap->ci_node;
spin_unlock(&session->s_cap_lock);
spin_lock(&session->s_cap_lock);
// remove it from the session's cap list
list_del_init(&cap->session_caps);
spin_unlock(&session->s_cap_lock);
ceph_put_cap()
trim_caps_cb('ci_node') -->
spin_unlock(&ci->i_ceph_lock)
spin_lock(&ci->i_ceph_lock)
cap = rb_entry(ci_node, struct ceph_cap, ci_node); // This is buggy
in this version, we should use the 'mds' instead and I will fix it.
if (!cap) { release the spin lock and return directly }
spin_unlock(&ci->i_ceph_lock)
While we should switch to use the 'mds' of the cap instead of the
'ci_node', which is buggy. I will fix it in next version.
Is the issue in that bugzilla reproducible, or was that a one-time thing?
No, I don't think so. Locally I have tried by turning the mds options to
trigger the cap reclaiming more frequently, but still couldn't reproduce
it. It should be very corner case.
Thanks
- Xiubo
Cheers,