On 2/28/25 13:07, Nikolay Aleksandrov wrote: > On 2/28/25 12:31, Cosmin Ratiu wrote: >> On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: >>> On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: >>>>>> One more thing - note I'm not an xfrm expert by far but it >>>>>> seems to me here you have >>>>>> to also call xdo_dev_state_free() with the old active slave >>>>>> dev otherwise that will >>>>>> never get called with the original real_dev after the switch to >>>>>> a new >>>>>> active slave (or more accurately it might if the GC runs >>>>>> between the switching >>>>>> but it is a race), care must be taken wrt sequence of events >>>>>> because the XFRM >>>>> >>>>> Can we just call xs->xso.real_dev->xfrmdev_ops- >>>>>> xdo_dev_state_free(xs) >>>>> no matter xs->xso.real_dev == real_dev or not? I'm afraid calling >>>>> xdo_dev_state_free() every where may make us lot more easily. >>>>> >>>> >>>> You'd have to check all drivers that implement the callback to >>>> answer that and even then >>>> I'd stick to the canonical way of how it's done in xfrm and make >>>> the bond just passthrough. >>>> Any other games become dangerous and new code will have to be >>>> carefully reviewed every >>>> time, calling another device's free_sa when it wasn't added before >>>> doesn't sound good. >>>> >>>>>> GC may be running in parallel which probably means that in >>>>>> bond_ipsec_free_sa() >>>>>> you'll have to take the mutex before calling >>>>>> xdo_dev_state_free() and check >>>>>> if the entry is still linked in the bond's ipsec list before >>>>>> calling the free_sa >>>>>> callback, if it isn't then del_sa_all got to it before the GC >>>>>> and there's nothing >>>>>> to do if it also called the dev's free_sa callback. The check >>>>>> for real_dev doesn't >>>>>> seem enough to protect against this race. >>>>> >>>>> I agree that we need to take the mutex before calling >>>>> xdo_dev_state_free() >>>>> in bond_ipsec_free_sa(). Do you think if this is enough? I'm a >>>>> bit lot here. >>>>> >>>>> Thanks >>>>> Hangbin >>>> >>>> Well, the race is between the xfrm GC and del_sa_all, in bond's >>>> free_sa if you >>>> walk the list under the mutex before calling real_dev's free >>>> callback and >>>> don't find the current element that's being freed in free_sa then >>>> it was >>>> cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk >>>> that >>>> list and clean the entries. I think it should be fine as long as >>>> free_sa >>>> was called once with the proper device. >>> >>> OK, so the free will be called either in del_sa_all() or free_sa(). >>> Something like this? >>> >> [...] >> >> Unfortunately, after applying these changes and reasoning about them >> for a bit, I don't think this will work. There are still races left. >> For example: >> 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and >> before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all >> is called in parallel, doesn't call delete on xs (because it's dead), >> then calls free (incorrect without delete first), then removes the list >> entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, >> and calls delete (incorrect, out of order with free). Finally, >> bond_ipsec_free_sa is called, which fortunately doesn't do anything >> silly in the new proposed form because xs is no longer in the list. >> >> 2. A more sinister form of the above race can happen when >> bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and >> immediately after __xfrm_state_delete marks xs as DEAD and calls >> bond_ipsec_del_sa() which happily calls delete on real_dev again. >> >> In order to fix these races (and others like it), I think >> bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- >>> lock for each xs being processed. This would prevent xfrm from >> concurrently initiating add/delete operations on the managed states. >> >> Cosmin. > > Duh, right you are. The state is protected by x->lock and cannot be trusted > outside of it. If you take x->lock inside the list walk with the mutex held > you can deadlock. > > Cheers, > Nik > Correction - actually took a closer look at the xfrm code and it should be fine. The x->lock is taken only in the delete path and if the mutex is not acquired by bond's del_sa callback it should be ok. Though this must be very well documented.