On Tue, 12 Aug 2008, Simon Horman wrote:

> On Tue, Aug 12, 2008 at 12:57:21AM +0200, Sven Wegener wrote:
> > Both schedulers have a race condition that happens in the following
> > situation:
> >
> > We have an entry in our table that has already expired according to its
> > last use time. Then we need to schedule a new connection that uses this
> > entry.
> >
> > CPU 1                          CPU 2
> >
> > ip_vs_lblc_schedule()
> >   ip_vs_lblc_get()
> >     lock table for read
> >     find entry
> >     unlock table
> >                                ip_vs_lblc_check_expire()
> >                                  lock table for write
> >                                  kfree() expired entry
> >                                  unlock table
> >     return invalid entry
> >
> > The problem is that we assign the last use time outside of our critical
> > region. We can make hitting this race more difficult, if not impossible,
> > by assigning the last use time while still holding the lock for reading.
> > That gives us six minutes during which it's safe to use the entry, which
> > should be enough for our use case, as we're going to use it immediately
> > and don't keep a long reference to it.
> >
> > We're holding the lock for reading and not for writing. The last use time
> > is an unsigned long, so the assignment should be atomic by itself. And we
> > don't care if some other user sets it to a slightly different value. The
> > read_unlock() implies a barrier, so that other CPUs see the new last use
> > time during cleanup, even though we're just using a read lock.
> >
> > Other solutions would be: 1) protect the whole ip_vs_lblc_schedule() with
> > write_lock()ing the lock, 2) add reference counting for the entries, 3)
> > protect each entry with its own lock. All of them are bad for performance.
> >
> > Comments? Ideas?
>
> Is there a pathological case here if sysctl_ip_vs_lblc_expiration is
> set to be very short and we happen to hit ip_vs_lblc_full_check()?

Yes.

> To be honest I think that I like the reference count approach best,
> as it seems safe and simple. Is it really going to be horrible
> for performance?
Probably not, I guess the sentence was a bit pessimistic.

> If so, I wonder if a workable solution would be to provide a more
> fine-grained lock on tbl. Something like the way that
> ct_read_lock/unlock() works.

Also possible. But I guess I was thinking too complicated last night.
What I was after with the "protect the whole ip_vs_lblc_schedule() with
write_lock()ing the lock" idea was also to simply prevent someone from
adding duplicate entries. If we just extend the read_lock() region to
cover the whole usage of the entry and do an additional duplicate check
when inserting the entry under write_lock(), we fix the issue and also
fix the race where someone may add duplicate entries. We have a bit of
overhead, because we unlock/lock again, and there is also a chance of
doing one or more useless __ip_vs_wlc_schedule() calls, but in the end
this should work. More fine-grained locking could help to lower the
impact.

Sven
--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html