On Wed, Aug 13, 2008 at 08:59:32AM +0200, Sven Wegener wrote:
> On Wed, 13 Aug 2008, Simon Horman wrote:
> 
> > On Tue, Aug 12, 2008 at 03:07:48PM +0200, Sven Wegener wrote:
> > > On Tue, 12 Aug 2008, Sven Wegener wrote:
> > > 
> > > > On Tue, 12 Aug 2008, Simon Horman wrote:
> > > > 
> > > > > On Tue, Aug 12, 2008 at 12:57:21AM +0200, Sven Wegener wrote:
> > > > > > Both schedulers have a race condition that happens in the
> > > > > > following situation:
> > > > > > 
> > > > > > We have an entry in our table that has already expired
> > > > > > according to its last use time. Then we need to schedule a
> > > > > > new connection that uses this entry.
> > > > > > 
> > > > > > CPU 1                           CPU 2
> > > > > > 
> > > > > > ip_vs_lblc_schedule()
> > > > > >   ip_vs_lblc_get()
> > > > > >     lock table for read
> > > > > >     find entry
> > > > > >     unlock table
> > > > > >                                 ip_vs_lblc_check_expire()
> > > > > >                                   lock table for write
> > > > > >                                   kfree() expired entry
> > > > > >                                   unlock table
> > > > > >     return invalid entry
> > > > > > 
> > > > > > The problem is that we assign the last use time outside of our
> > > > > > critical region. We can make hitting this race more difficult,
> > > > > > if not impossible, if we assign the last use time while still
> > > > > > holding the lock for reading. That gives us six minutes during
> > > > > > which it's safe to use the entry, which should be enough for
> > > > > > our use case, as we're going to use it immediately and don't
> > > > > > keep a long reference to it.
> > > > > > 
> > > > > > We're holding the lock for reading and not for writing. The
> > > > > > last use time is an unsigned long, so the assignment should be
> > > > > > atomic by itself. And we don't care if some other user sets it
> > > > > > to a slightly different value. The read_unlock() implies a
> > > > > > barrier so that other CPUs see the new last use time during
> > > > > > cleanup, even if we're just using a read lock.
> > > > > > 
> > > > > > Other solutions would be: 1) protect the whole
> > > > > > ip_vs_lblc_schedule() with write_lock()ing the lock, 2) add
> > > > > > reference counting for the entries, 3) protect each entry with
> > > > > > its own lock. And all are bad for performance.
> > > > > > 
> > > > > > Comments? Ideas?
> > > > > 
> > > > > Is there a pathological case here if sysctl_ip_vs_lblc_expiration
> > > > > is set to be very short and we happen to hit
> > > > > ip_vs_lblc_full_check()?
> > > > 
> > > > Yes.
> > > > 
> > > > > To be honest I think that I like the reference count approach
> > > > > best, as it seems safe and simple. Is it really going to be
> > > > > horrible for performance?
> > > > 
> > > > Probably not, I guess the sentence was a bit pessimistic.
> > > > 
> > > > > If so, I wonder if a workable solution would be to provide a
> > > > > more fine-grained lock on tbl. Something like the way that
> > > > > ct_read_lock/unlock() works.
> > > > 
> > > > Also possible. But I guess I was thinking too complicated last
> > > > night. What I was after with the "protect the whole
> > > > ip_vs_lblc_schedule() with write_lock()ing the lock" was also to
> > > > simply prevent someone adding duplicate entries. If we just extend
> > > > the read_lock() region to cover the whole usage of the entry and
> > > > do an additional duplicate check when inserting the entry under
> > > > write_lock(), we fix the issue and also fix the race that someone
> > > > may add duplicate entries. We have a bit of overhead, because we
> > > > unlock/lock and also there is a chance of doing one or more
> > > > useless __ip_vs_wlc_schedule() calls, but in the end this should
> > > > work. More fine-grained locking could help to lower the impact.
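Just to make the fine-grained idea concrete, below is a rough userspace
sketch, not the actual ct_read_lock()/ct_read_unlock() code: all names
are invented, and pthreads rwlocks stand in for the kernel's rwlock_t.
The point is simply an array of locks indexed by the bucket hash, so
lookups in different buckets don't serialise on one table-wide lock.

#include <pthread.h>
#include <stdio.h>

#define TAB_BITS        10
#define TAB_SIZE        (1 << TAB_BITS)
#define LOCK_BITS       4
#define NR_LOCKS        (1 << LOCK_BITS)

struct entry {
        struct entry *next;
        unsigned int addr;              /* stand-in for __be32 daddr */
        unsigned long lastuse;
};

static struct entry *table[TAB_SIZE];
static pthread_rwlock_t locks[NR_LOCKS];

/* buckets share locks: bucket h is covered by lock h % NR_LOCKS */
static pthread_rwlock_t *bucket_lock(unsigned int hash)
{
        return &locks[hash & (NR_LOCKS - 1)];
}

static unsigned int hashkey(unsigned int addr)
{
        return (addr * 2654435761u) & (TAB_SIZE - 1);
}

/* a lookup only takes the one lock stripe covering its bucket */
static struct entry *lookup(unsigned int addr, unsigned long now)
{
        unsigned int hash = hashkey(addr);
        pthread_rwlock_t *lock = bucket_lock(hash);
        struct entry *en;

        pthread_rwlock_rdlock(lock);
        for (en = table[hash]; en; en = en->next)
                if (en->addr == addr) {
                        en->lastuse = now;      /* refresh while locked */
                        break;
                }
        pthread_rwlock_unlock(lock);
        return en;                      /* NULL if not found */
}

int main(void)
{
        int i;

        for (i = 0; i < NR_LOCKS; i++)
                pthread_rwlock_init(&locks[i], NULL);

        /* buckets 0 and NR_LOCKS map to the same lock stripe */
        printf("shared stripe: %d\n",
               bucket_lock(0) == bucket_lock(NR_LOCKS));
        printf("lookup miss: %p\n", (void *)lookup(42, 1));
        return 0;
}

Note this only spreads contention; the entry-lifetime question we are
discussing is unchanged, and the expiry walk becomes slightly more
fiddly because it has to take the stripes one at a time.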
> > > I wondered if this whole thing can ever be totally race condition
> > > free, without changing how destinations are purged from the trash.
> > > Initial version of the patch below. Basically it pushes the locking
> > > up into ip_vs_lblc_schedule() and rearranges the code so that we
> > > can be sure we have a valid destination.
> > 
> > Hi Sven,
> > 
> > I think that I am happy with this, except for updating lastuse
> > under a read lock, is it really safe? Do you still have any concerns?
> > 
> > I've annotated the code with my thoughts on how your locking is
> > working - thinking aloud if you like.
> 
> Your locking description is exactly how it was intended by me.

Great, I wanted to make sure that we are both on the same page.
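To spell out the scheme before the patch itself, here is a toy userspace
model, with invented names and a single list plus one pthreads rwlock
standing in for the real table: the lookup refreshes lastuse while the
read lock is still held, and the insert path re-checks for a duplicate
under the write lock, because another CPU may have inserted the same
entry between our read_unlock() and write_lock().

#include <pthread.h>
#include <stdlib.h>

struct entry {
        struct entry *next;
        unsigned int addr;
        int dest;               /* stand-in for struct ip_vs_dest * */
        unsigned long lastuse;
};

static struct entry *head;
static pthread_rwlock_t tbl_lock = PTHREAD_RWLOCK_INITIALIZER;

/* caller must hold tbl_lock (read or write) */
static struct entry *get(unsigned int addr)
{
        struct entry *en;

        for (en = head; en; en = en->next)
                if (en->addr == addr)
                        return en;
        return NULL;
}

/* caller holds the write lock; the re-check prevents duplicates */
static struct entry *new_entry(unsigned int addr, int dest,
                               unsigned long now)
{
        struct entry *en = get(addr);

        if (!en) {
                en = calloc(1, sizeof(*en));
                if (!en)
                        return NULL;
                en->addr = addr;
                en->dest = dest;
                en->lastuse = now;
                en->next = head;
                head = en;
        } else if (en->dest != dest) {
                en->dest = dest;        /* kernel code also moves refcnt */
        }
        return en;
}

int schedule_dest(unsigned int addr, int fallback, unsigned long now)
{
        struct entry *en;
        int dest = -1;

        pthread_rwlock_rdlock(&tbl_lock);
        en = get(addr);
        if (en) {
                en->lastuse = now;      /* inside the critical region */
                dest = en->dest;
        }
        pthread_rwlock_unlock(&tbl_lock);

        if (dest >= 0)
                return dest;            /* cache hit */

        /* miss: pick a destination, then insert under the write lock;
         * if the insert fails we still use the valid destination */
        dest = fallback;
        pthread_rwlock_wrlock(&tbl_lock);
        new_entry(addr, dest, now);
        pthread_rwlock_unlock(&tbl_lock);
        return dest;
}

The window between the two locked sections is where the odd redundant
__ip_vs_wlc_schedule() can happen, as noted above; that is harmless,
just wasted work.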
> > > diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
> > > index 7a6a319..67f7b04 100644
> > > --- a/net/ipv4/ipvs/ip_vs_lblc.c
> > > +++ b/net/ipv4/ipvs/ip_vs_lblc.c
> > > @@ -123,31 +123,6 @@ static ctl_table vs_vars_table[] = {
> > >  
> > >  static struct ctl_table_header * sysctl_header;
> > >  
> > > -/*
> > > - * new/free a ip_vs_lblc_entry, which is a mapping of a destination
> > > - * IP address to a server.
> > > - */
> > > -static inline struct ip_vs_lblc_entry *
> > > -ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
> > > -{
> > > -        struct ip_vs_lblc_entry *en;
> > > -
> > > -        en = kmalloc(sizeof(struct ip_vs_lblc_entry), GFP_ATOMIC);
> > > -        if (en == NULL) {
> > > -                IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> > > -                return NULL;
> > > -        }
> > > -
> > > -        INIT_LIST_HEAD(&en->list);
> > > -        en->addr = daddr;
> > > -
> > > -        atomic_inc(&dest->refcnt);
> > > -        en->dest = dest;
> > > -
> > > -        return en;
> > > -}
> > > -
> > > -
> > >  static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en)
> > >  {
> > >          list_del(&en->list);
> > > @@ -189,10 +164,8 @@ ip_vs_lblc_hash(struct ip_vs_lblc_table *tbl, struct ip_vs_lblc_entry *en)
> > >           */
> > >          hash = ip_vs_lblc_hashkey(en->addr);
> > >  
> > > -        write_lock(&tbl->lock);
> > >          list_add(&en->list, &tbl->bucket[hash]);
> > >          atomic_inc(&tbl->entries);
> > > -        write_unlock(&tbl->lock);
> > >  
> > >          return 1;
> > >  }
> > > @@ -209,23 +182,54 @@ ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
> > >  
> > >          hash = ip_vs_lblc_hashkey(addr);
> > >  
> > > -        read_lock(&tbl->lock);
> > > -
> > >          list_for_each_entry(en, &tbl->bucket[hash], list) {
> > >                  if (en->addr == addr) {
> > >                          /* HIT */
> > > -                        read_unlock(&tbl->lock);
> > >                          return en;
> > >                  }
> > >          }
> > >  
> > > -        read_unlock(&tbl->lock);
> > > -
> > >          return NULL;
> > >  }
> > >  
> > >  
> > >  /*
> > > + * new/free a ip_vs_lblc_entry, which is a mapping of a destination
> > > + * IP address to a server.
> > > + */
> > > +static inline struct ip_vs_lblc_entry *
> > > +ip_vs_lblc_new(struct ip_vs_lblc_table *tbl, __be32 daddr,
> > > +               struct ip_vs_dest *dest)
> > > +{
> > > +        struct ip_vs_lblc_entry *en;
> > > +
> > > +        en = ip_vs_lblc_get(tbl, daddr);
> > > +        if (!en) {
> > > +                en = kmalloc(sizeof(*en), GFP_ATOMIC);
> > > +                if (!en) {
> > > +                        IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> > > +                        return NULL;
> > > +                }
> > > +
> > > +                INIT_LIST_HEAD(&en->list);
> > > +                en->addr = daddr;
> > > +                en->lastuse = jiffies;
> > > +
> > > +                atomic_inc(&dest->refcnt);
> > > +                en->dest = dest;
> > > +
> > > +                ip_vs_lblc_hash(tbl, en);
> > > +        } else if (en->dest != dest) {
> > > +                atomic_dec(&en->dest->refcnt);
> > > +                atomic_inc(&dest->refcnt);
> > > +                en->dest = dest;
> > > +        }
> > > +
> > > +        return en;
> > > +}
> > > +
> > > +
> > > +/*
> > >   * Flush all the entries of the specified table.
> > >   */
> > >  static void ip_vs_lblc_flush(struct ip_vs_lblc_table *tbl)
> > > @@ -484,7 +488,7 @@ is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
> > >  static struct ip_vs_dest *
> > >  ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
> > >  {
> > > -        struct ip_vs_dest *dest;
> > > +        struct ip_vs_dest *dest = NULL;
> > >          struct ip_vs_lblc_table *tbl;
> > >          struct ip_vs_lblc_entry *en;
> > >          struct iphdr *iph = ip_hdr(skb);
> > > @@ -492,38 +496,43 @@ ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
> > >          IP_VS_DBG(6, "ip_vs_lblc_schedule(): Scheduling...\n");
> > >  
> > >          tbl = (struct ip_vs_lblc_table *)svc->sched_data;
> > > +        /* First look in our cache */
> > > +        read_lock(&tbl->lock);
> > >          en = ip_vs_lblc_get(tbl, iph->daddr);
> > > -        if (en == NULL) {
> > > -                dest = __ip_vs_wlc_schedule(svc, iph);
> > > -                if (dest == NULL) {
> > > -                        IP_VS_DBG(1, "no destination available\n");
> > > -                        return NULL;
> > > -                }
> > > -                en = ip_vs_lblc_new(iph->daddr, dest);
> > > -                if (en == NULL) {
> > > -                        return NULL;
> > > -                }
> > > -                ip_vs_lblc_hash(tbl, en);
> > > -        } else {
> > > +        if (en) {
> > >                  dest = en->dest;
> > > -                if (!(dest->flags & IP_VS_DEST_F_AVAILABLE)
> > > -                    || atomic_read(&dest->weight) <= 0
> > > -                    || is_overloaded(dest, svc)) {
> > > -                        dest = __ip_vs_wlc_schedule(svc, iph);
> > > -                        if (dest == NULL) {
> > > -                                IP_VS_DBG(1, "no destination available\n");
> > > -                                return NULL;
> > > -                        }
> > > -                        atomic_dec(&en->dest->refcnt);
> > > -                        atomic_inc(&dest->refcnt);
> > > -                        en->dest = dest;
> > > -                }
> > > -        }
> > > +                en->lastuse = jiffies;
> > 
> > Setting en->lastuse with a read lock held makes me feel uncomfortable.
> > Are you sure it is safe?
> 
> It looks suspicious, but word-size writes should be atomic by
> themselves. So even doing it under a read lock should not lead to
> corrupted values. We don't care if the values that are set concurrently
> are overwritten, because they should only differ by a couple of
> jiffies. Might be good to have someone else comment on the issue too.

Ok. I was only worried about the atomicity. I agree that having the
value rewritten and potentially slightly inaccurate is not a concern.
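For what it's worth, in portable C that word-size-store assumption has
to be spelled out; here is a C11 sketch of what the two sides rely on,
with hypothetical helper names. (Current kernels express the same
intent with READ_ONCE()/WRITE_ONCE(), which don't exist at this point;
inside the kernel the plain store is fine on the architectures we
support.)

#include <stdatomic.h>

struct entry {
        /* ... other fields ... */
        atomic_ulong lastuse;   /* plain unsigned long in the kernel */
};

/*
 * Reader side, called under read_lock(): several CPUs may store
 * concurrently, but each store is indivisible, so the cleanup side
 * can never observe a torn value -- one writer simply wins.
 */
static inline void entry_touch(struct entry *en, unsigned long now)
{
        atomic_store_explicit(&en->lastuse, now, memory_order_relaxed);
}

/*
 * Cleanup side, called under write_lock(): the read_unlock() /
 * write_lock() pair orders the reader's store before this load,
 * which is the barrier argument made at the top of the thread.
 */
static inline int entry_expired(const struct entry *en,
                                unsigned long now, unsigned long timeout)
{
        unsigned long last = atomic_load_explicit(&en->lastuse,
                                                  memory_order_relaxed);
        return now - last > timeout;
}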
> > > +
> > > +                /*
> > > +                 * If the destination is not available, i.e. it's
> > > +                 * in the trash, ignore it, as it may be removed
> > > +                 * from under our feet
> > > +                 */
> > > +                if (!(dest->flags & IP_VS_DEST_F_AVAILABLE))
> > > +                        dest = NULL;
> > >          }
> > > -        en->lastuse = jiffies;
> > 
> > At this point we know that dest exists and is not in the trash.
> > We know that dest can't disappear given that it's not already
> > in the trash and our caller holds a read lock on __ip_vs_svc_lock.
> > So dest will be safe until after ip_vs_lblc_schedule() returns.
> > 
> > Dest seems ok :-)
> > 
> > Ok, that seems complex but non-racy to me :-)
> > 
> > Perhaps a longer comment would be in order.
> 
> Yeah, that's the point where we prevent the race with the trash
> purging. We could change the purging of destinations from the trash to
> be under __ip_vs_svc_lock write locked, then we know that all
> destinations, even the ones in the trash, are valid. Might make more
> sense than duplicating this logic in other schedulers.

That sounds like a good way to simplify things.

> > > +        read_unlock(&tbl->lock);
> > 
> > After the lock is released we may now lose en, but we don't care
> > as it's not used again inside ip_vs_lblc_schedule().
> > 
> > So far so good :-)
> > 
> > > +
> > > +        /* If the destination has a weight and is not overloaded, use it */
> > > +        if (dest && atomic_read(&dest->weight) > 0 && !is_overloaded(dest, svc))
> > > +                goto out;
> > > +
> > > +        /* No cache entry or it is invalid, time to schedule */
> > > +        dest = __ip_vs_wlc_schedule(svc, iph);
> > > +        if (!dest) {
> > > +                IP_VS_DBG(1, "no destination available\n");
> > > +                return NULL;
> > > +        }
> > 
> > This new dest must also not be in the trash by virtue of
> > being returned by __ip_vs_wlc_schedule() and deletions from
> > svc->destinations being protected by __ip_vs_svc_lock, which
> > is held by the caller.
> > 
> > Also good :-)
> > 
> > BTW, the name of __ip_vs_wlc_schedule really ought to be changed
> > to __ip_vs_lblc_schedule. I'll make a trivial patch for that.
> > 
> > > +
> > > +        /* If we fail to create a cache entry, we'll just use the valid dest */
> > > +        write_lock(&tbl->lock);
> > > +        ip_vs_lblc_new(tbl, iph->daddr, dest);
> > > +        write_unlock(&tbl->lock);
> > 
> > This write lock looks obviously correct as the destructors
> > do the same thing.
> > 
> > > +out:
> > >          IP_VS_DBG(6, "LBLC: destination IP address %u.%u.%u.%u "
> > >                    "--> server %u.%u.%u.%u:%d\n",
> > > -                  NIPQUAD(en->addr),
> > > +                  NIPQUAD(iph->daddr),
> > >                    NIPQUAD(dest->addr),
> > >                    ntohs(dest->port));
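As a closing aside, if the lastuse trick ever turns out to be too
fragile, the reference-counting alternative from the top of the thread
would look roughly like this, in invented userspace terms with C11
atomics standing in for the kernel's atomic_t:

#include <stdatomic.h>
#include <stdlib.h>

struct entry {
        atomic_int refcnt;      /* the table itself holds one reference */
        unsigned int addr;
        unsigned long lastuse;
};

static struct entry *entry_alloc(unsigned int addr)
{
        struct entry *en = calloc(1, sizeof(*en));

        if (en) {
                en->addr = addr;
                atomic_init(&en->refcnt, 1);    /* table's reference */
        }
        return en;
}

/* called with the table lock held (read is enough) */
static struct entry *entry_get(struct entry *en)
{
        atomic_fetch_add(&en->refcnt, 1);
        return en;
}

/* free only once the table and every lookup are done with it */
static void entry_put(struct entry *en)
{
        if (atomic_fetch_sub(&en->refcnt, 1) == 1)
                free(en);
}

/*
 * The scheduler would then do:
 *
 *      read_lock(&tbl->lock);
 *      en = ip_vs_lblc_get(tbl, addr);
 *      if (en)
 *              entry_get(en);
 *      read_unlock(&tbl->lock);
 *      ... use en safely ...
 *      if (en)
 *              entry_put(en);
 *
 * and the expiry timer would unlink the entry under the write lock
 * and entry_put() it, so the free is deferred past any user.
 */

The cost is the two extra atomic operations per scheduled connection
that the "bad for performance" remark at the top of the thread was
worried about.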