Hey Julian,
Thank you for your quick and thoughtful response :)
On 2/23/15 11:30 PM, Julian Anastasov wrote:
We have a custom scheduler that is failing some atomic allocations
You can use GFP_KERNEL from add_dest method.
That should definitely improve things for us.
Can a new create_dest(svc, bool for_update) method
solve such problem? Such modifications are serialized, so
create_dest can attach new structure to sched_data just before
the add_dest/upd_dest call. IPVS does not clone dests on
update but scheduler may need this for RCU update.
create_dest should be called early in __ip_vs_update_dest.
Note that kfree(dest) is not enough, see ip_vs_dest_free(),
more fields may be allocated before __ip_vs_update_dest is
reached.
More context on the custom scheduler:
It's a consistent hash ring algorithm which requires sorting all of the
members and then inserting them one at a time (a entry per weight). It
requires a massive allocation to load balance to the number of machines
we need to.
Right here I'm going to admit that this is not ideal. This is somewhat
"inherited," but I'm currently stuck solving this problem so I'll make
the best of it :)
So now we're in this position where adds, deletes, and updates all can
force us to recalculate this giant hash ring. We can either simply lock
the ring (but this actually takes long enough with a large number of
real servers with large weights that it causes soft lock up detection!)
or do RCU (a new allocation every time). We're trying out the RCU thing
now which is where this hurt us.
One aside, if it's safe to call synchronize_rcu in that path, then it
may solve our most painful allocation problem in part.
tl;dr; it's an implementation in which you allocate on add, update, and
delete.
Patch 3:
In the __ip_vs_update_dest change you are missing
that ip_vs_rs_hash was called. In fact, I think we
should call __ip_vs_unlink_dest(svc, dest, 0) and
__ip_vs_del_dest(svc->net, dest, false)
from ip_vs_add_dest() when both __ip_vs_update_dest(, 1)
and ip_vs_new_dest() fail. This should work because
add_dest call is the last thing we do on ADD, so
the DEL operation but without calling del_dest should
be ok. Such change invalidates Patch 1.
Why above solution is preferred:
At first look, scheduler that can return error from add_dest
should not provide this dest to RCU readers via its schedule
method and kfree(dest) looks ok.
The problem is with the ip_vs_has_real_service() call
and the ip_vs_rs_hash() call we do before add_dest,
dest still can be accessed by RCU readers.
For this reason we can not just kfree(dest).
Pending the rest of your reply I will reimplement it in this way.
Patch 4:
As for del_dest result, failure should not happen,
we should be able to flush dests on module cleanup. I think, there
should not be any allocations in this case.
As mentioned above, we may need to reallocate in this path. But it does
offend my sensibilities that this is the case, so I think it'd be fair
to change the del_dest function to return void and we'd just find
another way to do it.
Thanks,
--
Alex Gartrell <agartrell@xxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html