Hello, Andi.

Andi Kleen wrote:
> On Wed, Jul 01, 2009 at 07:21:57PM +0900, Tejun Heo wrote:
>>> using possible per cpu data I picked in current git: icmp.c
>> I was talking about percpu allocator proper. Yeap, the major work
>> would be in auditing and converting for_each_possible_cpu() users.
>
> and testing. that's the hard part. cpu hotplug is normally not well
> tested. Any code change that requires lots of new code for
> it will be a problem because that code will then likely bitrot.

It would be nice to have something to test cpu on/offlining
automatically - something which keeps bringing cpus up and down while
the system goes through stress testing.

[--snip--]

>> too difficult. I wouldn't know for sure before I actually try tho.
>
> I think it's clear that you haven't tried yet :)

No, I haven't yet. I had a pretty good idea about how to implement it
in the percpu allocator but haven't really looked at the users. So,
yeap, it's quite possible that I'm underestimating the problem. Oh
well, let's see. Thanks for the warnings.

> I wrote quite a few per cpu callback handlers over the years and
> in my experience they are all nasty code with subtle races. The problem
> is that instead of having a single subset init function which
> is just single threaded and doesn't need to worry about races
> you now have multi threaded init, which tends to be a can of worms.

I tried a couple (didn't end up sending them out) and yeah, they could
be quite painful. The bring-up part usually isn't as painful as the
other way around, tho.

> I think a far saner strategy than rewriting every user of DEFINE_PER_CPU,
> ending up with lots of badly tested code is to:

But I don't think it would be that drastic. Most users are quite
simple.

> - Fix the few large size percpu pigs that are problematic today to
> allocate in a callback.
> - Then once the per cpu data in all configurations is <200k (better
> <100k in the non debug builds) again just keep pre-allocating like we
> always did
> - Possibly adjust the vmalloc area on 32bit based on the possible
> CPU map at the cost of the direct mapping, to make sure there's
> always enough mapping space.

I think it's something we eventually need to do. There already are
cases where the lack of a scalable and performant percpu allocator
leads to design restrictions, and between many-core cpus and
virtualization the requirements are becoming more varied.
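
As for the automatic on/offlining test above, even something as dumb
as the following (made-up, untested userspace sketch; needs root and
leaves cpu0 alone since it often can't be offlined) running next to a
normal stress test would probably catch a lot:

/*
 * Dumb cpu hotplug exerciser: keeps offlining and onlining
 * cpu1..N-1 through sysfs in an endless loop.
 */
#include <stdio.h>
#include <unistd.h>

static void set_online(int cpu, int online)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (!f)
                return;
        fprintf(f, "%d\n", online);
        fclose(f);
}

int main(void)
{
        int ncpus = sysconf(_SC_NPROCESSORS_CONF);
        int cpu;

        for (;;) {
                for (cpu = 1; cpu < ncpus; cpu++) {
                        set_online(cpu, 0);
                        usleep(100 * 1000);
                        set_online(cpu, 1);
                        usleep(100 * 1000);
                }
        }
        return 0;
}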
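
The callback handlers Andi mentions usually end up looking something
like the sketch below (foo_stats is made up; this is just the general
shape with the current hotplug notifier). The alloc/free part itself
is easy - the subtle races tend to hide in the init path, between
setting up the already-online cpus and registering the notifier:

#include <linux/cpu.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/notifier.h>

struct foo_stats {
        unsigned long events;
};

static DEFINE_PER_CPU(struct foo_stats *, foo_stats);

static int __cpuinit foo_cpu_callback(struct notifier_block *nfb,
                                      unsigned long action, void *hcpu)
{
        unsigned int cpu = (unsigned long)hcpu;

        switch (action) {
        case CPU_UP_PREPARE:
        case CPU_UP_PREPARE_FROZEN:
                /* runs in process context before the cpu comes up */
                per_cpu(foo_stats, cpu) = kzalloc(sizeof(struct foo_stats),
                                                  GFP_KERNEL);
                if (!per_cpu(foo_stats, cpu))
                        return NOTIFY_BAD;
                break;
        case CPU_UP_CANCELED:
        case CPU_UP_CANCELED_FROZEN:
        case CPU_DEAD:
        case CPU_DEAD_FROZEN:
                kfree(per_cpu(foo_stats, cpu));
                per_cpu(foo_stats, cpu) = NULL;
                break;
        }
        return NOTIFY_OK;
}

/*
 * Registered with register_hotcpu_notifier() at init time; the
 * already-online cpus have to be set up by hand before that, and
 * that window is where the races usually live.
 */
static struct notifier_block __cpuinitdata foo_cpu_notifier = {
        .notifier_call = foo_cpu_callback,
};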
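
And to be a bit more concrete about "most users are quite simple" - a
lot of for_each_possible_cpu() users are just folding per-cpu
statistics, along the lines of this made-up counter:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, foo_count);

static void foo_count_inc(void)
{
        get_cpu_var(foo_count)++;
        put_cpu_var(foo_count);
}

static unsigned long foo_count_read(void)
{
        unsigned long sum = 0;
        int cpu;

        /* walk possible cpus so counts from offlined cpus aren't lost */
        for_each_possible_cpu(cpu)
                sum += per_cpu(foo_count, cpu);
        return sum;
}

For users like that the audit is mostly about deciding what happens to
the counts of a cpu which goes away - keep its per-cpu area around or
fold it into a global somewhere - rather than any deep surgery.

Thanks.

-- 
tejun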