Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Sat, 16 Oct 2010 02:07:44 -0700

On Sat, 16 Oct 2010 10:29:08 +0200 Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:

> Le samedi 16 octobre 2010 __ 18:55 +1100, Nick Piggin a __crit :
> > On Thu, Sep 30, 2010 at 04:10:39PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 29, 2010 at 09:53:22PM -0700, Andrew Morton wrote:
> > > > On Wed, 29 Sep 2010 22:18:48 +1000 Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > 
> > > > > From: Eric Dumazet <dada1@xxxxxxxxxxxxx>
> > > > > 
> > > > > The number of inodes allocated does not need to be tied to the
> > > > > addition or removal of an inode to/from a list. If we are not tied
> > > > > to a list lock, we could update the counters when inodes are
> > > > > initialised or destroyed, but to do that we need to convert the
> > > > > counters to be per-cpu (i.e. independent of a lock). This means that
> > > > > we have the freedom to change the list/locking implementation
> > > > > without needing to care about the counters.
> > > > > 
> > > > >
> > > > > ...
> > > > >
> > > > > +int get_nr_inodes(void)
> > > > > +{
> > > > > +	int i;
> > > > > +	int sum = 0;
> > > > > +	for_each_possible_cpu(i)
> > > > > +		sum += per_cpu(nr_inodes, i);
> > > > > +	return sum < 0 ? 0 : sum;
> > > > > +}
> > > > 
> > > > This reimplements percpu_counter_sum_positive(), rather poorly
> > 
> > Why is it poorly?
> 
> Nick
> 
> Some people believe percpu_counter object is the right answer to such
> distributed counters, because the loop is done on 'online' cpus instead
> of 'possible' cpus. "It must be better if number of possible cpus is
> 4096 and only one or two cpus are online"...
> 
> But if we do this loop only on rare events, like
> "cat /proc/sys/fs/inode-nr", then the percpu_counter() is more
> expensive, because percpu_add() _is_ more expensive :
> 
> - Its a function call and lot of instructions/cycles per call, while
> this_cpu_inc(nr_inodes) is a single instruction, using no register on
> x86.

You want an inlined percpu_counter_inc() then write one!  Bonus points
for writing this_cpu_add_return() and doing it without a
preempt_disable().  It collapses to just a few instructions.

It's extremely poor form to say "oh X sucks so I'm going to implement
my own" without first addressing why X allegedly sucks.

> - Its possibly accessing a shared spinlock and counter when the percpu
> counter reaches the batch limit.

That's in the noise foor.

> To recap : nr_inodes is not a counter that needs to be estimated in real
> time, since we have not limit on number of inodes in the machine (limit
> is the memory allocator).
> 
> Unless someone can prove "cat /proc/sys/fs/inode-nr" must be performed
> thousand of times per second on their setup, the choice I made to scale
> nr_inodes is better over the 'obvious percpu_counter choice'
> 
> This choice was made to scale some counters in network stack some years
> ago, and this rocks.

And we get open-coded reimplementations of the same damn thing all over
the tree.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html