On Sat, 2010-05-15 at 22:58 +0800, Ryan Wang wrote:
> Hi,
>
> Recently I was reading some code about filesystems, and I found
> the usage of struct blockgroup_lock is confusing. I have 3 questions
> about it, and please help give me some instructions on them.
>
> http://lxr.linux.no/#linux+v2.6.32/include/linux/blockgroup_lock.h
>
> 1) What is the relationship between NR_CPUS and NR_BG_LOCKS?
> 2) And how to understand the usage of struct blockgroup_lock?
> 3) What is the relationship between block_group and NR_BG_LOCKS, e.g.
> in bgl_lock_ptr?
>
> static inline spinlock_t *
> bgl_lock_ptr(struct blockgroup_lock *bgl, unsigned int block_group)
> {
>         return &bgl->locks[(block_group) & (NR_BG_LOCKS-1)].lock;
> }

I'm no expert, but your question sounded interesting so I took a look.

From my reading of the code, the BG_LOCKS stuff is a "heuristic" to
improve performance on multi-CPU systems.

An ext2/ext3 filesystem keeps information about the free (unused)
blocks in the filesystem. When a thread calls into the filesystem to
allocate a new block for a file (or allocate an inode, or release an
existing used block back onto the freelist), it must modify that
freelist info. And to avoid race conditions, it must take a lock before
modifying the structure. Presumably, on a system running multiple
threads this operation can be contended, i.e. throughput is harmed by
threads waiting to obtain locks.

So (as I read it), ext2 (and ext3) split their freelist information
into N independent pieces ("groups"). Ideally, there would then be a
separate lock for each group, and N threads would then be able to
allocate blocks from the filesystem in parallel, where N = number of
groups. But locks do take up memory, and there is no point in having N
locks when the number of threads actually doing filesystem ops in
parallel is much less than N.

So AIUI this header is effectively a "heuristic" that attempts to
allocate a reasonable number of locks, so that performance is not
damaged by too few locks but memory is not wasted by having too many
locks. A modulo operation then maps a "group" number onto a lock; the
"NR_BG_LOCKS must be power of two" comment is there so that a bitwise
AND can be used as the modulo operation.

The header defines:

static inline spinlock_t *
bgl_lock_ptr(struct blockgroup_lock *bgl, unsigned int block_group)
{
        return &bgl->locks[(block_group) & (NR_BG_LOCKS-1)].lock;
}

So when NR_CPUS=2, NR_BG_LOCKS is 8. Calling bgl_lock_ptr for group #22
(freelist fragment #22) will then return lock #6 (22 % 8 = 6), while
operations on group #23 use lock #7. As they use different locks, the
freespace-update operations can occur in parallel. When NR_CPUS=4096,
128 locks are allocated; more memory is therefore needed to hold the
locks, but presumably on a large system the reduced lock contention
will be worth the memory use. Of course this assumes that each thread
chooses a different freelist "group" - but even if they chose randomly,
performance would be better than with just one lock.

IMO, using NR_CPUS (a compile-time constant) isn't a very elegant way
to do this; perhaps some kind of "on demand" system, where the number
of locks increases depending on the actual amount of contention
occurring, would be nicer. But of course that takes overhead to measure
and adapt. For a start, NR_CPUS doesn't indicate how many CPUs are
actually online in the system. But I presume that in practice the
hard-wired approach works well enough; currently, anybody with a large
system probably recompiles their own kernel with a suitable value for
NR_CPUS anyway.
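To make the NR_CPUS -> NR_BG_LOCKS relationship concrete: the sizing
logic in that header is just a compile-time ladder picking a
power-of-two lock count from NR_CPUS, with each lock padded out to its
own cache line. This is from my memory of the 2.6.32 header, so check
LXR for the exact text, but it is roughly:

#include <linux/spinlock.h>
#include <linux/cache.h>

#ifdef CONFIG_SMP

/* NR_BG_LOCKS must be a power of two, so that the AND in
 * bgl_lock_ptr() acts as a modulo. */
#if NR_CPUS >= 32
#define NR_BG_LOCKS	128
#elif NR_CPUS >= 16
#define NR_BG_LOCKS	64
#elif NR_CPUS >= 8
#define NR_BG_LOCKS	32
#elif NR_CPUS >= 4
#define NR_BG_LOCKS	16
#elif NR_CPUS >= 2
#define NR_BG_LOCKS	8
#else
#define NR_BG_LOCKS	4
#endif

#else	/* !CONFIG_SMP */
#define NR_BG_LOCKS	1
#endif

/* Each lock gets its own cache line, so CPUs hammering on
 * adjacent locks don't false-share. */
struct bgl_lock {
	spinlock_t lock;
} ____cacheline_aligned_in_smp;

struct blockgroup_lock {
	struct bgl_lock locks[NR_BG_LOCKS];
};

So the answer to question 1 is simply: NR_BG_LOCKS is a pure
compile-time function of NR_CPUS, clamped to the range 1..128.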
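And since the mask trick is easy to get wrong by one, here's a trivial
userspace sketch of the same group-to-lock mapping, with NR_BG_LOCKS
hard-wired to 8 (the NR_CPUS=2 case). Nothing kernel-specific here -
the bgl_lock_index() helper is just my own rename of the array-index
expression from bgl_lock_ptr():

#include <stdio.h>

#define NR_BG_LOCKS 8	/* must be a power of two */

/* same expression as in bgl_lock_ptr(), minus the struct access */
static unsigned int bgl_lock_index(unsigned int block_group)
{
	return block_group & (NR_BG_LOCKS - 1);
}

int main(void)
{
	unsigned int group;

	for (group = 20; group <= 24; group++)
		printf("group %u -> lock %u\n", group, bgl_lock_index(group));
	return 0;
}

That prints locks 4, 5, 6, 7, 0 for groups 20..24 - i.e. group 22 gets
lock 6 and group 23 gets lock 7 as described above, and the numbering
wraps around at NR_BG_LOCKS.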
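As for how it actually gets used (question 2): the filesystem embeds a
struct blockgroup_lock in its per-superblock info, initialises it once
at mount time, and then brackets every per-group freespace update with
the hashed lock. A minimal sketch of that pattern follows - note the
variable names here are made up for illustration; the real callers live
in fs/ext2 and fs/ext3, wrapped in a small sb_bgl_lock() helper if I
remember right:

	struct blockgroup_lock bgl;	/* lives in the fs's sb_info */
	spinlock_t *lock;

	bgl_lock_init(&bgl);		/* once, at mount time */

	/* when allocating or freeing a block in block_group: */
	lock = bgl_lock_ptr(&bgl, block_group);

	spin_lock(lock);
	/* ... modify this group's block bitmap / free counters ... */
	spin_unlock(lock);

Two threads touching different groups will (usually) hash to different
spinlocks and not contend; two threads touching the same group
serialise on the same lock, which is exactly what correctness requires.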
It would be interesting to know how many "groups" ext2 divides its
freespace into. Does it also choose a value based on the size of the
server the filesystem was created on?

Warning: I could be wrong about any of the above; this is just my
understanding from a brief look at the code.

Cheers
Simon