On Sat, 2010-05-15 at 22:58 +0800, Ryan Wang wrote:
> Hi,
>
> Recently I was reading some code about filesystems, and I found
> the usage of struct blockgroup_lock is confusing. I have 3 questions
> about it, and please help give me some instructions on them.
>
> http://lxr.linux.no/#linux+v2.6.32/include/linux/blockgroup_lock.h
>
> 1) What is the relationship between NR_CPUS and NR_BG_LOCKS?
> 2) And how to understand the usage of struct blockgroup_lock?
> 3) What is the relationship between block_group and NR_BG_LOCKS, e.g.
> in bgl_lock_ptr?
>
> static inline spinlock_t *
> bgl_lock_ptr(struct blockgroup_lock *bgl, unsigned int block_group)
> {
>         return &bgl->locks[(block_group) & (NR_BG_LOCKS-1)].lock;
> }

I'm no expert, but your question sounded interesting so I took a look.

From my reading of the code, the BG_LOCKS stuff is a "heuristic" to
improve performance on multi-CPU systems.

An ext2/ext3 filesystem keeps information about the free (unused)
blocks in the filesystem. When a thread calls into the filesystem to
allocate a new block for a file (or allocate an inode, or release an
existing used block back onto the freelist), it must modify that
freelist info. And to avoid race conditions, it must take a lock before
modifying the structure. Presumably, on a system running multiple
threads this operation can be contended, i.e. throughput is harmed by
threads waiting to obtain locks.

So (as I read it), ext2 (and ext3) split their freelist information
into N independent pieces ("groups"). Ideally, there would then be a
separate lock for each group, and N threads would then be able to
allocate blocks from the filesystem in parallel, where N = number of
groups. But locks do take up memory, and there is no point in having N
locks when the number of threads actually doing filesystem ops in
parallel is much less than N.

So AIUI this header is effectively a "heuristic" that attempts to
allocate a reasonable number of locks, so that performance is not
damaged by too few locks but memory is not wasted by having too many
locks. A modulo operation then maps a "group" number onto a lock; the
"NR_BG_LOCKS must be power of two" comment is there so that a bitwise
AND can be used as the modulo operation.

The header defines:

static inline spinlock_t *
bgl_lock_ptr(struct blockgroup_lock *bgl, unsigned int block_group)
{
        return &bgl->locks[(block_group) & (NR_BG_LOCKS-1)].lock;
}

So when NR_CPUS=2, NR_BG_LOCKS is 8. Calling bgl_lock_ptr for group #22
(freelist fragment #22) will then return lock #6 (22 % 8 = 6), while
operations on group #23 use lock #7. As they use different locks, the
freespace-update operations can occur in parallel. When NR_CPUS=4096,
128 locks are allocated; more memory is therefore needed to hold the
locks, but presumably on a large system the reduced lock contention
will be worth the memory use. Of course this assumes that each thread
chooses a different freelist "group" - but even if they chose randomly,
performance would be better than with just one lock.

IMO, using NR_CPUS (a compile-time constant) isn't a very elegant way
to do this; perhaps some kind of "on demand" system, where the number
of locks increases depending on the actual amount of contention
occurring, would be nicer. But of course that takes overhead to measure
and adapt. For a start, NR_CPUS doesn't indicate how many CPUs are
actually online in the system. But I presume that in practice the
hard-wired approach works well enough; currently, anybody with a large
system probably recompiles their own kernel with a suitable value for
NR_CPUS anyway.
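To make the NR_CPUS -> NR_BG_LOCKS relationship concrete: the sizing
logic in that header is just a compile-time ladder picking a
power-of-two lock count from NR_CPUS, with each lock padded out to its
own cache line. This is from my memory of the 2.6.32 header, so check
LXR for the exact text, but it is roughly:

#include <linux/spinlock.h>
#include <linux/cache.h>

#ifdef CONFIG_SMP

/* NR_BG_LOCKS must be a power of two, so that the AND in
 * bgl_lock_ptr() acts as a modulo. */
#if NR_CPUS >= 32
#define NR_BG_LOCKS	128
#elif NR_CPUS >= 16
#define NR_BG_LOCKS	64
#elif NR_CPUS >= 8
#define NR_BG_LOCKS	32
#elif NR_CPUS >= 4
#define NR_BG_LOCKS	16
#elif NR_CPUS >= 2
#define NR_BG_LOCKS	8
#else
#define NR_BG_LOCKS	4
#endif

#else	/* !CONFIG_SMP */
#define NR_BG_LOCKS	1
#endif

/* Each lock gets its own cache line, so CPUs hammering on
 * adjacent locks don't false-share. */
struct bgl_lock {
	spinlock_t lock;
} ____cacheline_aligned_in_smp;

struct blockgroup_lock {
	struct bgl_lock locks[NR_BG_LOCKS];
};

So the answer to question 1 is simply: NR_BG_LOCKS is a pure
compile-time function of NR_CPUS, clamped to the range 1..128.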
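And since the mask trick is easy to get wrong by one, here's a trivial
userspace sketch of the same group-to-lock mapping, with NR_BG_LOCKS
hard-wired to 8 (the NR_CPUS=2 case). Nothing kernel-specific here -
the bgl_lock_index() helper is just my own rename of the array-index
expression from bgl_lock_ptr():

#include <stdio.h>

#define NR_BG_LOCKS 8	/* must be a power of two */

/* same expression as in bgl_lock_ptr(), minus the struct access */
static unsigned int bgl_lock_index(unsigned int block_group)
{
	return block_group & (NR_BG_LOCKS - 1);
}

int main(void)
{
	unsigned int group;

	for (group = 20; group <= 24; group++)
		printf("group %u -> lock %u\n", group, bgl_lock_index(group));
	return 0;
}

That prints locks 4, 5, 6, 7, 0 for groups 20..24 - i.e. group 22 gets
lock 6 and group 23 gets lock 7 as described above, and the numbering
wraps around at NR_BG_LOCKS.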
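As for how it actually gets used (question 2): the filesystem embeds a
struct blockgroup_lock in its per-superblock info, initialises it once
at mount time, and then brackets every per-group freespace update with
the hashed lock. A minimal sketch of that pattern follows - note the
variable names here are made up for illustration; the real callers live
in fs/ext2 and fs/ext3, wrapped in a small sb_bgl_lock() helper if I
remember right:

	struct blockgroup_lock bgl;	/* lives in the fs's sb_info */
	spinlock_t *lock;

	bgl_lock_init(&bgl);		/* once, at mount time */

	/* when allocating or freeing a block in block_group: */
	lock = bgl_lock_ptr(&bgl, block_group);

	spin_lock(lock);
	/* ... modify this group's block bitmap / free counters ... */
	spin_unlock(lock);

Two threads touching different groups will (usually) hash to different
spinlocks and not contend; two threads touching the same group
serialise on the same lock, which is exactly what correctness requires.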
It would be interesting to know how many "groups" ext2 divides its
freespace into. Does it also choose a value based on the size of the
server the filesystem was created on?

Warning: I could be wrong about any of the above; this is just my
understanding from a brief look at the code.

Cheers
Simon