Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/11/2010 09:24 AM, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
>> Let me prep another version when I get back on Wed. (currently
>> travelling) with all the stuff we discussed to see how it would turn.
> 
> Ok, here's another version on top of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt the popcnt opcode, so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

On 64-bit it should be "popcnt %rdi, %rax".

> I also did some rudimentary tracing with the function graph tracer of
> all the cpumask_weight calls in <kernel/sched.c> while doing a kernel
> compile, and the preliminary results show that the longest software
> hweight call takes about 9.768 usecs, while the longest hardware popcnt
> call takes about 8.515 usecs. The machine is a Fam10 revB2 quadcore.
> 
> What remains to be done is see whether the saving/restoring of
> callee-clobbered regs with this patch has any noticeable negative
> effects on the software hweight case on machines which don't support
> popcnt. Also, I'm open for better tracing ideas :).
> 
> +	asm volatile(PUSH_CLOBBERED
> +		     ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> +		      POP_CLOBBERED
> +		     : "="ARG0 (res)
> +		     : ARG0 (w));


Sorry, no.

You don't do the push/pop inline -- if you're going to take the hit of
pushing this into the caller, it's better to list the registers as
explicit clobbers and let the compiler figure out how to save them.  The
point of doing an explicit push/pop is that it can be pushed into the
out-of-line subroutine.

Furthermore, you're still putting "volatile" on there... this is a pure
computation -- no side effects -- so it is exactly when you *shouldn't*
declare your asm statement volatile.
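
To illustrate the two points above, here is a minimal userspace sketch,
not the kernel patch itself. The names hweight64_sketch and
hweight64_fallback are hypothetical, and the fallback simply uses the
compiler builtin as a stand-in for the out-of-line __sw_hweight64. The asm
statement is deliberately not volatile, and the "D" and "a" constraints
pin the operands so that it assembles to "popcnt %rdi, %rax".

```c
#include <stdint.h>

/* Hypothetical stand-in for the out-of-line software routine. */
static inline uint64_t hweight64_fallback(uint64_t w)
{
	return (uint64_t)__builtin_popcountll(w);
}

/* Hypothetical name; a sketch, not the code from the patch. */
static inline uint64_t hweight64_sketch(uint64_t w)
{
#if defined(__x86_64__)
	if (__builtin_cpu_supports("popcnt")) {
		uint64_t res;

		/*
		 * No "volatile": the result depends only on w, so the
		 * compiler is free to merge duplicate calls or drop an
		 * unused one.  Input is pinned to %rdi, output to %rax.
		 */
		__asm__("popcnt %1, %0" : "=a" (res) : "D" (w) : "cc");
		return res;
	}
#endif
	return hweight64_fallback(w);
}
```

The runtime feature check here only stands in for the kernel's
ALTERNATIVE() patching, which this sketch does not attempt to reproduce.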

Note: given how simple and regular a popcnt actually is, it might be
preferable to have the out-of-line implementation either in assembly,
or using gcc's -fcall-saved-* options to reduce the number of registers
that are clobbered by the routine.
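
For reference, a software popcount really is short and regular. The
following is a sketch of the well-known parallel bit-counting method in
portable C; the name sw_hweight64_sketch is hypothetical, and the kernel's
actual __sw_hweight64 may be written differently.

```c
#include <stdint.h>

/* Hypothetical name; classic parallel bit count, branch-free. */
uint64_t sw_hweight64_sketch(uint64_t w)
{
	/* Each 2-bit field now holds the count of its two bits. */
	w -= (w >> 1) & 0x5555555555555555ULL;
	/* Sum adjacent 2-bit counts into 4-bit fields. */
	w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
	/* Sum adjacent 4-bit counts into bytes. */
	w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* The multiply adds all eight byte counts into the top byte. */
	return (w * 0x0101010101010101ULL) >> 56;
}
```

A routine like this needs only a couple of scratch registers, which is
what makes a hand-written assembly version or a reduced clobber set
plausible.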

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
