	Hello,

On Thu, 23 May 2013, Simon Kirby wrote:

> > 	Last time I also checked the assembler output from x86-32
> > and it looked good. I used 'make net/netfilter/ipvs/ip_vs_wlc.s'
> > to generate asm output, note the 's' extension.
> >
> > 	Maybe the problem in your ipvs.c file comes from the
> > fact that __u64 is unsigned long long for 32-bit platforms.
> > I changed the code to use 'unsigned long' as follows:
> >
> > #if 0
> > typedef unsigned long long __u64;
> > #else
> > typedef unsigned long __u64;
> > #endif
> >
> > and the x86-32 platform works correctly.
> > x86-64 works for both cases: 'unsigned long long' and
> > 'unsigned long', but x86-32 generates many mul operations
> > for the 'unsigned long long' case, which is not used in
> > the kernel. That is why I didn't notice such a problem.
> >
> > 	So, I think we should be safe by adding
> > (__u64) without any (__u32); in the next days I'll check
> > the asm output again. Maybe (__u32) before weight ensures a
> > 32x32 multiply, but it should not be needed.
>
> Hmm, I was comparing atomic_t being s32 versus u32, not u64 being u64. :)
> Anyway, the .s results are much easier to read, and (closer to) reality!
> I did a comparison with (__u64)loh * atomic_read(dest->weight) versus
> (__u64)loh * (__u32)atomic_read(dest->weight) on both arches and uploaded
> them to http://0x.ca/sim/ref/3.9-ipvs/. It's not a huge difference, but I
> prefer the shorter/faster version. ;)

	I now see why your patch shows a difference compared to my
tests a month ago. This change is the culprit:

-	int loh, doh;
+	unsigned int loh, doh;

	It effectively changes the operation from:

	(__u64/__s64) int * int

into

	(__u64) unsigned int * int

	That is why you fix it by using __u32:

	(__u64) unsigned int * unsigned int

so that both operands are 4-byte values of the same signedness.

	I think we should keep loh and doh as int; both of the
following solutions should generate a 32x32 multiply:

1. Same as in my first email:

	int loh, doh;
	(__u64/__s64) loh * atomic_read(&dest->weight)

In this case I see only one difference between __u64 and __s64:

-	jb	.L41	#,
-	ja	.L79	#,
+	jl	.L41	#,
+	jg	.L79	#,

2. Your patch:

	unsigned int loh, doh;
	(__u64) loh * (__u32) atomic_read(&dest->weight)
or
	(__s64) loh * (__u32) atomic_read(&dest->weight)

	Both solutions generate code that differs only in imul vs. mul.
From what I see on the internet, imul is preferred/faster than mul.
That is why I prefer solution 1: it has fewer casts.

	So, I think you can change your patch as follows:

1. Use int for loh and doh. Note that some schedulers use
'unsigned int' and should be patched for this definition:
NQ, SED, WLC

2. Use only the (__u64) prefix, no (__u32) before atomic_read:
LBLC, LBLCR, NQ, SED, WLC

	(__u64) loh * atomic_read(&dest->weight) ...
	(__u64) doh * ...

3. Explain in the commit message that we find result64 = int32 * int32
faster than result64 = uint32 * uint32 and far better than using a
64x64 multiply, which is a bit slower on older CPUs.

Regards

--
Julian Anastasov <ja@xxxxxx>
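
	In case someone wants to reproduce the imul/mul comparison
outside the kernel tree, here is a minimal user-space sketch of the
three operand combinations discussed above. The function names and the
stdint.h types are only for illustration (the kernel code uses
__u64/__s64 and atomic_read(&dest->weight)); build with
'gcc -m32 -O2 -S' and inspect the generated asm:

#include <stdint.h>

/* (a) solution 1: signed 32-bit operands, only the result is 64-bit.
 * GCC recognizes the pattern and emits a single widening imul on x86-32.
 */
int64_t overhead_signed(int loh, int weight)
{
	return (int64_t)loh * weight;
}

/* (b) solution 2: unsigned 32-bit operands, 64-bit result.
 * Also a single widening multiply, but mul instead of imul.
 */
uint64_t overhead_unsigned(unsigned int loh, unsigned int weight)
{
	return (uint64_t)loh * weight;
}

/* (c) the mixed case introduced by the int -> unsigned int change:
 * one operand is zero-extended, the other sign-extended, so no single
 * 32x32->64 multiply fits and the compiler emits extra 64-bit code.
 */
uint64_t overhead_mixed(unsigned int loh, int weight)
{
	return (uint64_t)loh * weight;
}

Variant (a) corresponds to solution 1, (b) to the patched code with the
(__u32) cast, and (c) to the change that caused the difference in the
first place.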