* Heiko Carstens <heiko.carstens@xxxxxxxxxx> wrote:

> An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 4.3.2
> results in the following vmlinux sizes:
>
>   574152755 2009-08-16 20:05 vmlinux-orig
>   574862000 2009-08-16 20:13 vmlinux-skb-inline
>
> The skb-inline variant was created by using your patch and on top
> of it the patch below. Size increased by 692KB (0.12%).
>
> Dunno.. is that acceptable?

i ported your patch to latest mainline (see it attached below) and did
x86(-64) defconfig measurements (with CONFIG_OPTIMIZE_INLINING
disabled) of the resulting vmlinux:

     text     data     bss       dec     hex  filename
  8455329  1192304  989908  10637541  a250e5  vmlinux.before
  8471036  1192304  989908  10653248  a28e40  vmlinux.after

That's a +0.18% text size increase - quite significant.

I also did performance measurements, trying to figure out whether this
increase in size can be measured via any runtime performance metric
such as cycles, instructions or cache-misses. To do that i picked a
micro-benchmark, lat_tcp.

( Note that micro-benchmarks are biased in favor of inlining patches
  like yours, because they disproportionately show the speed benefit
  while not adequately showing the cache footprint disadvantage of
  increased kernel text size. )

Without your patch i got:

 titan:~/l> perf stat --repeat 10 -e task-clock -e cycles -e instructions -e L1-icache-refs -e L1-icache-misses ./lat_tcp localhost
 TCP latency using localhost: 17.5198 microseconds
 TCP latency using localhost: 17.5060 microseconds
 TCP latency using localhost: 17.4684 microseconds
 TCP latency using localhost: 17.5416 microseconds
 TCP latency using localhost: 17.4881 microseconds
 TCP latency using localhost: 17.4832 microseconds
 TCP latency using localhost: 17.4830 microseconds
 TCP latency using localhost: 17.4723 microseconds
 TCP latency using localhost: 17.4733 microseconds
 TCP latency using localhost: 17.5255 microseconds

 Performance counter stats for './lat_tcp localhost' (10 runs):

    1494.625043  task-clock-msecs       #     0.564 CPUs     ( +-  0.077% )
     3665868023  cycles                 #  2452.701 M/sec    ( +-  0.070% )
     2163212606  instructions           #     0.590 IPC      ( +-  0.090% )
     2449056216  L1-icache-loads        #  1638.576 M/sec    ( +-  0.056% )
       61081182  L1-icache-load-misses  #    40.867 M/sec    ( +-  0.192% )

    2.649159316  seconds time elapsed   ( +-  0.090% )

With your patch i got:

 titan:~/l> perf stat --repeat 10 -e task-clock -e cycles -e instructions -e L1-icache-refs -e L1-icache-misses ./lat_tcp localhost
 TCP latency using localhost: 17.6913 microseconds
 TCP latency using localhost: 17.7218 microseconds
 TCP latency using localhost: 17.6882 microseconds
 TCP latency using localhost: 17.7196 microseconds
 TCP latency using localhost: 17.6411 microseconds
 TCP latency using localhost: 17.7136 microseconds
 TCP latency using localhost: 17.6488 microseconds
 TCP latency using localhost: 17.6663 microseconds
 TCP latency using localhost: 17.7445 microseconds
 TCP latency using localhost: 17.6851 microseconds

 Performance counter stats for './lat_tcp localhost' (10 runs):

    1497.479984  task-clock-msecs       #     0.564 CPUs     ( +-  0.106% )
     3690097505  cycles                 #  2464.205 M/sec    ( +-  0.109% )
     2162432402  instructions           #     0.586 IPC      ( +-  0.127% )
     2437974192  L1-icache-loads        #  1628.051 M/sec    ( +-  0.105% )
       63064588  L1-icache-load-misses  #    42.114 M/sec    ( +-  0.278% )

    2.656952976  seconds time elapsed   ( +-  0.110% )

Summarized (for lat_tcp):

| Before:
|
|  1494.625043  task-clock-msecs  #     0.564 CPUs   ( +-  0.077% )
|   3665868023  cycles            #  2452.701 M/sec  ( +-  0.070% )
|   2163212606  instructions      #     0.590 IPC    ( +-  0.090% )
|
| After:
|
|  1497.479984  task-clock-msecs  #     0.564 CPUs   ( +-  0.106% )
|   3690097505  cycles            #  2464.205 M/sec  ( +-  0.109% )
|   2162432402  instructions      #     0.586 IPC    ( +-  0.127% )
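( In relative terms that is 1497.479984 / 1494.625043 = 1.0019, i.e. a
  +0.19% increase in task-clock, against a per-run noise of +- 0.077%
  before and +- 0.106% after. )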
So we've got a 0.2% slowdown, which is provably outside the noise level
of 0.1%. The instruction count remained almost the same.

The reason for the slowdown can be seen in the L1 icache-fetch stats:

   2449056216  L1-icache-loads        #  1638.576 M/sec    ( +-  0.056% )
     61081182  L1-icache-load-misses  #    40.867 M/sec    ( +-  0.192% )

   2437974192  L1-icache-loads        #  1628.051 M/sec    ( +-  0.105% )
     63064588  L1-icache-load-misses  #    42.114 M/sec    ( +-  0.278% )

The icache-misses went up from 40.867 M/sec to 42.114 M/sec - a 3.05%
worsening, well beyond the statistical significance of 0.28%.

So unless i botched the measurements, or some accidental code layout
differences punish the patch artificially, this patch is a small but
measurable performance regression on x86.

The slowdown itself is barely provable with a test repeat count of 10
iterations on an otherwise totally idle system ( not a surprise really
- such patches were not really provable before, via /usr/bin/time kinds
of measurements ). The icache-miss stats do support the notion that
what is causing the slowdown is the critical path falling out of the
icache.
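To illustrate the mechanism, here is a standalone toy sketch - made-up
names, not the skbuff code itself - of the same bounds-checked append,
once as an out-of-line function and once as a static inline. Every call
site of the inline variant gets its own copy of the body, including the
cold panic branch, so N call sites cost roughly N copies of text, while
the out-of-line variant costs one copy plus a call instruction per
site:

  #include <stdio.h>
  #include <stdlib.h>

  struct buf {
          unsigned char *tail;
          unsigned char *end;
  };

  /* Out-of-line: a single copy of the body in .text, a call per site. */
  unsigned char *buf_put_outline(struct buf *b, unsigned int len)
  {
          unsigned char *tmp = b->tail;

          b->tail += len;
          if (b->tail > b->end) {         /* the cold panic path */
                  fprintf(stderr, "buf overflow by %u bytes\n", len);
                  abort();
          }
          return tmp;
  }

  /* Inline: the compiler duplicates this body at every call site. */
  static inline unsigned char *buf_put_inline(struct buf *b, unsigned int len)
  {
          unsigned char *tmp = b->tail;

          b->tail += len;
          if (b->tail > b->end) {
                  fprintf(stderr, "buf overflow by %u bytes\n", len);
                  abort();
          }
          return tmp;
  }

  int main(void)
  {
          unsigned char storage[64];
          struct buf b = { storage, storage + sizeof(storage) };

          /* Two call sites: the inline variant emits the check twice, */
          buf_put_inline(&b, 16);
          buf_put_inline(&b, 16);
          /* ... while the out-of-line variant emits a call instruction. */
          buf_put_outline(&b, 16);

          return 0;
  }

( In the kernel the out-of-line copies live in net/core/skbuff.c, a
  separate translation unit, which is what keeps them from being
  duplicated into the many skb_put()/skb_push() call sites all over
  the tree. )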
( By all means please repeat these measurements - they are very easy
  to do via 'perf stat --repeat 10' - see the examples above. )

To build the kernel i used gcc 4.3.2. The CPU i used is friendly to
inlining patches as well - it's an Extreme Edition Core2 Duo (booted
to a single core, to stabilize the numbers):

  Intel(R) Core(TM)2 CPU  E6800  @ 2.93GHz

with 4MB of L2 cache. I suspect smaller CPUs would feel the negative
effects of inlining in a more pronounced way.

Thanks,

	Ingo

---
 include/linux/skbuff.h |   37 +++++++++++++++++++++++++++++++++++--
 net/core/skbuff.c      |   40 ----------------------------------------
 2 files changed, 35 insertions(+), 42 deletions(-)

Index: linux/include/linux/skbuff.h
===================================================================
--- linux.orig/include/linux/skbuff.h
+++ linux/include/linux/skbuff.h
@@ -1120,7 +1120,6 @@ static inline void skb_set_tail_pointer(
 /*
  * Add data to an sk_buff
  */
-extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
 {
 	unsigned char *tmp = skb_tail_pointer(skb);
@@ -1130,7 +1129,24 @@ static inline unsigned char *__skb_put(s
 	return tmp;
 }
 
-extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
+/**
+ * skb_put - add data to a buffer
+ * @skb: buffer to use
+ * @len: amount of data to add
+ *
+ * This function extends the used data area of the buffer. If this would
+ * exceed the total buffer size the kernel will panic. A pointer to the
+ * first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
+{
+	unsigned char *tmp = __skb_put(skb, len);
+
+	if (unlikely(skb->tail > skb->end))
+		skb_over_panic(skb, len, __builtin_return_address(0));
+	return tmp;
+}
+
 static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
 {
 	skb->data -= len;
@@ -1138,6 +1154,23 @@ static inline unsigned char *__skb_push(
 	return skb->data;
 }
 
+/**
+ * skb_push - add data to the start of a buffer
+ * @skb: buffer to use
+ * @len: amount of data to add
+ *
+ * This function extends the used data area of the buffer at the buffer
+ * start. If this would exceed the total buffer headroom the kernel will
+ * panic. A pointer to the first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
+{
+	__skb_push(skb, len);
+	if (unlikely(skb->data<skb->head))
+		skb_under_panic(skb, len, __builtin_return_address(0));
+	return skb->data;
+}
+
 extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 {
Index: linux/net/core/skbuff.c
===================================================================
--- linux.orig/net/core/skbuff.c
+++ linux/net/core/skbuff.c
@@ -1000,46 +1000,6 @@ free_skb:
 EXPORT_SYMBOL(skb_pad);
 
 /**
- * skb_put - add data to a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer. If this would
- * exceed the total buffer size the kernel will panic. A pointer to the
- * first byte of the extra data is returned.
- */
-unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
-{
-	unsigned char *tmp = skb_tail_pointer(skb);
-	SKB_LINEAR_ASSERT(skb);
-	skb->tail += len;
-	skb->len += len;
-	if (unlikely(skb->tail > skb->end))
-		skb_over_panic(skb, len, __builtin_return_address(0));
-	return tmp;
-}
-EXPORT_SYMBOL(skb_put);
-
-/**
- * skb_push - add data to the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer at the buffer
- * start. If this would exceed the total buffer headroom the kernel will
- * panic. A pointer to the first byte of the extra data is returned.
- */
-unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
-{
-	skb->data -= len;
-	skb->len += len;
-	if (unlikely(skb->data<skb->head))
-		skb_under_panic(skb, len, __builtin_return_address(0));
-	return skb->data;
-}
-EXPORT_SYMBOL(skb_push);
-
-/**
  * skb_pull - remove data from the start of a buffer
  * @skb: buffer to use
  * @len: amount of data to remove