Re: Multicast delays and high iowait

"H. Willstrand" <h.willstrand@xxxxxxxxx> · Tue, 1 Apr 2008 22:24:19 +0200

On Tue, Apr 1, 2008 at 6:05 PM, Matt Garman <matthew.garman@xxxxxxxxx> wrote:
> We're using multicast basically for some inter-process
>  communication.
>

Which protocol(s) are in use? (UDP, IGMP, ...)

>  We timestamp (and log, in a separate thread) all of our sends and
>  receives, and do analysis on the logs.
>

Are the timestamps sent inside the multicast messages? If so, could the
sender's and receiver's clocks be out of sync, producing the apparent
"delays"?
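
One way to rule out clock skew is to look only at the receiver's own
clock. A minimal sketch, assuming the sender emits at a roughly constant
interval and the receive log is a list of integer millisecond
timestamps (the function name and log format are hypothetical, not
taken from your code):

```python
def find_delays(recv_times_ms, expected_interval_ms, threshold_ms):
    """Return (index, gap_ms) for every inter-arrival gap that exceeds
    the expected send interval by more than threshold_ms.

    Only receiver-side timestamps are used, so clock skew between the
    sending and receiving hosts cannot produce a false "delay".
    """
    delays = []
    for i in range(1, len(recv_times_ms)):
        gap = recv_times_ms[i] - recv_times_ms[i - 1]
        if gap - expected_interval_ms > threshold_ms:
            delays.append((i, gap))
    return delays


# Messages expected every 100 ms; one of them arrives 700 ms after its
# predecessor.
print(find_delays([0, 100, 200, 900, 1000], 100, 200))  # [(3, 700)]
```

If the blips still show up in this receiver-only view, they are not an
artifact of unsynchronized clocks.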

>  We're finding occasional (once or twice a day) "blips" where the
>  receipt of multicast messages is delayed anywhere from 200
>  milliseconds to three or four whole seconds.
>
>  In one case, we have only one server in the network, and are still
>  seeing this.  In this scenario, do the multicast messages actually
>  use the physical network?
>
>  I'm running sar on these machines (collecting data every five
>  seconds); any delay >600 ms seems to coincide with extremely high
>  iowait (but the load on any CPU during these times is always below
>  1.0).
>
>  We have the sysctl net.core.rmem_max parameter set to 33554432.
>
>  Our code uses setsockopt() to set the receiving buffer to the
>  maximum size allowed by the kernel (i.e. 33554432 in our case).
>
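
It may be worth confirming what buffer size the kernel actually
granted. A minimal sketch in Python, assuming a plain UDP socket (the
helper name is hypothetical). On Linux the kernel caps the request at
net.core.rmem_max and reports back double the stored value, to account
for bookkeeping overhead, so the number read back will usually differ
from the one requested:

```python
import socket


def set_receive_buffer(sock, requested_bytes):
    """Request a receive buffer size and return what the kernel reports.

    The returned value is whatever getsockopt() says afterwards, which
    is the only reliable way to see the size actually in effect.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
effective = set_receive_buffer(sock, 65536)  # small value, for illustration
print("effective SO_RCVBUF:", effective)
sock.close()
```

Logging this value once at startup would show whether the 33554432
request is really taking effect on the production sockets.
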
>  The servers are generally lightly loaded: typically they have a load
>  of <1.0, and rarely does the load exceed 3.0---yet the servers have
>  eight physical cores.
>
>  This is with kernel 2.6.9-42.ELsmp, i.e. the default for CentOS 4.4.
>
>  This doesn't appear to be a CPU problem.  I wrote a simple multicast
>  testing program.  It sends a constant stream of messages, and, in a
>  separate thread, logs the time of each send.  I wrote a
>  corresponding receive program (logs receive times in a separate
>  thread).  Running eight instances of cpuburn, I can't generate any
>  significant delays.  However, if I run something like
>
>     dd bs=1024000 if=/dev/zero of=zeros.dat count=12288
>
>  I can create multicast delays over one second.  This will also
>  generate high iowait in the sar log.  However, in actual production
>  use, no process should ever push the disk as hard as that "dd" test.
>  (In other words, while I can duplicate the problem, I'm not sure
>  it's a fair test).
>
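
For comparison, here is a rough sketch of what such a test sender
might look like in Python. It is not your program: the packet layout,
the function names and the group/port in the usage note are all
assumptions made for illustration. Each message carries a sequence
number and the send time, so the receiver can tell a late message from
a lost one.

```python
import socket
import struct
import time

# Hypothetical wire format: 8-byte sequence number followed by the
# 8-byte send time in nanoseconds, both in network byte order.
PACKET = struct.Struct("!QQ")


def build_packet(seq, send_ns):
    """Pack a sequence number and a send timestamp into 16 bytes."""
    return PACKET.pack(seq, send_ns)


def parse_packet(data):
    """Return (seq, send_ns) from the first 16 bytes of a datagram."""
    return PACKET.unpack(data[:PACKET.size])


def make_sender(ttl=1, loop=True):
    """Create a UDP socket configured for sending multicast.

    ttl=1 keeps packets on the local subnet; loop=True asks the kernel
    to deliver a copy to receivers on the sending host as well.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1 if loop else 0)
    return sock


def send_stream(sock, group, port, count, interval_s):
    """Send count timestamped messages, one every interval_s seconds."""
    for seq in range(count):
        sock.sendto(build_packet(seq, time.time_ns()), (group, port))
        time.sleep(interval_s)
```

Usage would be along the lines of
send_stream(make_sender(), "239.1.1.1", 5000, 1000, 0.01), where the
group address and port are placeholders. Running this alongside the dd
command, with and without the logging thread writing to disk, might
help show whether the delay is in the network path or in the logging.
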
>  Any ideas or suggestions would be much appreciated.  I don't really
>  know enough about the kernel's network architecture to devise any
>  more tests or know how else I might be able to pinpoint the cause of
>  this problem.
>
>  Thank you,
>  Matt
>
>  --
>  To unsubscribe from this list: send the line "unsubscribe linux-net" in
>  the body of a message to majordomo@xxxxxxxxxxxxxxx
>  More majordomo info at  http://vger.kernel.org/majordomo-info.html
>