On Tue, Apr 1, 2008 at 6:05 PM, Matt Garman <matthew.garman@xxxxxxxxx> wrote:
> We're using multicast basically for some inter-process
> communication.
>
Which protocol(s) are in use? (UDP, IGMP, ...)

> We timestamp (and log, in a separate thread) all of our sends and
> receives, and do analysis on the logs.
>
Are the timestamps carried in the multicast messages themselves? If so,
could the sender and receiver clocks be out of sync, producing the
apparent "delays"?

> We're finding occasional (once or twice a day) "blips" where the
> receipt of multicast messages is delayed anywhere from 200
> milliseconds to three or four whole seconds.
>
> In one case, we have only one server in the network, and are still
> seeing this.  In this scenario, do the multicast messages actually
> use the physical network?
>
> I'm running sar on these machines (collecting data every five
> seconds); any delay >600 ms seems to coincide with extremely high
> iowait (but the load on any CPU during these times is always below
> 1.0).
>
> We have the sysctl net.core.rmem_max parameter set to 33554432.
>
> Our code uses setsockopt() to set the receiving buffer to the
> maximum size allowed by the kernel (i.e. 33554432 in our case).
>
> The servers are generally lightly loaded: typically they have a load
> of <1.0, and rarely does the load exceed 3.0---yet the servers have
> eight physical cores.
>
> This is with kernel 2.6.9-42.ELsmp, i.e. the default for CentOS 4.4.
>
> This doesn't appear to be a CPU problem.  I wrote a simple multicast
> testing program.  It sends a constant stream of messages, and, in a
> separate thread, logs the time of each send.  I wrote a
> corresponding receive program (logs receive times in a separate
> thread).  Running eight instances of cpuburn, I can't generate any
> significant delays.  However, if I run something like
>
>     dd bs=1024000 if=/dev/zero of=zeros.dat count=12288
>
> I can create multicast delays over one second.  This will also
> generate high iowait in the sar log.
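
One way to take clock sync out of the picture for a test like this is to look only at inter-arrival gaps measured on the receiver's own monotonic clock. The following is a rough sketch, not your test program: the group address, port, and function names are placeholders I made up, and the 200 ms threshold is simply the low end of the delays you describe. It assumes the sender emits at a steady rate, so a large gap between consecutive arrivals means a stall somewhere.

```python
import socket
import struct
import time


def find_gaps(arrivals, threshold_s=0.2):
    """Return (index, gap_seconds) for every inter-arrival gap above threshold_s."""
    gaps = []
    for i in range(1, len(arrivals)):
        gap = arrivals[i] - arrivals[i - 1]
        if gap > threshold_s:
            gaps.append((i, gap))
    return gaps


def receive_arrivals(group="239.1.1.1", port=5000, count=1000):
    """Join a multicast group and record one monotonic timestamp per datagram.

    group and port are hypothetical placeholders, not values from this thread.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: multicast group address followed by the local interface (any).
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    arrivals = []
    try:
        for _ in range(count):
            sock.recv(65535)
            # Timestamp right after recv() returns, before any logging or disk I/O.
            arrivals.append(time.monotonic())
    finally:
        sock.close()
    return arrivals
```

Calling find_gaps(receive_arrivals()) while the dd test runs would show whether the stalls are visible at recv() itself. If they are not, but the logged times still show delays, the logging thread (which touches the disk) would be the next thing I would look at. That is only a guess on my part.
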
> However, in actual production
> use, no process should ever push the disk as hard as that "dd" test.
> (In other words, while I can duplicate the problem, I'm not sure
> it's a fair test).
>
> Any ideas or suggestions would be much appreciated.  I don't really
> know enough about the kernel's network architecture to devise any
> more tests or know how else I might be able to pinpoint the cause of
> this problem.
>
> Thank you,
> Matt
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-net" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
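
On the receive buffer mentioned above: on Linux, a setsockopt() request for SO_RCVBUF larger than net.core.rmem_max is capped without an error, so it may be worth reading the value back with getsockopt() to confirm what the socket actually got. A minimal sketch, assuming a plain UDP socket; the helper name is mine, not something from your code:

```python
import socket


def set_and_check_rcvbuf(sock, requested):
    """Request a receive buffer size and return what the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    requested = 33554432  # the rmem_max value quoted in this thread
    granted = set_and_check_rcvbuf(s, requested)
    s.close()
    print("requested", requested, "kernel reports", granted)
```

Linux normally reports double the requested size because it reserves space for bookkeeping overhead, so a reported value well below twice the request would suggest the request was capped.
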