Hi Jim,

I'm resurrecting an ancient thread here, but: we've just observed this
on another big cluster and remembered that this hasn't actually been
fixed.

I think the right solution is to make an option that will setsockopt on
SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this,
wip-tcp.  Do you mind checking to see if this addresses the issue
(without manually adjusting things in /proc)?

And perhaps we should consider making this default to 256KB...

Thanks!
sage


On Fri, 24 Feb 2012, Jim Schutt wrote:

> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
> > On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
> > > I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
> > > per OSD.  During a test I watch both OSD servers with both
> > > vmstat and iostat.
> > >
> > > During a "good" period, vmstat says the server is sustaining > 2 GB/s
> > > for multiple tens of seconds.  Since I use replication factor 2, that
> > > means that server is sustaining > 500 MB/s aggregate client
> > > throughput, right?  During such a period vmstat also reports
> > > ~10% CPU idle.
> > >
> > > During a "bad" period, vmstat says the server is doing ~200 MB/s,
> > > with lots of idle cycles.  It is during these periods that
> > > messages stuck in the policy throttler build up such long
> > > wait times.  Sometimes I see really bad periods with aggregate
> > > throughput per server < 100 MB/s.
> > >
> > > The typical pattern I see is that a run starts with tens of seconds
> > > of aggregate throughput > 2 GB/s.  Then it drops and bounces around
> > > 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
> > > it ramps back up near 2 GB/s again.
> >
> > Hmm.  100MB/s is awfully low for this theory, but have you tried to
> > correlate the drops in throughput with the OSD journals running out of
> > space?  I assume from your setup that they're sharing the disk with
> > the store (although it works either way), and your description makes
> > me think that throughput is initially constrained by sequential
> > journal writes but then the journal runs out of space and the OSD has
> > to wait for the main store to catch up (with random IO), and that
> > sends the IO patterns all to hell.  (If you can say that random 4MB
> > IOs are hellish.)
> >
> > I'm also curious about memory usage as a possible explanation for the
> > more dramatic drops.
>
> I've finally figured out what is going on with this behaviour.
> Memory usage was on the right track.
>
> It turns out to be an unfortunate interaction between the
> number of OSDs/server, number of clients, TCP socket buffer
> autotuning, the policy throttler, and limits on the total
> memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
>
> What happens is that for throttled reader threads, the
> TCP stack will continue to receive data as long as there
> is available socket buffer, and the sender has data to send.
>
> As each reader thread receives successive messages, the
> TCP socket buffer autotuning increases the size of the
> socket buffer.  Eventually, due to the number of OSDs
> per server and the number of clients trying to write,
> all the memory the TCP stack is allowed by net/ipv4/tcp_mem
> to use is consumed by the socket buffers of throttled
> reader threads.  When this happens, TCP processing is affected
> to the point that the TCP stack cannot send ACKs on behalf
> of the reader threads that aren't throttled.  At that point
> the OSD stalls until the TCP retransmit count on some connection
> is exceeded, causing it to be reset.
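
For anyone who wants to confirm this on a live server: the condition Jim
describes shows up as the TCP page count in /proc/net/sockstat climbing
toward the limits in /proc/sys/net/ipv4/tcp_mem.  The following is only an
illustrative standalone check, not Ceph code; it just reads the standard
proc files and prints what it finds.

// Illustrative check (not part of Ceph): compare the TCP stack's current
// page usage against the net/ipv4/tcp_mem limits to see whether throttled
// readers' socket buffers have pushed TCP into memory pressure.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  // tcp_mem holds three page counts: low, pressure, high.
  std::ifstream tcp_mem("/proc/sys/net/ipv4/tcp_mem");
  long low = 0, pressure = 0, high = 0;
  tcp_mem >> low >> pressure >> high;

  // /proc/net/sockstat contains a line like
  //   TCP: inuse 52 orphan 0 tw 18 alloc 60 mem 123456
  // where "mem" is the number of pages currently used for TCP buffers.
  std::ifstream sockstat("/proc/net/sockstat");
  std::string line;
  long tcp_pages = -1;
  while (std::getline(sockstat, line)) {
    if (line.compare(0, 4, "TCP:") != 0)
      continue;
    std::istringstream fields(line.substr(4));
    std::string key;
    long value;
    while (fields >> key >> value)
      if (key == "mem")
        tcp_pages = value;
  }

  std::cout << "TCP pages in use: " << tcp_pages
            << "  (tcp_mem low/pressure/high: " << low << "/"
            << pressure << "/" << high << ")\n";
  if (tcp_pages >= high)
    std::cout << "at the tcp_mem limit -- expect stalled ACKs and resets\n";
  return 0;
}

If this is the same problem, that count should sit pinned near the limit
during the "bad" periods and fall back once connections get reset.
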
>
> Since my OSD servers don't run anything else, the simplest
> solution for me is to turn off socket buffer autotuning
> (net/ipv4/tcp_moderate_rcvbuf), and set the default socket
> buffer size to something reasonable.  256k seems to be
> working well for me right now.
>
> -- Jim
>
> > -Greg
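
For context on the proposed fix: on Linux, explicitly setting SO_RCVBUF on
a socket turns off receive-buffer autotuning for that socket and caps the
buffer (the kernel stores roughly double the requested value and clamps it
to net.core.rmem_max).  Below is only a rough sketch of what such a
per-socket cap could look like, assuming a 256KB value; it is not the
actual wip-tcp change, and the helper name and constant are illustrative.

// Rough sketch (not the actual wip-tcp code): cap a socket's receive
// buffer so TCP autotuning cannot grow it while the policy throttler
// holds the reader thread.
#include <sys/socket.h>
#include <cstdio>

static const int kRcvBufBytes = 256 * 1024;  // illustrative 256KB cap

// Call before connect(), or on the listening socket so accepted sockets
// inherit it; the buffer size needs to be known when the window scale is
// negotiated for the cap to take full effect.
static int cap_rcvbuf(int fd, int bytes = kRcvBufBytes) {
  // The kernel doubles the requested value (to cover bookkeeping
  // overhead) and clamps it to net.core.rmem_max.  Setting SO_RCVBUF
  // explicitly also disables autotuning for this socket.
  if (::setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
    perror("setsockopt(SO_RCVBUF)");
    return -1;
  }
  return 0;
}

The system-wide alternative Jim describes above is to set
net/ipv4/tcp_moderate_rcvbuf to 0 and lower the default receive buffer
(presumably the middle value of net/ipv4/tcp_rmem) to around 256k, which
needs no code change but affects every socket on the host rather than just
the OSD connections.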