On Wed, Jan 02, 2013 at 08:08:48PM +0000, Eric Wong wrote:
> (changing Cc:)
>
> Eric Wong <normalperson@xxxxxxxx> wrote:
> > I'm finding ppoll() unexpectedly stuck when waiting for POLLIN on a
> > local TCP socket.  The isolated code below can reproduce the issue
> > after many minutes (<1 hour).  It might be easier to reproduce on
> > a busy system while disk I/O is happening.
>
> s/might be/is/
>
> Strangely, I've bisected this seemingly networking-related issue down to
> the following commit:
>
>   commit 1fb3f8ca0e9222535a39b884cb67a34628411b9f
>   Author: Mel Gorman <mgorman@xxxxxxx>
>   Date:   Mon Oct 8 16:29:12 2012 -0700
>
>       mm: compaction: capture a suitable high-order page immediately when it is made available
>
> That commit doesn't revert cleanly on v3.7.1, and I don't feel
> comfortable touching that code myself.
>

That patch introduced an accounting bug that was corrected by ef6c5be6
(fix incorrect NR_FREE_PAGES accounting (appears like memory leak)). In
some cases that could look like a hang and potentially confuse a
bisection.

That said, I see that you report that 3.7.1 and 3.8-rc2 are affected,
which include that fix, and the finger is pointed at compaction, so
something is wrong.

> Instead, I disabled THP+compaction under v3.7.1 and I've been unable to
> reproduce the issue without THP+compaction.
>

Implying that it's stuck in compaction somewhere. It could be the case
that compaction alters timing enough to trigger another bug. You say it
tests differently depending on whether TCP or unix sockets are used,
which might indicate multiple problems. However, let's try and see if
compaction is the primary problem or not.

> As I mention in http://mid.gmane.org/20121229113434.GA13336@xxxxxxxxxxxxx
> I run my below test (`toosleepy') with heavy network and disk activity
> for a long time before hitting this.
>

Using a 3.7.1 or 3.8-rc2 kernel, can you reproduce the problem and then
answer the following questions please?

1. What are the contents of /proc/vmstat at the time it is stuck?

2. What are the contents of /proc/PID/stack for every toosleepy
   process when they are stuck?

3. Can you do a sysrq+m and post the resulting dmesg?

What I'm looking for is a throttling bug (if pgscan_direct_throttle is
elevated), an isolated page accounting bug (nr_isolated_* is elevated
and process is stuck in congestion_wait in a too_many_isolated() loop)
or a free page accounting bug (big difference between nr_free_pages and
buddy list figures).

I'll try reproducing this early next week if none of that shows an
obvious candidate.

Thanks.

--
Mel Gorman
SUSE Labs
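
For anyone collecting the data requested above, the following is a minimal Python sketch of how the three suspect signatures could be pulled out of saved /proc dumps. It is not part of the original message. The function names are hypothetical, the exact counter names nr_isolated_anon and nr_isolated_file are assumed to be what nr_isolated_* expands to, and /proc/buddyinfo is used here as a stand-in for the buddy list figures that sysrq+m prints to dmesg.

```python
"""Sketch: extract the counters of interest from /proc text dumps."""

# Counters named in the message. The nr_isolated_* expansion is an
# assumption based on typical /proc/vmstat output.
WATCHED = (
    "pgscan_direct_throttle",
    "nr_isolated_anon",
    "nr_isolated_file",
    "nr_free_pages",
)


def parse_vmstat(text):
    """Turn "name value" lines from /proc/vmstat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def buddy_free_pages(text):
    """Sum free pages over all zones in /proc/buddyinfo text.

    Each line looks like "Node 0, zone Normal c0 c1 c2 ...", where c_k
    is the number of free blocks of 2**k pages.
    """
    total = 0
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        for order, count in enumerate(parts[4:]):
            total += int(count) << order
    return total


def summarize(vmstat_text, buddyinfo_text):
    """Report the watched counters plus the nr_free_pages vs buddy gap."""
    stats = parse_vmstat(vmstat_text)
    report = {name: stats.get(name) for name in WATCHED}
    report["buddy_free_pages"] = buddy_free_pages(buddyinfo_text)
    free = report["nr_free_pages"]
    report["free_page_gap"] = (
        None if free is None else free - report["buddy_free_pages"]
    )
    return report
```

On the affected machine this could be called as summarize(open("/proc/vmstat").read(), open("/proc/buddyinfo").read()) while toosleepy is stuck. The two files are read at slightly different moments, so a small gap is expected; only a large and persistent one would hint at the free page accounting bug. Reading /proc/PID/stack usually needs root, and the sysrq+m dump can also be triggered with "echo m > /proc/sysrq-trigger".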