I've read through what I believe are all the threads related to the possible race condition between iput() and __sync_one(). It appears from Rahul's last posting to Marcello that it was decided it was impossible to have iput() and __sync_one() both trying to process the same inode. If this is a wrong assumption on my part, and the problem was already found and fixed, let me know and just ignore the rest of this.

Configuration: 8-way IA-64 / 2.4.20 (but the problem exists in 2.4.25)

I can state with confidence that the race does indeed occur, and I have multiple crash dumps that prove it. The most common occurrence seems to be with writing to a /proc file (running irqbalance does this). If kupdated runs while the file's inode is on the s_dirty list, it will call __sync_one(). If, at the same time, iput() is running against the same inode, the race can occur.

On 2.4.20 kernels, the window is between the clearing of I_LOCK and the testing of I_FREEING in __sync_one(). In 2.4.25, the window is between the clearing of I_LOCK in __sync_one() and the test of I_FREEING in __refile_inode().

Here's how the window is opened: if an interrupt comes in between the clearing of I_LOCK and the testing of I_FREEING, there is an opportunity for iput() to call clear_inode() (which clears all of the state bits except I_CLEAR), and it can even go as far as calling destroy_inode(). Under severe memory pressure we have seen the system go as far as returning the inode-cache page to the allocator, and that page was then handed to another process. This is where bad things happen.

When the interrupt returns, the inode can get moved to the unused list. If the inode had already been returned to the inode cache by destroy_inode(), then the next time that inode is allocated it will be added to the in-use list. At this point the two lists are linked together, since get_new_inode() does not do a list_del() on the inode before doing a list_add(). (There's a small userspace mock-up of this at the end of the message.) Also, when the interrupt returns, it is possible to interleave list operations between __sync_one() and dispose_list(), which doesn't hold the spinlock. This can cause all sorts of strange connections, including loops, depending on the architecture.

One more bad thing: if the low-latency patch is installed and the in-use and unused lists are linked, then it is possible for the unused list head to be moved onto the in-use list, and we've even seen the in-use list head on a dispose list.

The easiest way I know to reproduce the problem is the following:

1) You need more than 2 processors (we're running 8). There's a report that it may have occurred on a 2-way system.

2) The system needs to be busy (but not overly busy). We have been keeping up the demand on inodes and interrupts by creating lots of files on multiple volumes and deleting them.

3) /proc seems to be the best trigger. Just run an infinite loop writing to a /proc file. I was using /proc/sys/kernel/kdb and just kept echoing a 0 into it (a trivial loop is at the end of this message).

4) You can either wait for something bad to happen, or do what we did: in __sync_one(), just before the wake_up() at the end, check whether the I_CLEAR bit is set. We also held the window open by adding an mdelay(1) after clearing the I_LOCK bit; if the code were safe, this would not add any risk. (A sketch of this instrumentation is at the end of the message.)

/proc is interesting, because it has a delete_inode() function that just sets the I_CLEAR bit. That's another problem....

--
Charlie Brett <cfb@xxxxxx>
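
Here is a small userspace mock-up of the list corruption described above. This is my own illustration using a simplified list_head, not kernel code; the point is just what list_add() does when the entry is still linked on another list and nobody did a list_del() first (the get_new_inode() situation):

#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

int main(void)
{
	struct list_head unused, in_use, inode;
	struct list_head *p;
	int hops = 0;

	INIT_LIST_HEAD(&unused);
	INIT_LIST_HEAD(&in_use);

	/* __sync_one() refiles the already-freed inode onto the unused list */
	list_add(&inode, &unused);

	/* later, get_new_inode() hands out the same inode and list_add()s it
	 * onto the in-use list WITHOUT a list_del() from the unused list */
	list_add(&inode, &in_use);

	/* the unused list's head still points at the inode, but the inode's
	 * links now point into the in-use list: the two lists are spliced,
	 * and walking the unused list never gets back to its own head */
	for (p = unused.next; p != &unused && hops < 10; p = p->next, hops++)
		printf("hop %d: %s\n", hops,
		       p == &inode  ? "the inode" :
		       p == &in_use ? "the in-use list head" : "???");
	if (p != &unused)
		printf("gave up after %d hops without finding the unused head\n",
		       hops);
	return 0;
}

Running it shows the walk bouncing between the inode and the in-use head forever, which is the kind of "strange connection, including loops" I mentioned above.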
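
For step 3, the trigger was just a shell loop echoing 0 into /proc/sys/kernel/kdb. A C equivalent (open/write/close forever; any writable /proc file should do, the path here is just the one I happened to use) is:

#include <stdio.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/sys/kernel/kdb", "w");

		if (!f)
			return 1;
		fputs("0\n", f);
		fclose(f);
	}
}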
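
For step 4, the instrumentation in __sync_one() amounted to the following. The surrounding lines are paraphrased from memory rather than copied from the 2.4.20 fs/inode.c, so treat this as a sketch of where the checks go, not as a patch; mdelay() needs linux/delay.h, and what you do when the check fires (printk, breakpoint, BUG()) is up to you -- a printk() is shown here:

	spin_lock(&inode_lock);
	inode->i_state &= ~I_LOCK;
	mdelay(1);	/* debug only: hold the window open; if the code were
			 * safe this would add no risk */
	if (!(inode->i_state & I_FREEING)) {
		/* ... existing code that refiles the inode onto s_dirty,
		 * the in-use list or the unused list ... */
	}
	/* debug only: if iput()/clear_inode() got in during the window above,
	 * we have just refiled an inode that is already cleared (or freed) */
	if (inode->i_state & I_CLEAR)
		printk(KERN_ERR "__sync_one: inode %p is already I_CLEAR\n",
		       inode);
	wake_up(&inode->i_wait);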