On Mon, Jan 23, 2012 at 03:51:43PM -0500, Zheng Da wrote:
> Hello
>
> On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <zhengda1936@xxxxxxxxx> wrote:
> >
> >> > I build XFS on the top of ramdisk. So yes, there is a lot of small
> >> > concurrent writes in a second.
> >> > I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test
> >> > program overwrites 4G of data to the file and each time writes a page of
> >> > data randomly to the file. It's always overwriting, and no appending. The
> >> > offset of each write is always aligned to the page size. There is no
> >> > overlapping between writes.
> >>
> >> Why are you using XFS for this? tmpfs was designed to do this sort
> >> of stuff as efficiently as possible....
> >>
> > OK, I can try that.
>
> tmpfs doesn't support direct IO.

It doesn't need to. The ramdisk is copying data into its own private
page cache and you are using direct IO to avoid the system page cache
(i.e. a double copy). tmpfs just uses the system page cache, so there's
only one copy and it has a much shorter and less complex IO path than
XFS.....

> >> > So the test case is pretty simple and I think it's easy to reproduce it.
> >> > It'll be great if you can try the test case.
> >>
> >> Can you post your test code so I know what I test is exactly what
> >> you are running?
> >>
> > I can do that. My test code gets very complicated now. I need to simplify
> > it.
>
> Here is the code. It's still a bit long. I hope it's OK.
> You can run the code like "rand-read file option=direct pages=1048576
> threads=8 access=write/read".

With 262144 pages on a 2Gb ramdisk, the results I get on 3.2.0 are:

	Threads		Read		Write
	   1		0.92s		1.49s
	   2		0.51s		1.20s
	   4		0.31s		1.34s
	   8		0.22s		1.59s
	  16		0.23s		2.24s

The contention is on the ip->i_ilock, and the newsize update is one of
the offenders. It probably needs this change to
xfs_aio_write_newsize_update():

-	if (new_size == ip->i_new_size) {
+	if (new_size && new_size == ip->i_new_size) {

to avoid the lock being taken here. But all that newsize crap is gone
in the current git Linus tree, so how much would that gain us:

	Threads		Read		Write
	   1		0.88s		0.85s
	   2		0.54s		1.20s
	   4		0.31s		1.23s
	   8		0.27s		1.40s
	  16		0.25s		2.36s

Pretty much nothing. IOWs, it's just like I suspected - you are doing
so many write IOs that you are serialising on the extent lookup and
write checks, which use exclusive locking.

Given that it is 2 lock traversals per write IO, we're limiting at
about 400,000-500,000 exclusive lock grabs per second, and decreasing
as contention goes up. For reads, where we are doing 2 shared (nested)
lookups per read IO, we appear to be limiting at around 2,000,000
shared lock grabs per second. Amdahl's law is kicking in here, but it
means that if we could make the writes use a shared lock, they would at
least scale like the reads for this "no metadata modification except
for mtime" overwrite case.

I don't think that the generic write checks absolutely need exclusive
locking - we could probably get away with a shared lock and only fall
back to exclusive when we need to do EOF zeroing. Similarly, for the
block mapping code, if we don't need to do allocation, a shared lock is
all we need. So maybe in that case, for direct IO when create == 1, we
can do a read lookup first and only grab the lock exclusively if that
falls in a hole and requires allocation.....

Let me think about it for a bit....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
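
[For readers without Zheng's original rand-read program, which is not
reproduced in this message, the following is a minimal sketch of the
kind of multi-threaded random direct-IO overwrite test being discussed.
The file path, page count and thread count are made-up parameters for
illustration only; this is not the code referred to above.]

/*
 * Sketch of a random, page-aligned, O_DIRECT overwrite test.
 * Assumes the target file already exists on the XFS-on-ramdisk mount
 * and is at least NPAGES pages long (e.g. created beforehand with dd).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE	4096
#define NPAGES		262144			/* as in the run above */
#define NTHREADS	8

static const char *path = "/mnt/ram/testfile";	/* assumed mount point */

static void *writer(void *arg)
{
	long id = (long)arg;
	long per_thread = NPAGES / NTHREADS;
	unsigned int seed = id;
	char *buf;
	int fd;
	long i;

	/* O_DIRECT needs an aligned buffer and page-aligned offsets. */
	if (posix_memalign((void **)&buf, PAGE_SIZE, PAGE_SIZE))
		exit(1);
	memset(buf, id, PAGE_SIZE);

	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < per_thread; i++) {
		/* overwrite a randomly chosen page inside the file */
		off_t off = (off_t)(rand_r(&seed) % NPAGES) * PAGE_SIZE;

		if (pwrite(fd, buf, PAGE_SIZE, off) != PAGE_SIZE) {
			perror("pwrite");
			exit(1);
		}
	}
	close(fd);
	free(buf);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, writer, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

[Build with "gcc -O2 -pthread", create the file up front, and time the
run at different thread counts to see the scaling behaviour shown in
the tables above.]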
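
[To make the last idea a little more concrete, here is a rough
userspace rendering of the "do the extent lookup under a shared lock,
and only retake the lock exclusive when the range falls in a hole and
needs allocation" pattern. The extent_map type and the
lookup_extent()/allocate_blocks() helpers are invented stand-ins; this
is not the actual XFS code path or its locking API.]

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct extent_map {
	pthread_rwlock_t lock;		/* stands in for ip->i_ilock */
	/* ... an extent tree would live here ... */
};

/* Hypothetical: returns true if [off, off + len) is already allocated. */
static bool lookup_extent(struct extent_map *em, off_t off, size_t len)
{
	(void)em; (void)off; (void)len;
	return true;			/* pretend it is a pure overwrite */
}

/* Hypothetical: allocate blocks backing [off, off + len). */
static void allocate_blocks(struct extent_map *em, off_t off, size_t len)
{
	(void)em; (void)off; (void)len;
}

void map_blocks_for_dio_write(struct extent_map *em, off_t off, size_t len)
{
	/* Fast path: a shared lock is enough for a pure overwrite. */
	pthread_rwlock_rdlock(&em->lock);
	bool mapped = lookup_extent(em, off, len);
	pthread_rwlock_unlock(&em->lock);
	if (mapped)
		return;

	/*
	 * Slow path: the range falls in a hole, so allocation is needed.
	 * Retake the lock exclusive and repeat the lookup, because the
	 * map may have changed while the lock was dropped.
	 */
	pthread_rwlock_wrlock(&em->lock);
	if (!lookup_extent(em, off, len))
		allocate_blocks(em, off, len);
	pthread_rwlock_unlock(&em->lock);
}

[The re-check under the exclusive lock matters because two writers can
race through the shared fast path for the same hole; only one of them
should end up doing the allocation.]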