(tl;dr summary: sometimes the Linux page cache slows down fio --direct=1 tests despite the direct I/O, and whether or not that happens appears to depend on a race condition involving multiple fio job threads and udev, plus a quirk of Linux page-cache behavior. For a performance-testing tool, fio ought to produce consistent results.)

We've been running fio tests against a Linux device-mapper driver we're working on, and we've found a curious bimodal distribution of performance numbers with "fio --direct=1 --rw=randwrite --ioengine=libaio --numjobs=2 --thread --norandommap ..." run directly against the device, on the 3.2 kernel (Debian "Squeeze"). One of our developers found that in the "slow" cases the kernel spends more time in __lookup in the kernel's radix-tree library (used by the page cache) than in the "fast" cases, even though we're using direct I/O.

After digging into the fio and kernel code for a while, and sacrificing a couple of chickens at the altar of SystemTap [1], this is my current understanding of the situation:

During the do_io call, the target file (in my case, a block device) is opened by each "job" thread. There's a hash table keyed by filename. In most runs, a job thread either finds the filename already in the hash, or doesn't find it and adds an entry after opening the file. Once in a while, though, a thread doesn't find the filename in the hash table, but by the time it has opened the file and tries to update the hash, the entry is already there, added by the other job thread. The generic_open_file code then closes the file and retries, this time finding the filename in the hash table.

However, closing a file that was opened with read/write access triggers udev to run blkid, and sometimes udisks-part-id. These run quickly, but they open and read the device without O_DIRECT. That inserts some pages (about 25) into the page cache and bumps the page count in the file's associated mapping structure. Those entries are discarded only when the individual pages are overwritten (unlikely to happen for all of them under randwrite with norandommap unless we write far more than the device size), or all at once when the open-handle count on the file drops to zero (which won't happen until the test finishes), or when memory pressure gets too high, etc.

As fio runs its test, the kernel function generic_file_direct_write is called, and if page-cache mappings exist (mapping->nrpages > 0), it calls into the page-cache code to invalidate any mappings for the range being written. On our test machines the cost seems to be on the microsecond scale per invalidation call, but it adds up: at 1GB/s with a 4KB block size, for example, that's roughly 256K invalidation calls per second.

For an easy-to-describe example that highlights the problem, I tested with the device-mapper dm-zero module (dmsetup create zzz --table "0 4000000000000 zero"; fio --filename=/dev/mapper/zzz ...), which involves no external I/O hardware, to see what the kernel's behavior is on its own. This device discards data on write and zero-fills on read, just like /dev/zero, except that it's a block device and so can interact with the page cache. I did a set of runs with a 4KB block size [2] and got upwards of 3800MB/s when the race condition didn't trigger; the few times it did, the write rate was under 2400MB/s, a drop of over 35%. (Since that's with two threads, the individual threads are doing 1900MB/s vs 1200MB/s, or 2.2us/block vs 3.4us/block.)
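To make the open-path race concrete before going on to the larger runs below, here is a toy, standalone illustration of the lookup/open/insert pattern described above. The names are hypothetical and the "hash table" is boiled down to a single shared slot; this is not fio's actual code. Two threads race to open the same path; the loser discovers the hash entry appeared after its open(), so it closes its fd and retries, and on a real block device that extra close() of a read/write descriptor is what wakes udev up. Build with "gcc -pthread -o openrace openrace.c".

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static int hash_entry_fd = -1;            /* stand-in for the filename hash */

/* Return the fd already in the "hash", or -1 if none. */
static int lookup_hash(void)
{
        int fd;

        pthread_mutex_lock(&hash_lock);
        fd = hash_entry_fd;
        pthread_mutex_unlock(&hash_lock);
        return fd;
}

/* Try to install fd; fails if another thread got there first. */
static int add_to_hash(int fd)
{
        int ret = 0;

        pthread_mutex_lock(&hash_lock);
        if (hash_entry_fd != -1)
                ret = -1;
        else
                hash_entry_fd = fd;
        pthread_mutex_unlock(&hash_lock);
        return ret;
}

static void *job_thread(void *arg)
{
        const char *path = arg;
        int fd;

open_again:
        if (lookup_hash() != -1) {
                printf("thread %lu: found entry in hash, no open needed\n",
                       (unsigned long)pthread_self());
                return NULL;
        }
        fd = open(path, O_RDWR);          /* both threads can reach this */
        if (fd < 0)
                return NULL;
        if (add_to_hash(fd) != 0) {
                /* Lost the race: the close() here is what triggers udev. */
                printf("thread %lu: lost race, closing and retrying\n",
                       (unsigned long)pthread_self());
                close(fd);
                goto open_again;
        }
        printf("thread %lu: opened and hashed fd %d\n",
               (unsigned long)pthread_self(), fd);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, job_thread, "/dev/null");
        pthread_create(&t2, NULL, job_thread, "/dev/null");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

Most runs of this print "opened and hashed" and "found entry in hash"; occasionally you'll see the lost-race/close-and-retry path instead, which mirrors the intermittent behavior described above.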
Back to the numbers: a couple of 10x-larger test runs got similar results, though both the fast and slow "modes" were a little faster. With a 1MB block size, fewer invalidation calls happened, but each one covered a larger range of addresses, and the difference was down around 2%. The results still fell into two very tightly grouped clusters, though, so the 2% isn't just random run-to-run variance.

Why each invalidation call should be that expensive with so few pages mapped, I don't know, but it clearly is costly if you've got a device that should be getting GB/s-range performance. (There may also be lock-contention issues exacerbating the problem, since the mapping is shared.)

Not using --norandommap seems to consistently avoid the problem, probably because each thread then calls smalloc, which allocates and clears the random-map memory while holding a global lock; that may stagger the threads enough to avoid the race most of the time, but it probably doesn't guarantee it. Using one job, or not using --thread, would avoid the problem because there wouldn't be two threads competing over one instance of the hash table.

I have a few ideas for making the behavior more consistent. I tried a global lock in generic_open_file in filesetup.c, held across the file opening and the hash-table update; it seems to eliminate the "slow" results, but it feels rather hackish, so I'm still looking at how best to fix the issue. (A toy sketch of the idea is appended after the footnotes.) (Thanks to Michael Sclafani at Permabit for his help in digging into this.)

Oh yes: I'm also seeing another blkid run triggered at the start of the fio invocation, I believe caused by opening and closing the device to ascertain its size. There's a delay of about 0.1s before the actual test starts, which seems to be long enough for blkid to complete, but I don't see anything that ensures it actually has completed. There may be another, harder-to-hit race condition there.

Ken

[1] I log calls to __blkdev_get, __blkdev_put, blkdev_close, and blkdev_open for the target device, with process ids and names and bdev->bd_openers values, and periodically report mapping->nrpages using the filp->f_mapping value saved in blkdev_open, so I can see when the blkid and fio open/close sequences overlap and when page-cache mappings are retained. The script is available at http://pastebin.com/gM3kURHp for now.

[2] My full command line, including some irrelevant options inherited from our original test case:

.../fio --bs=4096 --rw=randwrite --name=generic_job_name --filename=/dev/mapper/zzz --numjobs=2 --size=26843545600 --thread --norandommap --group_reporting --gtod_reduce=1 --unlink=0 --direct=1 --rwmixread=70 --iodepth=1024 --iodepth_batch_complete=16 --iodepth_batch_submit=16 --ioengine=libaio --scramble_buffers=1 --offset=0 --offset_increment=53687091200
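P.S. For anyone curious about the shape of the global-lock workaround mentioned above, here's a toy sketch in the same spirit as the earlier illustration. Again, the names and the one-slot "hash" are hypothetical; this is not the actual fio patch. Holding one mutex across the whole lookup/open/insert sequence means the losing thread sees the hash entry before it ever calls open(), so the extra open()/close() pair (and the udev-triggering close) can't happen. Build with "gcc -pthread -o serialize serialize.c".

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t open_lock = PTHREAD_MUTEX_INITIALIZER;
static int hash_entry_fd = -1;            /* stand-in for the filename hash */

/* Lookup, open, and hash insertion serialized under one lock. */
static int serialized_open(const char *path)
{
        int fd;

        pthread_mutex_lock(&open_lock);
        if (hash_entry_fd != -1) {
                fd = hash_entry_fd;       /* already opened by the other thread */
        } else {
                fd = open(path, O_RDWR);
                if (fd >= 0)
                        hash_entry_fd = fd;
        }
        pthread_mutex_unlock(&open_lock);
        return fd;
}

static void *job_thread(void *arg)
{
        int fd = serialized_open(arg);

        printf("thread %lu: using fd %d, no close-and-retry possible\n",
               (unsigned long)pthread_self(), fd);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, job_thread, "/dev/null");
        pthread_create(&t2, NULL, job_thread, "/dev/null");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

The obvious downside, and why it feels hackish, is that it serializes all file opens across job threads rather than just the opens of the same file.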