Investigating my 100 IOPS limit

Jan Schermer <jan@xxxxxxxxxxx> · Thu, 9 Jul 2015 12:32:04 +0200

I hope this would be interesting for some, it nearly cost me my sanity.

Some time ago I came here with a problem manifesting as a “100 IOPS*” limit with the LSI controllers and some drives.
It almost drove me crazy as I could replicate the problem with ease but when I wanted to show it to someone it was often gone. Sometimes it required fio to write for some time for the problem to manifest again, required seemingly conflicting settings to come up…

Well, turns out the problem is fio calling fallocate() when creating the file to use for this test, which doesn’t really allocate the blocks, it just “reserves” them.
When fio writes to those blocks, the filesystem journal becomes the bottleneck (100 IOPS* limit can be seen there with 100% utilization).

If, however, I create the file with dd or such, those writes do _not_ end in the journal, and the result is 10K synchronous 4K IOPS on the same drive.
If, for example, I run fio with a 1M block size, it would still do 100* IOPS and when I then run a 4K block size test without deleting the file, it would run at a 10K IOPS pace until it hits the first unwritten blocks - then it slows to a crawl again.

The same issue is present with XFS and ext3/ext4 (with default mount options), and no matter how I create the filesystem or mount it can I avoid this problem. The only way to avoid this problem is to mount ext4 with -o journal_async_commit, which should be safe, but...

I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and Kingston SSDs in this case (interestingly, this issue does not seem to occur on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” support for the drives (AFAIK they don’t support it so the controller must somehow flush the cache, which is what introduces a huge latency hit).
I can’t replicate this problem on the block device itself, only on a file on filesystem, so it might as well be a kernel/driver bug. I have a blktrace showing the difference between the “good” and “bad” writes, but I don’t know what the driver/controller does - I only see the write on the log device finishing after a long 10ms. 

Could someone tell me how CEPH creates the filesystem objects? I suppose it does fallocate() as well, right? Any way to force it to write them out completely and not use it to get around this issue I have?

How to replicate:

fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test --size=1000M --ioengine=libaio

* It is in fact 98 IOPS. Exactly. Not more, not less :-)

Jan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com