Re: Investigating my 100 IOPS limit

Jan Schermer <jan@xxxxxxxxxxx> · Thu, 9 Jul 2015 17:24:12 +0200

Those are very strange numbers. Is the “60” figure right?

Can you paste the full fio command and output?
Thanks

Jan

> On 09 Jul 2015, at 15:58, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
> 
> I just tried on an intel s3700, on top of xfs
> 
> fio , with 
> - sequential syncronous 4k write iodepth=1  : 60 iops
> - sequential syncronous 4k write iodepth=32 : 2000 iops
> - random syncronous 4k write, iodepth=1 : 8000iops
> - random syncronous 4k write iodepth=32 : 18000 iops
> 
> 
> 
> ----- Mail original -----
> De: "aderumier" <aderumier@xxxxxxxxx>
> À: "Jan Schermer" <jan@xxxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Envoyé: Jeudi 9 Juillet 2015 15:50:35
> Objet: Re:  Investigating my 100 IOPS limit
> 
>>> Any ideas where to look? I was hoping blktrace would show what exactly is going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> which size is the write in this case ? 4K ? or more ? 
> 
> 
> ----- Mail original ----- 
> De: "Jan Schermer" <jan@xxxxxxxxxxx> 
> À: "aderumier" <aderumier@xxxxxxxxx> 
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx> 
> Envoyé: Jeudi 9 Juillet 2015 15:29:15 
> Objet: Re:  Investigating my 100 IOPS limit 
> 
> I tried everything: —write-barrier, —sync —fsync, —fdatasync 
> I never get the same 10ms latency. Must be something the filesystem journal/log does that is special. 
> 
> Any ideas where to look? I was hoping blktrace would show what exactly is going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote: 
>> 
>>>> I have 12K IOPS in this test on the block device itself. But only 100 filesystem transactions (=IOPS) on filesystem on the same device because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the >>same “flush” operation with fio on the block device, unfortunately, so I have no idea what is causing that :/ 
>> 
>> AFAIK, with fio on block device with --sync=1, is doing flush after each write. 
>> 
>> I'm not sure with fio on a filesystem, but filesystem should do a fsync after file write. 
>> 
>> 
>> ----- Mail original ----- 
>> De: "Jan Schermer" <jan@xxxxxxxxxxx> 
>> À: "aderumier" <aderumier@xxxxxxxxx> 
>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx> 
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>> Objet: Re:  Investigating my 100 IOPS limit 
>> 
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and higher have it for sure. 
>> 
>> I have 12K IOPS in this test on the block device itself. But only 100 filesystem transactions (=IOPS) on filesystem on the same device because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the same “flush” operation with fio on the block device, unfortunately, so I have no idea what is causing that :/ 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote: 
>>> 
>>> Hi, 
>>> I have already see bad performance with Crucial m550 ssd, 400 iops syncronous write. 
>>> 
>>> Not sure what model of ssd do you have ? 
>>> 
>>> see this: 
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ 
>>> 
>>> what is your result of disk directly with 
>>> 
>>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync 
>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>>> 
>>> ? 
>>> 
>>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough mode, and don't have any problem. 
>>> 
>>> 
>>> also about centos 2.6.32, I'm not sure FUA support has been backported by redhat (since true FUA support is since 2.6.37), 
>>> so maybe it's the old barrier code. 
>>> 
>>> 
>>> ----- Mail original ----- 
>>> De: "Jan Schermer" <jan@xxxxxxxxxxx> 
>>> À: "ceph-users" <ceph-users@xxxxxxxxxxxxxx> 
>>> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
>>> Objet:  Investigating my 100 IOPS limit 
>>> 
>>> I hope this would be interesting for some, it nearly cost me my sanity. 
>>> 
>>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit with the LSI controllers and some drives. 
>>> It almost drove me crazy as I could replicate the problem with ease but when I wanted to show it to someone it was often gone. Sometimes it required fio to write for some time for the problem to manifest again, required seemingly conflicting settings to come up… 
>>> 
>>> Well, turns out the problem is fio calling fallocate() when creating the file to use for this test, which doesn’t really allocate the blocks, it just “reserves” them. 
>>> When fio writes to those blocks, the filesystem journal becomes the bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>>> 
>>> If, however, I create the file with dd or such, those writes do _not_ end in the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
>>> If, for example, I run fio with a 1M block size, it would still do 100* IOPS and when I then run a 4K block size test without deleting the file, it would run at a 10K IOPS pace until it hits the first unwritten blocks - then it slows to a crawl again. 
>>> 
>>> The same issue is present with XFS and ext3/ext4 (with default mount options), and no matter how I create the filesystem or mount it can I avoid this problem. The only way to avoid this problem is to mount ext4 with -o journal_async_commit, which should be safe, but... 
>>> 
>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and Kingston SSDs in this case (interestingly, this issue does not seem to occur on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” support for the drives (AFAIK they don’t support it so the controller must somehow flush the cache, which is what introduces a huge latency hit). 
>>> I can’t replicate this problem on the block device itself, only on a file on filesystem, so it might as well be a kernel/driver bug. I have a blktrace showing the difference between the “good” and “bad” writes, but I don’t know what the driver/controller does - I only see the write on the log device finishing after a long 10ms. 
>>> 
>>> Could someone tell me how CEPH creates the filesystem objects? I suppose it does fallocate() as well, right? Any way to force it to write them out completely and not use it to get around this issue I have? 
>>> 
>>> How to replicate: 
>>> 
>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test --size=1000M --ioengine=libaio 
>>> 
>>> 
>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-) 
>>> 
>>> Jan 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@xxxxxxxxxxxxxx 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> _______________________________________________ 
> ceph-users mailing list 
> ceph-users@xxxxxxxxxxxxxx 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com