Re: Investigating my 100 IOPS limit

Jan Schermer <jan@xxxxxxxxxxx> · Thu, 9 Jul 2015 15:58:36 +0200

This is blktrace of the problem but this is the first time I’ve used it. It begins when “swapper” (probably because it is a dirty page and thus gets flushed?) show up.

+8 should mean 8 sectors = 4KiB?

 8,160  1    55925     2.692651182 1436712  Q FWS [fio]
  8,160  1    55926     2.692652285 1436712  G FWS [fio]
  8,160  1    55927     2.692652923 1436712  I FWS [fio]
  8,160  1    55928     2.692842413  6770  C  WS 16760024 + 8 [0]
  8,160  1    55929     2.692881511  6770  C  WS 0 [0]
  8,160  1    55930     2.693029408     0  C  WS 4023248 + 8 [0]
  8,160  1    55931     2.693066725  6770  C  WS 0 [0]
  8,160  1    55932     2.693266146     0  C  WS 16328824 + 8 [0]
  8,160  1    55933     2.693297033 1436712  Q FWS [fio]
  8,160  1    55934     2.693297628 1436712  G FWS [fio]
  8,160  1    55935     2.693298042 1436712  I FWS [fio]
  8,160  1    55936     2.693478425     0  C  WS 11772464 + 8 [0]
  8,160  1    55937     2.693484440 1436712  Q FWS [fio]
  8,160  1    55938     2.693484919 1436712  G FWS [fio]
  8,160  1    55939     2.693485180 1436712  I FWS [fio]
  8,160  1    55940     2.693669076     0  C  WS 19637160 + 8 [0]
  8,160  1    55941     2.693675679 1436712  Q FWS [fio]
  8,160  1    55942     2.693676180 1436712  G FWS [fio]
  8,160  1    55943     2.693676706 1436712  I FWS [fio]
  8,160  1    55944     2.693898452     0  C  WS 12364112 + 8 [0]
  8,160  1    55945     2.693963594     0  C  WS 0 [0]
  8,160  1    55946     2.693968646 1436712  U   N [fio] 0
  8,160  1    55947     2.693990010 1436712  Q  WS 1864456 + 8 [fio]
  8,160  1    55948     2.693990846 1436712  G  WS 1864456 + 8 [fio]
  8,160  1    55949     2.693991543 1436712  I  WS 1864456 + 8 [fio]
  8,160  1    55950     2.693992127 1436712  D  WS 1864456 + 8 [fio]
  8,160  1    55951     2.693995244 1436712  U   N [fio] 1
  8,160  1    55952     2.694233777 1436712  Q FWFS 468960767 + 16 [fio]
  8,160  1    55953     2.694234504 1436712  G FWFS 468960767 + 16 [fio]
  8,160  1    55954     2.694235144 1436712  I FWFS 468960767 + 16 [fio]
  8,160  1    55955     2.706561235  8989  C WFS 468960767 + 16 [0]
  8,160  1    55956     2.706562115  8989  C WFS 468960767 [0]
  8,160  1    55957     2.725928434 1436712  Q FWFS 468960804 + 18 [fio]
  8,160  1    55958     2.725929908 1436712  G FWFS 468960804 + 18 [fio]
  8,160  1    55959     2.725930980 1436712  I FWFS 468960804 + 18 [fio]
  8,160  7       12     2.706817448     0  C  WS 17954496 + 8 [0]
  8,160  7       13     2.706892420     0  D WFS 468960783 + 10 [swapper]
  8,160  7       14     2.715675674     0  C  WS 3242200 + 8 [0]
  8,160  7       15     2.715771745     0  D WFS 468960793 + 11 [swapper]
  8,160  7       16     2.736072549     0  C  WS 2058840 + 8 [0]
  8,160  7       17     2.736167908     0  D WFS 468960822 + 10 [swapper]
  8,160  7       18     2.746284722     0  C  WS 20142728 + 8 [0]
  8,160  7       19     2.746370687     0  D WFS 468960832 + 11 [swapper]

Jan

> On 09 Jul 2015, at 15:50, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
> 
>>> Any ideas where to look? I was hoping blktrace would show what exactly is going on, but it just shows a synchronous write -> (10ms) -> completed
> 
> which size is the write in this case ? 4K ? or more ?
> 
> 
> ----- Mail original -----
> De: "Jan Schermer" <jan@xxxxxxxxxxx>
> À: "aderumier" <aderumier@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Envoyé: Jeudi 9 Juillet 2015 15:29:15
> Objet: Re:  Investigating my 100 IOPS limit
> 
> I tried everything: —write-barrier, —sync —fsync, —fdatasync 
> I never get the same 10ms latency. Must be something the filesystem journal/log does that is special. 
> 
> Any ideas where to look? I was hoping blktrace would show what exactly is going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote: 
>> 
>>>> I have 12K IOPS in this test on the block device itself. But only 100 filesystem transactions (=IOPS) on filesystem on the same device because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the >>same “flush” operation with fio on the block device, unfortunately, so I have no idea what is causing that :/ 
>> 
>> AFAIK, with fio on block device with --sync=1, is doing flush after each write. 
>> 
>> I'm not sure with fio on a filesystem, but filesystem should do a fsync after file write. 
>> 
>> 
>> ----- Mail original ----- 
>> De: "Jan Schermer" <jan@xxxxxxxxxxx> 
>> À: "aderumier" <aderumier@xxxxxxxxx> 
>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx> 
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>> Objet: Re:  Investigating my 100 IOPS limit 
>> 
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and higher have it for sure. 
>> 
>> I have 12K IOPS in this test on the block device itself. But only 100 filesystem transactions (=IOPS) on filesystem on the same device because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the same “flush” operation with fio on the block device, unfortunately, so I have no idea what is causing that :/ 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote: 
>>> 
>>> Hi, 
>>> I have already see bad performance with Crucial m550 ssd, 400 iops syncronous write. 
>>> 
>>> Not sure what model of ssd do you have ? 
>>> 
>>> see this: 
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ 
>>> 
>>> what is your result of disk directly with 
>>> 
>>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync 
>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>>> 
>>> ? 
>>> 
>>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough mode, and don't have any problem. 
>>> 
>>> 
>>> also about centos 2.6.32, I'm not sure FUA support has been backported by redhat (since true FUA support is since 2.6.37), 
>>> so maybe it's the old barrier code. 
>>> 
>>> 
>>> ----- Mail original ----- 
>>> De: "Jan Schermer" <jan@xxxxxxxxxxx> 
>>> À: "ceph-users" <ceph-users@xxxxxxxxxxxxxx> 
>>> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
>>> Objet:  Investigating my 100 IOPS limit 
>>> 
>>> I hope this would be interesting for some, it nearly cost me my sanity. 
>>> 
>>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit with the LSI controllers and some drives. 
>>> It almost drove me crazy as I could replicate the problem with ease but when I wanted to show it to someone it was often gone. Sometimes it required fio to write for some time for the problem to manifest again, required seemingly conflicting settings to come up… 
>>> 
>>> Well, turns out the problem is fio calling fallocate() when creating the file to use for this test, which doesn’t really allocate the blocks, it just “reserves” them. 
>>> When fio writes to those blocks, the filesystem journal becomes the bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>>> 
>>> If, however, I create the file with dd or such, those writes do _not_ end in the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
>>> If, for example, I run fio with a 1M block size, it would still do 100* IOPS and when I then run a 4K block size test without deleting the file, it would run at a 10K IOPS pace until it hits the first unwritten blocks - then it slows to a crawl again. 
>>> 
>>> The same issue is present with XFS and ext3/ext4 (with default mount options), and no matter how I create the filesystem or mount it can I avoid this problem. The only way to avoid this problem is to mount ext4 with -o journal_async_commit, which should be safe, but... 
>>> 
>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and Kingston SSDs in this case (interestingly, this issue does not seem to occur on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” support for the drives (AFAIK they don’t support it so the controller must somehow flush the cache, which is what introduces a huge latency hit). 
>>> I can’t replicate this problem on the block device itself, only on a file on filesystem, so it might as well be a kernel/driver bug. I have a blktrace showing the difference between the “good” and “bad” writes, but I don’t know what the driver/controller does - I only see the write on the log device finishing after a long 10ms. 
>>> 
>>> Could someone tell me how CEPH creates the filesystem objects? I suppose it does fallocate() as well, right? Any way to force it to write them out completely and not use it to get around this issue I have? 
>>> 
>>> How to replicate: 
>>> 
>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test --size=1000M --ioengine=libaio 
>>> 
>>> 
>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-) 
>>> 
>>> Jan 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@xxxxxxxxxxxxxx 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com