We have seen similar poor performance with Intel S3700 and S3710 on LSI SAS3008 with CFQ on 3.13, 3.18 and 3.19 kernels.
Switching to noop fixed the problems for us.
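For reference, a minimal sketch of how to check and switch the scheduler at runtime (the sysfs path is standard; /dev/sdb is just an example device, and the change is not persistent across reboots):

# show the available schedulers; the active one is in brackets
cat /sys/block/sdb/queue/scheduler
# switch this device to noop
echo noop > /sys/block/sdb/queue/scheduler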
On Fri, Jul 10, 2015 at 4:30 AM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>That’s very strange. Is nothing else using the disks?
No, only the fio benchmark.
>>The difference between noop and cfq should be (and in my experience is) marginal for such a benchmark.
Maybe a bug in CFQ (kernel 3.16, Debian Jessie)? Also, the deadline scheduler gives me the same performance as noop.
----- Original Message -----
From: "Jan Schermer" <jan@xxxxxxxxxxx>
To: "aderumier" <aderumier@xxxxxxxxx>
Cc: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, July 9, 2015 18:20:51
Subject: Re: Investigating my 100 IOPS limit
That’s very strange. Is nothing else using the disks?
The difference between noop and cfq should be (and in my experience is) marginal for such a benchmark.
Jan
> On 09 Jul 2015, at 18:11, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>
> Hi again,
>
> I totally forgot to check the I/O scheduler in my last tests; they were run with CFQ.
>
> With the noop scheduler, I see a huge difference:
>
> cfq:
>
> - sequential synchronous 4k write iodepth=1 : 60 iops
> - sequential synchronous 4k write iodepth=32 : 2000 iops
>
>
> noop:
>
> - sequential synchronous 4k write iodepth=1 : 7866 iops
> - sequential synchronous 4k write iodepth=32 : 34303 iops
>
>
> ----- Original Message -----
> From: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>
> To: "Jan Schermer" <jan@xxxxxxxxxxx>, "aderumier" <aderumier@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Thursday, July 9, 2015 17:46:41
> Subject: RE: Investigating my 100 IOPS limit
>
> I am not sure how increasing iodepth for sync writes is giving you better results; the sync fio engine is supposed to always use iodepth=1.
> BTW, I faced similar issues some time back. Running the following fio job file, I was getting very dismal performance on my SSD on top of XFS:
>
> [random-write]
> directory=/mnt/fio_test
> rw=randwrite
> bs=16k
> direct=1
> sync=1
> time_based
> runtime=1m
> size=700G
> group_reporting
>
> Result :
> --------
> IOPS = 420
>
> lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01%
> lat (msec) : 2=20.05%, 4=46.64%, 10=8.68%
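>
> To reproduce, the job file above can be saved as, say, randwrite-sync.fio (the name is just an example) with directory= pointed at the XFS mount, and run with:
>
> fio randwrite-sync.fio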
>
>
> Turned out it was an SSD firmware problem. Some SSDs tend to misbehave in this pattern (even directly on the block device, without any XFS) because they don't handle O_DIRECT|O_SYNC writes well. I am sure you will find some references by digging into the ceph mailing list archives. That's why not all SSDs behave well as a Ceph journal.
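>
> A quick way to check whether the firmware rather than the filesystem is the culprit is to repeat the same pattern directly on the block device. A sketch, assuming the SSD is /dev/sdb (destructive, so only run it on a disk with no data on it):
>
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=randwrite --bs=16k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fw-test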
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
> Sent: Thursday, July 09, 2015 8:24 AM
> To: Alexandre DERUMIER
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Investigating my 100 IOPS limit
>
> Those are very strange numbers. Is the “60” figure right?
>
> Can you paste the full fio command and output?
> Thanks
>
> Jan
>
>> On 09 Jul 2015, at 15:58, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>
>> I just tried on an Intel S3700, on top of XFS:
>>
>> fio, with:
>> - sequential synchronous 4k write iodepth=1 : 60 iops
>> - sequential synchronous 4k write iodepth=32 : 2000 iops
>> - random synchronous 4k write, iodepth=1 : 8000 iops
>> - random synchronous 4k write iodepth=32 : 18000 iops
>>
>>
>>
>> ----- Original Message -----
>> From: "aderumier" <aderumier@xxxxxxxxx>
>> To: "Jan Schermer" <jan@xxxxxxxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>> Sent: Thursday, July 9, 2015 15:50:35
>> Subject: Re: Investigating my 100 IOPS limit
>>
>>>> Any ideas where to look? I was hoping blktrace would show what
>>>> exactly is going on, but it just shows a synchronous write -> (10ms)
>>>> -> completed
>>
>> What size is the write in this case? 4K, or more?
>>
>>
>> ----- Original Message -----
>> From: "Jan Schermer" <jan@xxxxxxxxxxx>
>> To: "aderumier" <aderumier@xxxxxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>> Sent: Thursday, July 9, 2015 15:29:15
>> Subject: Re: Investigating my 100 IOPS limit
>>
>> I tried everything: --write-barrier, --sync, --fsync, --fdatasync; I never
>> get the same 10ms latency. It must be something special that the filesystem journal/log does.
>>
>> Any ideas where to look? I was hoping blktrace would show what exactly
>> is going on, but it just shows a synchronous write -> (10ms) ->
>> completed
>>
>> Jan
>>
>>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>
>>>>> I have 12K IOPS in this test on the block device itself. But only
>>>>> 100 filesystem transactions (=IOPS) on filesystem on the same
>>>>> device because the “flush” (=FUA?) operation takes 10ms to finish.
>>>>> I just can’t replicate the >>same “flush” operation with fio on the
>>>>> block device, unfortunately, so I have no idea what is causing that
>>>>> :/
>>>
>>> AFAIK, fio on a block device with --sync=1 does a flush after each write.
>>>
>>> I'm not sure about fio on a filesystem, but the filesystem should do an fsync after each file write.
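>>>
>>> As a sketch of the difference (both are standard fio options; /dev/sdb is just an example device): --sync=1 opens the target with O_SYNC, while --fsync=1 issues an explicit fsync() after every write:
>>>
>>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=30 --time_based --name=osync-test
>>> fio --filename=/dev/sdb --direct=1 --fsync=1 --rw=write --bs=4k --iodepth=1 --runtime=30 --time_based --name=fsync-test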
>>>
>>>
>>> ----- Original Message -----
>>> From: "Jan Schermer" <jan@xxxxxxxxxxx>
>>> To: "aderumier" <aderumier@xxxxxxxxx>
>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>> Sent: Thursday, July 9, 2015 14:43:46
>>> Subject: Re: Investigating my 100 IOPS limit
>>>
>>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and higher have it for sure.
>>>
>>> I have 12K IOPS in this test on the block device itself. But only 100
>>> filesystem transactions (=IOPS) on filesystem on the same device
>>> because the “flush” (=FUA?) operation takes 10ms to finish. I just
>>> can’t replicate the same “flush” operation with fio on the block
>>> device, unfortunately, so I have no idea what is causing that :/
>>>
>>> Jan
>>>
>>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>> I have already seen bad performance with the Crucial m550 SSD: 400 IOPS for synchronous writes.
>>>>
>>>> What model of SSD do you have?
>>>>
>>>> see this:
>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>
>>>> What results do you get on the disk directly with:
>>>>
>>>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
>>>> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>>>> --name=journal-test
>>>>
>>>> ?
>>>>
>>>> I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in passthrough mode, and don't have any problems.
>>>>
>>>>
>>>> Also, about the CentOS 2.6.32 kernel: I'm not sure FUA support has been
>>>> backported by Red Hat (true FUA support only arrived in 2.6.37), so maybe it's the old barrier code.
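>>>>
>>>> If in doubt, the kernel usually logs whether a disk advertises FUA when the device is probed; a quick (hedged) check on the host:
>>>>
>>>> dmesg | grep -i fua
>>>> # on most systems this prints a line per disk such as
>>>> # "sd 0:0:0:0: [sda] Write cache: enabled ... doesn't support DPO or FUA"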
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Jan Schermer" <jan@xxxxxxxxxxx>
>>>> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>>> Sent: Thursday, July 9, 2015 12:32:04
>>>> Subject: Investigating my 100 IOPS limit
>>>>
>>>> I hope this will be interesting for some; it nearly cost me my sanity.
>>>>
>>>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit with the LSI controllers and some drives.
>>>> It almost drove me crazy: I could replicate the problem with ease,
>>>> but when I wanted to show it to someone it was often gone. Sometimes
>>>> it required fio to write for some time for the problem to manifest
>>>> again, or seemingly conflicting settings for it to come up…
>>>>
>>>> Well, it turns out the problem is fio calling fallocate() when creating the file to use for this test, which doesn’t really allocate the blocks, it just “reserves” them.
>>>> When fio writes to those blocks, the filesystem journal becomes the bottleneck (the 100 IOPS* limit can be seen there, at 100% utilization).
>>>>
>>>> If, however, I create the file with dd or such, those writes do _not_ end in the journal, and the result is 10K synchronous 4K IOPS on the same drive.
>>>> If, for example, I run fio with a 1M block size, it still does 100* IOPS, and when I then run a 4K block size test without deleting the file, it runs at a 10K IOPS pace until it hits the first unwritten blocks - then it slows to a crawl again.
>>>>
>>>> The same issue is present with XFS and ext3/ext4 (with default mount options), and no matter how I create or mount the filesystem, I cannot avoid it. The only way around it I have found is to mount ext4 with -o journal_async_commit, which should be safe, but...
>>>>
>>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and Kingston SSDs in this case (interestingly, this issue does not seem to occur on Samsung SSDs!). I think it has something to do with LSI faking “FUA” support for the drives (AFAIK they don’t support it, so the controller must somehow flush the cache, which is what introduces the huge latency hit).
>>>> I can’t replicate this problem on the block device itself, only on a file on filesystem, so it might as well be a kernel/driver bug. I have a blktrace showing the difference between the “good” and “bad” writes, but I don’t know what the driver/controller does - I only see the write on the log device finishing after a long 10ms.
>>>>
>>>> Could someone tell me how Ceph creates the filesystem objects? I suppose it does fallocate() as well, right? Is there any way to force it to write them out completely, rather than use fallocate(), to get around this issue?
>>>>
>>>> How to replicate:
>>>>
>>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write
>>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting
>>>> --name=journal-test --size=1000M --ioengine=libaio
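>>>>
>>>> And a sketch of the workaround described above (paths reuse the example above): either pre-write every block of the file once with dd before the timed run, or, if the fio build supports the option, tell fio not to call fallocate() at all:
>>>>
>>>> # variant 1: pre-write every block once, then run the original fio command
>>>> dd if=/dev/zero of=/mnt/something/testfile.fio bs=1M count=1000 oflag=direct
>>>> # variant 2: ask fio itself to skip fallocate() (option available in recent fio builds)
>>>> fio --filename=/mnt/something/testfile.fio --fallocate=none --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test --size=1000M --ioengine=libaio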
>>>>
>>>>
>>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-)
>>>>
>>>> Jan
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com