Re: Impact of page cache on OSD read performance for SSD

On Wed, Sep 24, 2014 at 9:27 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> On 09/24/2014 07:38 AM, Sage Weil wrote:
>>
>> On Wed, 24 Sep 2014, Haomai Wang wrote:
>>>
>>> I agree that direct read will help for disk reads. But if the read data
>>> is hot and small enough to fit in memory, the page cache is a good place
>>> to hold it. If we discard the page cache, we need to implement a cache of
>>> our own with an effective lookup implementation.
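
For illustration, the lookup structure such a FileStore-side cache would need
is essentially an LRU map. A minimal self-contained sketch (hypothetical
class; std::string stands in for the buffered object data FileStore would
actually hold):

// Hypothetical sketch of an LRU data cache FileStore could keep if it
// bypassed the page cache. Not existing Ceph code.
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class SimpleLRUCache {
  size_t max_entries;
  // Most-recently-used entries live at the front of the list.
  std::list<std::pair<std::string, std::string>> lru;
  std::unordered_map<std::string,
      std::list<std::pair<std::string, std::string>>::iterator> index;

public:
  explicit SimpleLRUCache(size_t max) : max_entries(max) {}

  bool lookup(const std::string &key, std::string *out) {
    auto it = index.find(key);
    if (it == index.end())
      return false;                              // cache miss
    lru.splice(lru.begin(), lru, it->second);    // promote to MRU
    *out = it->second->second;
    return true;
  }

  void insert(const std::string &key, const std::string &data) {
    auto it = index.find(key);
    if (it != index.end())
      lru.erase(it->second);                     // replace existing entry
    lru.emplace_front(key, data);
    index[key] = lru.begin();
    if (lru.size() > max_entries) {              // evict least recently used
      index.erase(lru.back().first);
      lru.pop_back();
    }
  }
};

Whether such a cache beats the kernel page cache for hot, small working sets
is exactly the trade-off being discussed here.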
>>
>>
>> This is true for some workloads, but not necessarily true for all.  Many
>> clients (notably RBD) will be caching at the client side (in VM's fs, and
>> possibly in librbd itself) such that caching at the OSD is largely wasted
>> effort.  For RGW the opposite is likely true, unless there is a varnish cache
>> or something in front.
>>
>> We should probably have a direct_io config option for filestore.  But even
>> better would be some hint from the client about whether it is caching or
>> not so that FileStore could conditionally cache...
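
Roughly what that conditional read path could look like, as a minimal sketch
only (the helper name and the flag are hypothetical, not existing FileStore
code; note that O_DIRECT requires 4K-aligned buffers, lengths and offsets):

// Hypothetical sketch: read either through the page cache or via O_DIRECT,
// depending on a config option or a per-op client hint.
#include <fcntl.h>      // O_DIRECT (Linux; may need _GNU_SOURCE)
#include <unistd.h>
#include <algorithm>
#include <cerrno>
#include <cstdlib>
#include <cstring>

ssize_t read_object_data(const char *path, off_t off, size_t len,
                         char *out, bool use_odirect)
{
  int fd = ::open(path, O_RDONLY | (use_odirect ? O_DIRECT : 0));
  if (fd < 0)
    return -errno;

  ssize_t r;
  if (use_odirect) {
    const size_t align = 4096;
    size_t alen = (len + align - 1) & ~(align - 1);   // round length up to 4K
    void *abuf = nullptr;
    if (posix_memalign(&abuf, align, alen) != 0) {
      ::close(fd);
      return -ENOMEM;
    }
    r = ::pread(fd, abuf, alen, off);                 // off assumed 4K-aligned
    if (r > 0)
      memcpy(out, abuf, std::min<size_t>(r, len));    // copy out requested bytes
    free(abuf);
  } else {
    r = ::pread(fd, out, len, off);                   // buffered (page cache)
  }
  int err = errno;
  ::close(fd);
  return r < 0 ? -err : r;
}

With a client-side hint along the lines of "I am caching this myself", the
same switch could be flipped per read instead of globally.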
>
>
> I like the hinting idea.  Having said that, if the effect being seen is due
> to page cache, it seems like something is off.  We've seen performance
> issues in the kernel before so it's not unprecedented. Working around it
> with direct IO could be the right way to go, but it might be that this is
> something that could be fixed higher up and improve performance in other
> scenarios too.  I'd hate to let it go by the wayside if we could find
> something actionable.
>
>
>>
>> sage
>>
>>   >
>>>
>>> BTW, for whether to use direct IO we can look at prior art: MySQL's
>>> InnoDB engine uses direct IO, while PostgreSQL relies on the page cache.
>>>
>>> On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
>>> wrote:
>>>>
>>>> Haomai,
>>>> I am only considering random reads, and the changes I made affect reads
>>>> only. For writes, I have not measured yet. But, yes, the page cache may be
>>>> helpful for write coalescing; I still need to evaluate how it behaves
>>>> compared to direct_io on SSD, though. I think the Ceph code path will be
>>>> much shorter if we use direct_io in the write path, where it actually
>>>> executes the transactions. Probably the sync thread and related machinery
>>>> will not be needed.
>>>>
>>>> I am trying to analyze where the extra reads are coming from in the
>>>> buffered IO case by using blktrace etc. This should give us a clear
>>>> understanding of what exactly is going on there, and it may turn out that
>>>> by tuning kernel parameters alone we can achieve performance similar to
>>>> direct_io.
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
>>>> Sent: Tuesday, September 23, 2014 7:07 PM
>>>> To: Sage Weil
>>>> Cc: Somnath Roy; Milosz Tanski; ceph-devel@xxxxxxxxxxxxxxx
>>>> Subject: Re: Impact of page cache on OSD read performance for SSD
>>>>
>>>> Good point, but have you considered the impact on write ops?
>>>> And if we skip the page cache, is FileStore then responsible for caching data?
>>>>
>>>> On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Tue, 23 Sep 2014, Somnath Roy wrote:
>>>>>>
>>>>>> Milosz,
>>>>>> Thanks for the response. I will see if I can get any information out
>>>>>> of perf.
>>>>>>
>>>>>> Here is my OS information.
>>>>>>
>>>>>> root@emsclient:~# lsb_release -a
>>>>>> No LSB modules are available.
>>>>>> Distributor ID: Ubuntu
>>>>>> Description:    Ubuntu 13.10
>>>>>> Release:        13.10
>>>>>> Codename:       saucy
>>>>>> root@emsclient:~# uname -a
>>>>>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
>>>>>> UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
>>>>>> I was able to get almost a *2X* performance improvement with direct_io.
>>>>>> It's not only the page cache (memory) lookup; in the buffered_io case
>>>>>> the following could be problems.
>>>>>>
>>>>>> 1. Double copy (disk -> file buffer cache, file buffer cache -> user
>>>>>> buffer)
>>>>>>
>>>>>> 2. As the iostat output shows, it is not reading only 4K; it is
>>>>>> reading more data from disk than required, and in the end that extra
>>>>>> data is wasted with a random workload.
>>>>>
>>>>>
>>>>> It might be worth using blktrace to see what IOs it is issuing,
>>>>> which ones are > 4K, and what they point to...
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Milosz Tanski [mailto:milosz@xxxxxxxxx]
>>>>>> Sent: Tuesday, September 23, 2014 12:09 PM
>>>>>> To: Somnath Roy
>>>>>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>>>>>> Subject: Re: Impact of page cache on OSD read performance for SSD
>>>>>>
>>>>>> Somnath,
>>>>>>
>>>>>> I wonder if there's a bottleneck or a point of contention in the
>>>>>> kernel. For an entirely uncached workload I expect the page cache lookup
>>>>>> to cause a slowdown (since the lookup is just wasted work). What I
>>>>>> wouldn't expect is a 45% performance drop. Memory should be an order of
>>>>>> magnitude faster than a modern SATA SSD, so the overhead should be close
>>>>>> to negligible.
>>>>>>
>>>>>> Is there any way you could perform the same test but monitor what's
>>>>>> going on with the OSD process using the perf tool? Whatever the default
>>>>>> CPU-time hardware counter is will be fine. Make sure you have the kernel
>>>>>> debug info package installed so you can get symbol information for kernel
>>>>>> and module calls. With any luck the diff between the perf output of the
>>>>>> two runs will show us the culprit.
>>>>>>
>>>>>> Also, can you tell us what OS/kernel version you're using on the OSD
>>>>>> machines?
>>>>>>
>>>>>> - Milosz
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>> I have created the following setup in order to examine how a single
>>>>>>> OSD behaves when, say, ~80-90% of the IOs are hitting the SSD.
>>>>>>>
>>>>>>> My test includes the following steps.
>>>>>>>
>>>>>>>          1. Created a single OSD cluster.
>>>>>>>          2. Created two rbd images (110GB each) on 2 different pools.
>>>>>>>          3. Populated entire image, so my working set is ~210GB. My
>>>>>>> system memory is ~16GB.
>>>>>>>          4. Dumped page cache before every run.
>>>>>>>          5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two
>>>>>>> images.
>>>>>>>
>>>>>>> Here is my disk iops/bandwidth..
>>>>>>>
>>>>>>>          root@emsclient:~/fio_test# fio rad_resd_disk.job
>>>>>>>          random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K,
>>>>>>> ioengine=libaio, iodepth=64
>>>>>>>          2.0.8
>>>>>>>          Starting 1 process
>>>>>>>          Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0
>>>>>>> iops] [eta 00m:00s]
>>>>>>>          random-reads: (groupid=0, jobs=1): err= 0: pid=1431
>>>>>>>          read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt=
>>>>>>> 60002msec
>>>>>>>
>>>>>>> My fio_rbd config..
>>>>>>>
>>>>>>> [global]
>>>>>>> ioengine=rbd
>>>>>>> clientname=admin
>>>>>>> pool=rbd1
>>>>>>> rbdname=ceph_regression_test1
>>>>>>> invalidate=0    # mandatory
>>>>>>> rw=randread
>>>>>>> bs=4k
>>>>>>> direct=1
>>>>>>> time_based
>>>>>>> runtime=2m
>>>>>>> size=109G
>>>>>>> numjobs=8
>>>>>>> [rbd_iodepth32]
>>>>>>> iodepth=32
>>>>>>>
>>>>>>> Now, I have run Giant Ceph on top of that..
>>>>>>>
>>>>>>> 1. OSD config with 25 shards/1 thread per shard :
>>>>>>> -------------------------------------------------------
>>>>>>>
>>>>>>>           avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>            22.04    0.00   16.46   45.86    0.00   15.64
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     9.00    0.00    6.00     0.00    92.00
>>>>>>> 30.67     0.01    1.33    0.00    1.33   1.33   0.80
>>>>>>> sdd               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sde               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh             181.00     0.00 34961.00    0.00 176740.00     0.00
>>>>>>> 10.11   102.71    2.92    2.92    0.00   0.03 100.00
>>>>>>> sdc               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>>
>>>>>>> ceph -s:
>>>>>>>   ----------
>>>>>>> root@emsclient:~# ceph -s
>>>>>>>      cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>       health HEALTH_OK
>>>>>>>       monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch
>>>>>>> 1, quorum 0 a
>>>>>>>       osdmap e498: 1 osds: 1 up, 1 in
>>>>>>>        pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>              366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                   832 active+clean
>>>>>>>    client io 75215 kB/s rd, 18803 op/s
>>>>>>>
>>>>>>>   cpu util:
>>>>>>> ----------
>>>>>>>   Gradually decreases from ~21 core (serving from cache) to ~10 core
>>>>>>> (while serving from disks).
>>>>>>>
>>>>>>>   My Analysis:
>>>>>>> -----------------
>>>>>>>   In this case all is well as long as IOs are served from cache (XFS
>>>>>>> is smart enough to cache some data). Once they start hitting the disk,
>>>>>>> throughput decreases. As you can see, the disk is delivering ~35K IOPS,
>>>>>>> but OSD throughput is only ~18.8K! So a cache miss with buffered IO
>>>>>>> seems to be very expensive; half of the IOPS are wasted. Also, looking
>>>>>>> at the bandwidth it is obvious that not everything is a 4K read; maybe
>>>>>>> kernel read_ahead is kicking in(?).
>>>>>>>
>>>>>>>
>>>>>>> Now, I thought of making the Ceph disk read direct_io and running the
>>>>>>> same experiment. I have changed FileStore::read to do direct_io only;
>>>>>>> the rest is kept as is. Here is the result with that.
>>>>>>>
>>>>>>>
>>>>>>> Iostat:
>>>>>>> -------
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>            24.77    0.00   19.52   21.36    0.00   34.36
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sde               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh               0.00     0.00 25295.00    0.00 101180.00     0.00
>>>>>>> 8.00    12.73    0.50    0.50    0.00   0.04 100.80
>>>>>>> sdc               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> ceph -s:
>>>>>>>   --------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>      cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>       health HEALTH_OK
>>>>>>>       monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch
>>>>>>> 1, quorum 0 a
>>>>>>>       osdmap e522: 1 osds: 1 up, 1 in
>>>>>>>        pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>              366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                   832 active+clean
>>>>>>>    client io 100 MB/s rd, 25618 op/s
>>>>>>>
>>>>>>> cpu util:
>>>>>>> --------
>>>>>>>    ~14 core while serving from disks.
>>>>>>>
>>>>>>>   My Analysis:
>>>>>>>   ---------------
>>>>>>> No surprises here. Ceph throughput almost matches the disk
>>>>>>> throughput.
>>>>>>>
>>>>>>>
>>>>>>> Let's tweak the shard/thread settings and see the impact.
>>>>>>>
>>>>>>>
>>>>>>> 2. OSD config with 36 shards and 1 thread/shard:
>>>>>>> -----------------------------------------------------------
>>>>>>>
>>>>>>>     Buffered read:
>>>>>>>     ------------------
>>>>>>>    No change, output is very similar to 25 shards.
>>>>>>>
>>>>>>>
>>>>>>>    direct_io read:
>>>>>>>    ------------------
>>>>>>>         Iostat:
>>>>>>>        ----------
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>            33.33    0.00   28.22   23.11    0.00   15.34
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     0.00    0.00    2.00     0.00    12.00
>>>>>>> 12.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sde               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh               0.00     0.00 31987.00    0.00 127948.00     0.00
>>>>>>> 8.00    18.06    0.56    0.56    0.00   0.03 100.40
>>>>>>> sdc               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>>         ceph -s:
>>>>>>>      --------------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>      cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>       health HEALTH_OK
>>>>>>>       monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch
>>>>>>> 1, quorum 0 a
>>>>>>>       osdmap e525: 1 osds: 1 up, 1 in
>>>>>>>        pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>              366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                   832 active+clean
>>>>>>>    client io 127 MB/s rd, 32763 op/s
>>>>>>>
>>>>>>>          cpu util:
>>>>>>>     --------------
>>>>>>>         ~19 core while serving from disks.
>>>>>>>
>>>>>>>           Analysis:
>>>>>>> ------------------
>>>>>>>          It is scaling with the increased number of shards/threads.
>>>>>>> Parallelism also increased significantly.
>>>>>>>
>>>>>>>
>>>>>>> 3. OSD config with 48 shards and 1 thread/shard:
>>>>>>>   ----------------------------------------------------------
>>>>>>>      Buffered read:
>>>>>>>     -------------------
>>>>>>>      No change, output is very similar to 25 shards.
>>>>>>>
>>>>>>>
>>>>>>>     direct_io read:
>>>>>>>      -----------------
>>>>>>>         Iostat:
>>>>>>>        --------
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>            37.50    0.00   33.72   20.03    0.00    8.75
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sde               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh               0.00     0.00 35360.00    0.00 141440.00     0.00
>>>>>>> 8.00    22.25    0.62    0.62    0.00   0.03 100.40
>>>>>>> sdc               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>>           ceph -s:
>>>>>>>         --------------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>      cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>       health HEALTH_OK
>>>>>>>       monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch
>>>>>>> 1, quorum 0 a
>>>>>>>       osdmap e534: 1 osds: 1 up, 1 in
>>>>>>>        pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>              366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                   832 active+clean
>>>>>>>    client io 138 MB/s rd, 35582 op/s
>>>>>>>
>>>>>>>           cpu util:
>>>>>>>   ----------------
>>>>>>>          ~22.5 core while serving from disks.
>>>>>>>
>>>>>>>            Analysis:
>>>>>>>   --------------------
>>>>>>>          It is scaling with the increased number of shards/threads.
>>>>>>> Parallelism also increased significantly.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 4. OSD config with 64 shards and 1 thread/shard:
>>>>>>>   ---------------------------------------------------------
>>>>>>>        Buffered read:
>>>>>>>       ------------------
>>>>>>>       No change, output is very similar to 25 shards.
>>>>>>>
>>>>>>>
>>>>>>>       direct_io read:
>>>>>>>       -------------------
>>>>>>>         Iostat:
>>>>>>>        ---------
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>            40.18    0.00   34.84   19.81    0.00    5.18
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sde               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh               0.00     0.00 39114.00    0.00 156460.00     0.00
>>>>>>> 8.00    35.58    0.90    0.90    0.00   0.03 100.40
>>>>>>> sdc               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00
>>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>>         ceph -s:
>>>>>>>   ---------------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>      cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>       health HEALTH_OK
>>>>>>>       monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch
>>>>>>> 1, quorum 0 a
>>>>>>>       osdmap e537: 1 osds: 1 up, 1 in
>>>>>>>        pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>              366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                   832 active+clean
>>>>>>>    client io 153 MB/s rd, 39172 op/s
>>>>>>>
>>>>>>>        cpu util:
>>>>>>> ----------------
>>>>>>>      ~24.5 core while serving from disks. ~3% cpu left.
>>>>>>>
>>>>>>>         Analysis:
>>>>>>> ------------------
>>>>>>>        It is scaling with the increased number of shards/threads.
>>>>>>> Parallelism also increased significantly. It is disk bound now.
>>>>>>>
>>>>>>>
>>>>>>> Summary:
>>>>>>>
>>>>>>> So, it seems buffered IO has a significant impact on performance when
>>>>>>> the backend is SSD.
>>>>>>> My question is: if the workload is very random and the storage (SSD)
>>>>>>> is very large compared to system memory, shouldn't we always go for
>>>>>>> direct_io instead of buffered IO in Ceph?
>>>>>>>
>>>>>>> Please share your thoughts/suggestion on this.
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Milosz Tanski
>>>>>> CTO
>>>>>> 16 East 34th Street, 15th floor
>>>>>> New York, NY 10016
>>>>>>
>>>>>> p: 646-253-9055
>>>>>> e: milosz@xxxxxxxxx
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>>
>>>> Wheat
>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>>
>>>
>>
>

I wonder how much (if any) using posix_fadvise with the
POSIX_FADV_RANDOM hint would help in this case, since that tells the
kernel not to perform (aggressive) read-ahead.
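
Trying it would be a very small change; a minimal sketch (the wrapper name
is made up for illustration):

// Advise the kernel that access on this fd will be random, which disables
// (or at least curbs) read-ahead for it. Offset 0 / length 0 = whole file.
#include <fcntl.h>
#include <cstdio>
#include <cstring>

bool advise_random(int fd)
{
  int r = ::posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (r != 0) {             // returns an errno value directly, not -1/errno
    fprintf(stderr, "posix_fadvise: %s\n", strerror(r));
    return false;
  }
  return true;
}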

Sadly, POSIX_FADV_NOREUSE is a no-op in current kernels, although
there have been patches floating around over the years to implement it:
http://lxr.free-electrons.com/source/mm/fadvise.c#L113 and
http://thread.gmane.org/gmane.linux.file-systems/61511

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx