On Wed, Sep 24, 2014 at 9:27 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> On 09/24/2014 07:38 AM, Sage Weil wrote:
>>
>> On Wed, 24 Sep 2014, Haomai Wang wrote:
>>>
>>> I agree that direct reads will help for disk reads. But if the read data
>>> is hot and small enough to fit in memory, the page cache is a good place
>>> to hold the data cache. If we discard the page cache, we need to implement
>>> a cache with an effective lookup implementation.
>>
>>
>> This is true for some workloads, but not necessarily true for all. Many
>> clients (notably RBD) will be caching at the client side (in the VM's fs,
>> and possibly in librbd itself) such that caching at the OSD is largely
>> wasted effort. For RGW the opposite is likely true, unless there is a
>> varnish cache or something in front.
>>
>> We should probably have a direct_io config option for filestore. But even
>> better would be some hint from the client about whether it is caching or
>> not so that FileStore could conditionally cache...
>
>
> I like the hinting idea. Having said that, if the effect being seen is due
> to page cache, it seems like something is off. We've seen performance
> issues in the kernel before, so it's not unprecedented. Working around it
> with direct IO could be the right way to go, but it might be that this is
> something that could be fixed higher up and improve performance in other
> scenarios too. I'd hate to let it go by the wayside if we could find
> something actionable.
>
>
>>
>> sage
>>
>> >
>>>
>>> BTW, on whether to use direct io, we can refer to MySQL's InnoDB engine
>>> (direct io) and PostgreSQL (page cache).
>>>
>>> On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
>>> wrote:
>>>>
>>>> Haomai,
>>>> I am considering only random reads here, and the changes I made only
>>>> affect reads. For writes, I have not measured yet. But, yes, the page
>>>> cache may be helpful for write coalescing. I still need to evaluate how
>>>> it behaves compared to direct_io on SSD, though. I think the Ceph code
>>>> path will be much shorter if we use direct_io in the write path where it
>>>> is actually executing the transactions. Probably, the sync thread and
>>>> all will not be needed.
>>>>
>>>> I am trying to analyze where the extra reads are coming from in the
>>>> buffered io case by using blktrace etc. This should give us a clear
>>>> understanding of what exactly is going on there, and it may turn out
>>>> that by tuning kernel parameters alone we can achieve performance
>>>> similar to direct_io.
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
>>>> Sent: Tuesday, September 23, 2014 7:07 PM
>>>> To: Sage Weil
>>>> Cc: Somnath Roy; Milosz Tanski; ceph-devel@xxxxxxxxxxxxxxx
>>>> Subject: Re: Impact of page cache on OSD read performance for SSD
>>>>
>>>> Good point, but have you considered the impact on write ops?
>>>> And if we skip the page cache, is FileStore responsible for the data cache?
>>>>
>>>> On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Tue, 23 Sep 2014, Somnath Roy wrote:
>>>>>>
>>>>>> Milosz,
>>>>>> Thanks for the response. I will see if I can get any information out
>>>>>> of perf.
>>>>>>
>>>>>> Here is my OS information.
>>>>>>
>>>>>> root@emsclient:~# lsb_release -a
>>>>>> No LSB modules are available.
>>>>>> Distributor ID: Ubuntu
>>>>>> Description:    Ubuntu 13.10
>>>>>> Release:        13.10
>>>>>> Codename:       saucy
>>>>>> root@emsclient:~# uname -a
>>>>>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
>>>>>> UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
>>>>>> I was able to get almost a *2X* performance improvement with direct_io.
>>>>>> It's not only the page cache (memory) lookup; in the case of buffered_io
>>>>>> the following could be problems.
>>>>>>
>>>>>> 1. Double copy (disk -> file buffer cache, file buffer cache -> user
>>>>>> buffer)
>>>>>>
>>>>>> 2. As the iostat output shows, it is not reading only 4K; it is
>>>>>> reading more data from disk than required, and in the end it will be
>>>>>> wasted in the case of a random workload.
>>>>>
>>>>> It might be worth using blktrace to see what IOs it is issuing --
>>>>> which ones are > 4K and what they point to...
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Milosz Tanski [mailto:milosz@xxxxxxxxx]
>>>>>> Sent: Tuesday, September 23, 2014 12:09 PM
>>>>>> To: Somnath Roy
>>>>>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>>>>>> Subject: Re: Impact of page cache on OSD read performance for SSD
>>>>>>
>>>>>> Somnath,
>>>>>>
>>>>>> I wonder if there's a bottleneck or a point of contention in the
>>>>>> kernel. For an entirely uncached workload I expect the page cache lookup
>>>>>> to cause a slowdown (since the lookup is wasted). What I wouldn't
>>>>>> expect is a 45% performance drop. Memory speed should be an order of
>>>>>> magnitude faster than a modern SATA SSD drive (so the overhead should
>>>>>> be fairly negligible).
>>>>>>
>>>>>> Is there any way you could perform the same test but monitor what's
>>>>>> going on with the OSD process using the perf tool? Whatever the default
>>>>>> cpu-time-spent hardware counter is, is fine. Make sure you have the
>>>>>> kernel debug info package installed so you can get symbol information
>>>>>> for kernel and module calls. With any luck the diff of the perf output
>>>>>> of the two runs will show us the culprit.
>>>>>>
>>>>>> Also, can you tell us what OS/kernel version you're using on the OSD
>>>>>> machines?
>>>>>>
>>>>>> - Milosz
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>> I have created the following setup in order to examine how a single
>>>>>>> OSD behaves when, say, ~80-90% of the ios hit the SSDs.
>>>>>>>
>>>>>>> My test includes the following steps.
>>>>>>>
>>>>>>> 1. Created a single OSD cluster.
>>>>>>> 2. Created two rbd images (110GB each) on 2 different pools.
>>>>>>> 3. Populated the entire images, so my working set is ~210GB. My
>>>>>>>    system memory is ~16GB.
>>>>>>> 4. Dropped the page cache before every run.
>>>>>>> 5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two
>>>>>>>    images.
>>>>>>>
>>>>>>> Here is my disk iops/bandwidth..
>>>>>>>
>>>>>>> root@emsclient:~/fio_test# fio rad_resd_disk.job
>>>>>>> random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
>>>>>>> 2.0.8
>>>>>>> Starting 1 process
>>>>>>> Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
>>>>>>> random-reads: (groupid=0, jobs=1): err= 0: pid=1431
>>>>>>>   read : io=9316.4MB, bw=158994KB/s, iops=39748, runt= 60002msec
>>>>>>>
>>>>>>> My fio_rbd config..
>>>>>>>
>>>>>>> [global]
>>>>>>> ioengine=rbd
>>>>>>> clientname=admin
>>>>>>> pool=rbd1
>>>>>>> rbdname=ceph_regression_test1
>>>>>>> invalidate=0    # mandatory
>>>>>>> rw=randread
>>>>>>> bs=4k
>>>>>>> direct=1
>>>>>>> time_based
>>>>>>> runtime=2m
>>>>>>> size=109G
>>>>>>> numjobs=8
>>>>>>> [rbd_iodepth32]
>>>>>>> iodepth=32
>>>>>>>
>>>>>>> Now, I have run Giant Ceph on top of that..
>>>>>>>
>>>>>>> 1. OSD config with 25 shards/1 thread per shard:
>>>>>>> -------------------------------------------------------
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           22.04    0.00   16.46   45.86    0.00   15.64
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> sda        0.00    9.00     0.00  6.00       0.00   92.00    30.67     0.01   1.33    0.00    1.33   1.33   0.80
>>>>>>> sdd        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sde        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh      181.00    0.00 34961.00  0.00  176740.00    0.00    10.11   102.71   2.92    2.92    0.00   0.03 100.00
>>>>>>> sdc        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>>
>>>>>>> ceph -s:
>>>>>>> ----------
>>>>>>> root@emsclient:~# ceph -s
>>>>>>>     cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>>>>>>      osdmap e498: 1 osds: 1 up, 1 in
>>>>>>>       pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>             366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                  832 active+clean
>>>>>>>   client io 75215 kB/s rd, 18803 op/s
>>>>>>>
>>>>>>> cpu util:
>>>>>>> ----------
>>>>>>> Gradually decreases from ~21 cores (serving from cache) to ~10 cores
>>>>>>> (while serving from disks).
>>>>>>>
>>>>>>> My Analysis:
>>>>>>> -----------------
>>>>>>> In this case "All is Well" as long as ios are served from the cache
>>>>>>> (XFS is smart enough to cache some data). Once it starts hitting the
>>>>>>> disks, throughput decreases. As you can see, the disk is giving ~35K
>>>>>>> iops, but OSD throughput is only ~18.8K! So, a cache miss in the case
>>>>>>> of buffered io seems to be very expensive: half of the iops are wasted.
>>>>>>> Also, looking at the bandwidth, it is obvious that not everything is a
>>>>>>> 4K read; maybe kernel read_ahead is kicking in (?).
>>>>>>>
>>>>>>>
>>>>>>> Now, I thought of making the ceph disk reads direct_io and doing the
>>>>>>> same experiment. I have changed FileStore::read to do direct_io only;
>>>>>>> the rest is kept as is.
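
For reference, here is a minimal, self-contained sketch of what a direct_io
read looks like at the syscall level. It is only an assumption about the
general approach (O_DIRECT plus an aligned buffer), not the actual
FileStore::read change, and the object file path used below is hypothetical:

    /* Sketch: one 4K read that bypasses the page cache via O_DIRECT. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t block = 4096;   /* assumed IO/alignment size */
        void *buf = NULL;
        ssize_t r;

        /* Hypothetical object file path, for illustration only. */
        int fd = open("/var/lib/ceph/osd/object_file", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires the buffer, offset, and length to be aligned. */
        if (posix_memalign(&buf, block, block) != 0) { close(fd); return 1; }

        /* Bypasses the page cache: no readahead and no extra copy, but also
         * no caching of hot data for later reads. */
        r = pread(fd, buf, block, 0);
        if (r < 0) perror("pread");

        free(buf);
        close(fd);
        return 0;
    }
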
>>>>>>> Here is the result with that.
>>>>>>>
>>>>>>> Iostat:
>>>>>>> -------
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           24.77    0.00   19.52   21.36    0.00   34.36
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> sda        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sde        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh        0.00    0.00 25295.00  0.00  101180.00    0.00     8.00    12.73   0.50    0.50    0.00   0.04 100.80
>>>>>>> sdc        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> ceph -s:
>>>>>>> --------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>     cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>>>>>>      osdmap e522: 1 osds: 1 up, 1 in
>>>>>>>       pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>             366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                  832 active+clean
>>>>>>>   client io 100 MB/s rd, 25618 op/s
>>>>>>>
>>>>>>> cpu util:
>>>>>>> --------
>>>>>>> ~14 cores while serving from disks.
>>>>>>>
>>>>>>> My Analysis:
>>>>>>> ---------------
>>>>>>> No surprises here. The Ceph throughput almost matches whatever the
>>>>>>> disk throughput is.
>>>>>>>
>>>>>>>
>>>>>>> Let's tweak the shard/thread settings and see the impact.
>>>>>>>
>>>>>>>
>>>>>>> 2. OSD config with 36 shards and 1 thread/shard:
>>>>>>> -----------------------------------------------------------
>>>>>>>
>>>>>>> Buffered read:
>>>>>>> ------------------
>>>>>>> No change, output is very similar to 25 shards.
>>>>>>>
>>>>>>> direct_io read:
>>>>>>> ------------------
>>>>>>> Iostat:
>>>>>>> ----------
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           33.33    0.00   28.22   23.11    0.00   15.34
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> sda        0.00    0.00     0.00  2.00       0.00   12.00    12.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sde        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh        0.00    0.00 31987.00  0.00  127948.00    0.00     8.00    18.06   0.56    0.56    0.00   0.03 100.40
>>>>>>> sdc        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> ceph -s:
>>>>>>> --------------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>     cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>>>>>>      osdmap e525: 1 osds: 1 up, 1 in
>>>>>>>       pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>             366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                  832 active+clean
>>>>>>>   client io 127 MB/s rd, 32763 op/s
>>>>>>>
>>>>>>> cpu util:
>>>>>>> --------------
>>>>>>> ~19 cores while serving from disks.
>>>>>>>
>>>>>>> Analysis:
>>>>>>> ------------------
>>>>>>> It is scaling with the increased number of shards/threads. The
>>>>>>> parallelism also increased significantly.
>>>>>>>
>>>>>>>
>>>>>>> 3. OSD config with 48 shards and 1 thread/shard:
>>>>>>> ----------------------------------------------------------
>>>>>>> Buffered read:
>>>>>>> -------------------
>>>>>>> No change, output is very similar to 25 shards.
>>>>>>>
>>>>>>> direct_io read:
>>>>>>> -----------------
>>>>>>> Iostat:
>>>>>>> --------
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           37.50    0.00   33.72   20.03    0.00    8.75
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> sda        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sde        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh        0.00    0.00 35360.00  0.00  141440.00    0.00     8.00    22.25   0.62    0.62    0.00   0.03 100.40
>>>>>>> sdc        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> ceph -s:
>>>>>>> --------------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>     cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>>>>>>      osdmap e534: 1 osds: 1 up, 1 in
>>>>>>>       pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>             366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                  832 active+clean
>>>>>>>   client io 138 MB/s rd, 35582 op/s
>>>>>>>
>>>>>>> cpu util:
>>>>>>> ----------------
>>>>>>> ~22.5 cores while serving from disks.
>>>>>>>
>>>>>>> Analysis:
>>>>>>> --------------------
>>>>>>> It is scaling with the increased number of shards/threads. The
>>>>>>> parallelism also increased significantly.
>>>>>>>
>>>>>>>
>>>>>>> 4. OSD config with 64 shards and 1 thread/shard:
>>>>>>> ---------------------------------------------------------
>>>>>>> Buffered read:
>>>>>>> ------------------
>>>>>>> No change, output is very similar to 25 shards.
>>>>>>>
>>>>>>> direct_io read:
>>>>>>> -------------------
>>>>>>> Iostat:
>>>>>>> ---------
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           40.18    0.00   34.84   19.81    0.00    5.18
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> sda        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdd        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sde        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdg        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdf        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdh        0.00    0.00 39114.00  0.00  156460.00    0.00     8.00    35.58   0.90    0.90    0.00   0.03 100.40
>>>>>>> sdc        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> sdb        0.00    0.00     0.00  0.00       0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> ceph -s:
>>>>>>> ---------------
>>>>>>> root@emsclient:~/fio_test# ceph -s
>>>>>>>     cluster 94991097-7638-4240-b922-f525300a9026
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>>>>>>      osdmap e537: 1 osds: 1 up, 1 in
>>>>>>>       pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>>>>>>             366 GB used, 1122 GB / 1489 GB avail
>>>>>>>                  832 active+clean
>>>>>>>   client io 153 MB/s rd, 39172 op/s
>>>>>>>
>>>>>>> cpu util:
>>>>>>> ----------------
>>>>>>> ~24.5 cores while serving from disks. ~3% cpu left.
>>>>>>>
>>>>>>> Analysis:
>>>>>>> ------------------
>>>>>>> It is scaling with the increased number of shards/threads. The
>>>>>>> parallelism also increased significantly. It is disk bound now.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Summary:
>>>>>>>
>>>>>>> So, it seems buffered IO has a significant impact on performance when
>>>>>>> the backend is an SSD.
>>>>>>> My question is: if the workload is very random and the storage (SSD) is
>>>>>>> very large compared to system memory, shouldn't we always go for
>>>>>>> direct_io instead of buffered io from Ceph?
>>>>>>>
>>>>>>> Please share your thoughts/suggestions on this.
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> ________________________________
>>>>>>>
>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>> message is intended only for the use of the designated recipient(s)
>>>>>>> named above. If the reader of this message is not the intended
>>>>>>> recipient, you are hereby notified that you have received this message
>>>>>>> in error and that any review, dissemination, distribution, or copying
>>>>>>> of this message is strictly prohibited. If you have received this
>>>>>>> communication in error, please notify the sender by telephone or
>>>>>>> e-mail (as shown above) immediately and destroy any and all copies of
>>>>>>> this message in your possession (whether hard copies or electronically
>>>>>>> stored copies).
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
>>>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Milosz Tanski
>>>>>> CTO
>>>>>> 16 East 34th Street, 15th floor
>>>>>> New York, NY 10016
>>>>>>
>>>>>> p: 646-253-9055
>>>>>> e: milosz@xxxxxxxxx
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>>>>> info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>>
>>>> Wheat
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>

I wonder how much (if any) using posix_fadvise with the POSIX_FADV_RANDOM
hint would help in this case, as that tells the kernel not to perform
(aggressive) read-ahead. Sadly, POSIX_FADV_NOREUSE is a no-op in current
kernels, although there have been patches floating around over the years to
implement it.

http://lxr.free-electrons.com/source/mm/fadvise.c#L113 and
http://thread.gmane.org/gmane.linux.file-systems/61511

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
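
To make the fadvise suggestion above concrete, here is a minimal sketch of
applying POSIX_FADV_RANDOM before a buffered read. It is only an illustration
of the hint, not Ceph code, and the object file path is hypothetical:

    /* Sketch: keep buffered reads, but ask the kernel to skip readahead. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int rc;

        /* Hypothetical object file path, for illustration only. */
        int fd = open("/var/lib/ceph/osd/object_file", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* len == 0 applies the hint to the whole file: random access pattern,
         * so the kernel avoids (aggressive) readahead. Unlike O_DIRECT, data
         * still lands in the page cache for potential re-reads. */
        rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

        /* Ordinary buffered read. */
        if (pread(fd, buf, sizeof(buf), 0) < 0)
            perror("pread");

        close(fd);
        return 0;
    }

With the hint in place, a 4K request should reach the device as a 4K read
(much like the O_DIRECT results above), while hot data can still be served
from the page cache on re-reads.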