Re: CephFS read IO caching, where is it happening?

Wido den Hollander <wido@xxxxxxxx> · Fri, 3 Feb 2017 11:54:57 +0100 (CET)

> On 3 February 2017 at 9:07, Ahmed Khuraidah <abushihab@xxxxxxxxx> wrote:
> 
> 
> Thank you guys,
> 
> I tried adding the option "exec_prerun=echo 3 > /proc/sys/vm/drop_caches" as
> well as "exec_prerun=echo 3 | sudo tee /proc/sys/vm/drop_caches", but
> although FIO reports that the command was executed, nothing changes.
> 

Try a reboot in between, that way you are sure the cache is clean.
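
If a reboot is not practical, one way to check whether a cache drop took effect is to compare the Cached figure in /proc/meminfo before and after. A minimal sketch, assuming a Linux client and root access; the helper names cached_kb and drop_page_cache are made up for illustration:

```shell
#!/bin/sh
# Print the "Cached:" value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument is only there
# so the parser can be tried against a sample file.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# Flush dirty pages, then drop page cache, dentries and inodes.
# Must run as root; writing 3 to drop_caches is the same step
# already used in this thread.
drop_page_cache() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# Example (as root):
#   before=$(cached_kb); drop_page_cache; after=$(cached_kb)
#   echo "Cached: ${before} kB -> ${after} kB"
```

If the Cached value barely moves, the drop did not happen on this host, or the data is being cached somewhere else.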

> But I caught another, very strange behavior. If I run my FIO test (the 3G
> file case) twice, the first run creates the file and reports a lot of IOPS,
> as already described. But if I drop the cache before the second run (as
> root: echo 3 > /proc/sys/vm/drop_caches), I end up with a broken MDS:
> 

That's not good! The MDS should not crash. Which kernel version is the client running, and on which OS?

$ lsb_release -a
$ uname -a

Wido

> --- begin dump of recent events ---
>      0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal (Aborted) **
>  in thread 7f7e8ec5e700 thread_name:ms_dispatch
> 
>  ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)
>  1: (()+0x5142a2) [0x557c51e092a2]
>  2: (()+0x10b00) [0x7f7e95df2b00]
>  3: (gsignal()+0x37) [0x7f7e93ccb8d7]
>  4: (abort()+0x13a) [0x7f7e93ccccaa]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x557c51f133d5]
>  6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
>  7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x557c51b2ccf9]
>  8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
>  9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x557c51ca38f1]
>  10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x557c51ca424d]
>  11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc) [0x557c51ca449c]
>  12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x557c51b33d3c]
>  13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
>  14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
>  15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
>  16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
>  17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
>  18: (()+0x8734) [0x7f7e95dea734]
>  19: (clone()+0x6d) [0x7f7e93d80d3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> On Thu, Feb 2, 2017 at 9:30 PM, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
> 
> > You may want to add this in your FIO recipe.
> >
> >  * exec_prerun=echo 3 > /proc/sys/vm/drop_caches
> >
> > Regards,
> >
> > On Fri, Feb 3, 2017 at 12:36 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > >
> > >> On 2 February 2017 at 15:35, Ahmed Khuraidah <abushihab@xxxxxxxxx> wrote:
> > >>
> > >>
> > >> Hi all,
> > >>
> > >> I am still confused about my CephFS sandbox.
> > >>
> > >> When I run a simple FIO test against a single 3G file, I get far more
> > >> IOPS than expected:
> > >>
> > >> cephnode:~ # fio payloadrandread64k3G
> > >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio, iodepth=2
> > >> fio-2.13
> > >> Starting 1 process
> > >> test: Laying out IO file(s) (1 file(s) / 3072MB)
> > >> Jobs: 1 (f=1): [r(1)] [100.0% done] [277.8MB/0KB/0KB /s] [4444/0/0 iops] [eta 00m:00s]
> > >> test: (groupid=0, jobs=1): err= 0: pid=3714: Thu Feb  2 07:07:01 2017
> > >>   read : io=3072.0MB, bw=181101KB/s, iops=2829, runt= 17370msec
> > >>     slat (usec): min=4, max=386, avg=12.49, stdev= 6.90
> > >>     clat (usec): min=202, max=5673.5K, avg=690.81, stdev=361
> > >>
> > >>
> > >> But if I change the file size to 320G, it looks like I bypass the cache:
> > >>
> > >> cephnode:~ # fio payloadrandread64k320G
> > >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio, iodepth=2
> > >> fio-2.13
> > >> Starting 1 process
> > >> Jobs: 1 (f=1): [r(1)] [100.0% done] [4740KB/0KB/0KB /s] [74/0/0 iops] [eta 00m:00s]
> > >> test: (groupid=0, jobs=1): err= 0: pid=3624: Thu Feb  2 06:51:09 2017
> > >>   read : io=3410.9MB, bw=11641KB/s, iops=181, runt=300033msec
> > >>     slat (usec): min=4, max=442, avg=14.43, stdev=10.07
> > >>     clat (usec): min=98, max=286265, avg=10976.32, stdev=14904.82
> > >>
> > >>
> > >> The random write test shows no such behavior; the results are almost the
> > >> same in both cases, around 100 IOPS.
> > >>
> > >> So my question: could somebody please clarify where this caching is likely
> > >> happening and how to control it?
> > >>
> > >
> > > The page cache of your kernel. The kernel will cache the file in memory
> > > and perform read operations from there.
> > >
> > > The best way is to reboot your client between test runs. Although you can
> > > drop kernel caches, I always reboot to make sure nothing is cached locally.
> > >
> > > Wido
> > >
> > >> P.S.
> > >> This is a single-node setup based on the latest SLES/Jewel, which has:
> > >> 1 MON, 1 MDS (both data and metadata pools on a SATA drive) and 1 OSD
> > >> (XFS on SATA, journal on SSD).
> > >> My FIO config file:
> > >> direct=1
> > >> buffered=0
> > >> ioengine=libaio
> > >> iodepth=2
> > >> runtime=300
> > >>
> > >> Thanks
> > >> _______________________________________________
> > >> ceph-users mailing list
> > >> ceph-users@xxxxxxxxxxxxxx
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
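
As a sanity check on the two fio summaries quoted above: with bs=64K, the reported IOPS should be roughly the reported bandwidth divided by the block size. A small sketch of that arithmetic; the function name iops_from_bw is made up, and integer division is assumed only because it happens to reproduce the figures fio printed:

```shell
#!/bin/sh
# Integer IOPS implied by a bandwidth (KB/s) and a block size (KB).
iops_from_bw() {
    echo $(( $1 / $2 ))
}

iops_from_bw 181101 64   # 3G run:   prints 2829
iops_from_bw 11641 64    # 320G run: prints 181
```

Both values match the iops= fields in the quoted output, so the two runs differ by about 15x in throughput at the same block size and queue depth.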