Re: CephFS read IO caching, where it is happening?

Ahmed Khuraidah <abushihab@xxxxxxxxx> · Wed, 8 Feb 2017 16:28:33 +0300

Alright, I just redeployed the Ubuntu box. Here is what you requested (server machine: ubcephnode, client machine: ubpayload):

ahmed@ubcephnode:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

ahmed@ubcephnode:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

ahmed@ubcephnode:~$ uname -a
Linux ubcephnode 4.4.0-62-generic #83~14.04.1-Ubuntu SMP Wed Jan 18 18:10:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ahmed@ubpayload:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

ahmed@ubpayload:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

ahmed@ubpayload:~$ uname -a
Linux ubpayload 4.4.0-62-generic #83~14.04.1-Ubuntu SMP Wed Jan 18 18:10:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ahmed@ubcephnode:~$ cat /etc/ceph/ceph.conf
[global]
fsid = 7c39c59a-4951-4798-9c42-59da474afd26
mon_initial_members = ubcephnode
mon_host = 192.168.10.120
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
mds_log = false

ahmed@ubpayload:~$ mount
..
192.168.10.120:6789:/ on /mnt/mycephfs type ceph (name=admin,key=client.admin)

ahmed@ubcephnode:~$ ceph -s
    cluster 7c39c59a-4951-4798-9c42-59da474afd26
     health HEALTH_ERR
            mds rank 0 is damaged
            mds cluster is degraded
     monmap e1: 1 mons at {ubcephnode=192.168.10.120:6789/0}
            election epoch 3, quorum 0 ubcephnode
      fsmap e11: 0/1/1 up, 1 up:standby, 1 damaged
     osdmap e12: 1 osds: 1 up, 1 in
            flags sortbitwise,require_jewel_osds
      pgmap v32: 204 pgs, 3 pools, 3072 MB data, 787 objects
            3109 MB used, 48064 MB / 51173 MB avail
                 204 active+clean

--- begin dump of recent events ---
     0> 2017-02-08 06:50:16.206926 7f306a642700 -1 *** Caught signal (Aborted) **
 in thread 7f306a642700 thread_name:ms_dispatch

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (()+0x4f62b2) [0x556839b472b2]
 2: (()+0x10330) [0x7f307084a330]
 3: (gsignal()+0x37) [0x7f306ecd2c37]
 4: (abort()+0x148) [0x7f306ecd6028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x556839c3d135]
 6: (MutationImpl::~MutationImpl()+0x28e) [0x5568398f7b5e]
 7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x55683986ac49]
 8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x5568399e0947]
 9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x5568399e1ae1]
 10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x5568399e243d]
 11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1dc) [0x5568399e269c]
 12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x556839871dac]
 13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x55683987aa01]
 14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55683987bb55]
 15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x556839863653]
 16: (DispatchQueue::entry()+0x78b) [0x556839d3772b]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x556839c2280d]
 18: (()+0x8184) [0x7f3070842184]
 19: (clone()+0x6d) [0x7f306ed9637d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

On Wed, Feb 8, 2017 at 9:13 AM, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:

On Wed, Feb 8, 2017 at 3:05 PM, Ahmed Khuraidah <abushihab@xxxxxxxxx> wrote:
Hi Shinobu, I am using the SUSE packages that ship with their latest SUSE Enterprise Storage 4, deployed with ceph-deploy as described in their documentation.
However, I was also able to reproduce this issue on Ubuntu 14.04 with the Ceph repositories (also the latest Jewel and ceph-deploy).

The Ubuntu box is running the community Ceph packages, right?
If so, please run `ceph -v` on the Ubuntu box.

Please also provide the details of the same issue as you hit it on the SUSE box.

On Wed, Feb 8, 2017 at 3:03 AM, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
Are you using the open source Ceph packages or the SUSE ones?

On Sat, Feb 4, 2017 at 3:54 PM, Ahmed Khuraidah <abushihab@xxxxxxxxx> wrote:
I have opened a ticket on http://tracker.ceph.com/

http://tracker.ceph.com/issues/18816

My client and server kernels are the same; here is the info:
# lsb_release -a
LSB Version:    n/a
Distributor ID: SUSE
Description:    SUSE Linux Enterprise Server 12 SP2
Release:        12.2
Codename:       n/a
# uname -a
Linux cephnode 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux

Thanks

On Fri, Feb 3, 2017 at 1:59 PM, John Spray <jspray@xxxxxxxxxx> wrote:
On Fri, Feb 3, 2017 at 8:07 AM, Ahmed Khuraidah <abushihab@xxxxxxxxx> wrote:

> Thank you guys,
>
> I tried adding the option "exec_prerun=echo 3 > /proc/sys/vm/drop_caches" as
> well as "exec_prerun=echo 3 | sudo tee /proc/sys/vm/drop_caches", but
> although FIO reports that the command was executed, nothing changes.
>
> However, I ran into another, very strange behavior. If I run my FIO test
> (the 3G file case) twice, the first run creates the file and reports a lot
> of IOps, as already described. But if I drop the cache before the second
> run (as root: echo 3 > /proc/sys/vm/drop_caches), I end up with a broken
> MDS:
>
> --- begin dump of recent events ---
>      0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal (Aborted) **
>  in thread 7f7e8ec5e700 thread_name:ms_dispatch
>
>  ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)
>  1: (()+0x5142a2) [0x557c51e092a2]
>  2: (()+0x10b00) [0x7f7e95df2b00]
>  3: (gsignal()+0x37) [0x7f7e93ccb8d7]
>  4: (abort()+0x13a) [0x7f7e93ccccaa]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x557c51f133d5]
>  6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
>  7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x557c51b2ccf9]
>  8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
>  9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x557c51ca38f1]
>  10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x557c51ca424d]
>  11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc) [0x557c51ca449c]
>  12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x557c51b33d3c]
>  13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
>  14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
>  15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
>  16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
>  17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
>  18: (()+0x8734) [0x7f7e95dea734]
>  19: (clone()+0x6d) [0x7f7e93d80d3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Oops!  Please could you open a ticket on tracker.ceph.com, with this
backtrace, the client versions, any non-default config settings, and
the series of operations that led up to it.

Thanks,
John
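
For a ticket, the series of operations described in the quoted report (FIO run, cache drop as root, second FIO run) can be written down as a short script. This is only a sketch: JOBFILE, DRY_RUN, step and repro are hypothetical names chosen for illustration, the job file name is taken from the fio command shown later in this thread, and with DRY_RUN=1 (the default here) the steps are only printed. The sequence was reported to crash the MDS, so it should only ever be run for real on a throwaway sandbox.

```shell
#!/bin/sh
# Sketch of the reported sequence: fio run, cache drop, second fio run.
# JOBFILE and DRY_RUN are hypothetical knobs, not fio or Ceph options.
JOBFILE="${JOBFILE:-payloadrandread64k3G}"
DRY_RUN="${DRY_RUN:-1}"

# Print one step; execute it only when DRY_RUN is not 1.
step() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        sh -c "$*"
    fi
}

repro() {
    step "fio $JOBFILE"                       # first run lays out the file
    step "echo 3 > /proc/sys/vm/drop_caches"  # needs root when run for real
    step "fio $JOBFILE"                       # second run, after the cache drop
}

repro
```

Pasting the printed steps into the ticket, together with `ceph -v` from client and server, covers most of what is asked for above.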

>
> On Thu, Feb 2, 2017 at 9:30 PM, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>
>> You may want to add this in your FIO recipe.
>>
>>  * exec_prerun=echo 3 > /proc/sys/vm/drop_caches
>>
>> Regards,
>>
>> On Fri, Feb 3, 2017 at 12:36 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >
>> >> On 2 February 2017 at 15:35, Ahmed Khuraidah <abushihab@xxxxxxxxx> wrote:
>> >>
>> >>
>> >> Hi all,
>> >>
>> >> I am still confused about my CephFS sandbox.
>> >>
>> >> When I run a simple FIO test against a single 3G file, I get far too
>> >> many IOps:
>> >>
>> >> cephnode:~ # fio payloadrandread64k3G
>> >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio, iodepth=2
>> >> fio-2.13
>> >> Starting 1 process
>> >> test: Laying out IO file(s) (1 file(s) / 3072MB)
>> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [277.8MB/0KB/0KB /s] [4444/0/0 iops] [eta 00m:00s]
>> >> test: (groupid=0, jobs=1): err= 0: pid=3714: Thu Feb  2 07:07:01 2017
>> >>   read : io=3072.0MB, bw=181101KB/s, iops=2829, runt= 17370msec
>> >>     slat (usec): min=4, max=386, avg=12.49, stdev= 6.90
>> >>     clat (usec): min=202, max=5673.5K, avg=690.81, stdev=361
>> >>
>> >>
>> >> But if I change the file size to 320G, it looks like I bypass the cache:
>> >>
>> >> cephnode:~ # fio payloadrandread64k320G
>> >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio, iodepth=2
>> >> fio-2.13
>> >> Starting 1 process
>> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [4740KB/0KB/0KB /s] [74/0/0 iops] [eta 00m:00s]
>> >> test: (groupid=0, jobs=1): err= 0: pid=3624: Thu Feb  2 06:51:09 2017
>> >>   read : io=3410.9MB, bw=11641KB/s, iops=181, runt=300033msec
>> >>     slat (usec): min=4, max=442, avg=14.43, stdev=10.07
>> >>     clat (usec): min=98, max=286265, avg=10976.32, stdev=14904.82
>> >>
>> >>
>> >> The random write test shows no such behavior; the results are almost
>> >> the same for both file sizes, around 100 IOps.
>> >>
>> >> So my question: could somebody please clarify where this caching is
>> >> likely happening and how to manage it?
>> >>
>> >
>> > The page cache of your kernel. The kernel will cache the file in memory
>> > and perform read operations from there.
>> >
>> > The best way is to reboot your client between test runs. Although you can
>> > drop the kernel caches, I always reboot to make sure nothing is cached locally.
>> >
>> > Wido
>> >
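
A minimal sketch of dropping the kernel caches by hand between runs, as the alternative to a reboot mentioned above. The function name drop_caches_now and its optional path argument are assumptions made for illustration (the argument exists only so the function can be tried against a scratch file). Writing to /proc/sys/vm/drop_caches needs root, and it only affects the machine it is run on.

```shell
#!/bin/sh
# Sketch: flush dirty pages, then ask the kernel to drop its clean caches.
# drop_caches_now is a hypothetical helper name, not part of fio or Ceph.
drop_caches_now() {
    # Optional argument is for dry testing only; the default is the real knob.
    target="${1:-/proc/sys/vm/drop_caches}"
    sync                      # write dirty pages back first; only clean pages can be dropped
    printf '3\n' > "$target"  # 3 = page cache plus dentries and inodes
}
```

Called as root with no argument, `drop_caches_now` would be run before each FIO run. Note that a plain `sudo echo 3 > /proc/sys/vm/drop_caches` does not work from an unprivileged shell, because the redirection is performed by the calling shell rather than by sudo.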

>> >> P.S.
>> >> This is the latest SLES/Jewel-based one-node setup, which has:
>> >> 1 MON, 1 MDS (both data and metadata pools on a SATA drive) and 1 OSD
>> >> (XFS on SATA and journal on SSD).
>> >> My FIO config file:
>> >> direct=1
>> >> buffered=0
>> >> ioengine=libaio
>> >> iodepth=2
>> >> runtime=300
>> >>
>> >> Thanks
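
The config quoted above lists only part of the job. A sketch of what a complete job file might look like follows, written out by a small shell script. The job name (test), rw=randread, bs=64k and the 3G size are read off the FIO output quoted in this thread; JOBFILE, TESTDIR and the directory= line are assumptions, with /mnt/mycephfs borrowed from the Ubuntu client's mount output and possibly different on the SUSE box.

```shell
#!/bin/sh
# Sketch: write a complete fio job file from the options quoted in this thread.
# JOBFILE and TESTDIR are hypothetical names; point TESTDIR at the CephFS mount.
JOBFILE="${JOBFILE:-payloadrandread64k3G}"
TESTDIR="${TESTDIR:-/mnt/mycephfs}"

cat > "$JOBFILE" <<EOF
[test]
directory=$TESTDIR
rw=randread
bs=64k
size=3G
direct=1
buffered=0
ioengine=libaio
iodepth=2
runtime=300
EOF
```

In fio, buffered is the inverse of direct, so direct=1 and buffered=0 say the same thing twice; both are kept here only because the original config has them.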

>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com