On Wed, Jun 6, 2018 at 5:52 AM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4,
> kernel 3.10.0-693.5.2.el7.x86_64). Performance has generally been very
> good, but we're currently running into some strange performance issues
> with one of our applications. The client in this case is on a higher
> latency link - it is about 2.5ms away from all the ceph server nodes (all
> ceph server nodes are near each other on 10/40Gbps local ethernet; only
> the client is "away").
>
> The application is reading contiguous data in 64k chunks; the strace
> (-tt -T flags) looks something like:
>
> 06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) = 65536 <0.024052>
> 06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) = 65536 <0.023990>
> 06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) = 65536 <0.024053>
> 06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) = 65536 <0.024574>
> 06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) = 65536 <0.023795>
> 06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) = 65536 <0.023614>
> 06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) = 65536 <0.024037>
>
> so each 64k read takes about 23-24ms. The client has the file open for
> read, the machine is not busy (load of 0.2), and neither are the ceph
> nodes. The fuse client seems pretty idle as well.
>
> Increasing the log level to 20 for 'client' and 'objectcacher' on
> ceph-fuse, it looks like ceph-fuse gets ll_read requests of 4k in size,
> and it looks like it does an async read from the OSDs in 4k chunks (if
> I'm interpreting the logs right).
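[Editor's note: the arithmetic behind the strace numbers above, as a quick
sanity check. The figures come from the trace; the one-round-trip bound is
my own assumption, not a claim from the thread.]

```python
# Back-of-the-envelope check on the strace above: at ~24ms per 64KiB
# read() the stream is latency-bound, not bandwidth-bound.

READ_SIZE = 64 * 1024          # bytes per read() syscall, from strace
SYSCALL_LATENCY = 0.024        # seconds, ~24ms per read() from strace

throughput = READ_SIZE / SYSCALL_LATENCY           # bytes/second
print(f"effective throughput: {throughput / 1e6:.2f} MB/s")

# Hypothetical bound: if each 64KiB read cost a single ~2.5ms round
# trip (the stated client<->cluster latency), the same stream would
# run roughly ten times faster.
RTT = 0.0025
print(f"one-RTT-per-64KiB bound: {READ_SIZE / RTT / 1e6:.2f} MB/s")
```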
> Here is a trace of one ll_read:
>
> 2018-06-06 08:14:17.609495 7fffe7a35700 3 client.16794661 ll_read 0x5556dadfc1a0 0x1000d092e5f 238173646848~4096
> 2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need Fr want Fc revoking -
> 2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) 238173646848~4096
> 2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661 min_bytes=4194304 max_bytes=268435456 max_periods=64
> 2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
> 2018-06-06 08:14:17.609532 7fffe7a35700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 94208~4096
> 2018-06-06 08:14:17.609535 7fffe7a35700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096 left, bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
> 2018-06-06 08:14:17.609537 7fffe7a35700 7 objectcacher bh_read on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {} outstanding reads 0
> 2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed, waiting on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {} off 94208
> 2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer 0x55570211ec00
> 2018-06-06 08:14:17.609580 7fffe7a35700 5 client.16794661 get_cap_ref got first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
> 2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on 0x55555f138680 0x1000d092e5f.head now 4
> 2018-06-06 08:14:17.612318 7fffefa45700 7 objectcacher bh_read_finish 1000d092e5f.0000ddd1/head tid 29067611 94208~4096 (bl is 4096) returned 0 outstanding reads 1
> 2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = { 94208->[0x5557007383a0, ]}
> 2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher bh_read_finish read bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
> 2018-06-06 08:14:17.612344 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
> 2018-06-06 08:14:17.612346 7fffefa45700 20 objectcacher finishing waiters 0x5557007383a0
> 2018-06-06 08:14:17.612347 7fffefa45700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
> 2018-06-06 08:14:17.612349 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 94208~4096
> 2018-06-06 08:14:17.612350 7fffefa45700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
> 2018-06-06 08:14:17.612352 7fffefa45700 10 objectcacher readx hit bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
> 2018-06-06 08:14:17.612354 7fffefa45700 10 objectcacher readx rmap opos 94208: bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {} +0 frag 0~4096 +0~4096
> 2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx has all buffers
> 2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx adding buffer len 4096 at 0
> 2018-06-06 08:14:17.612358 7fffefa45700 10 objectcacher readx result is 4096
> 2018-06-06 08:14:17.612359 7fffefa45700 20 objectcacher readx done 0x55570211ec00 4096
> 2018-06-06 08:14:17.612360 7fffefa45700 10 objectcacher trim start: bytes: max 2147483640 clean 2145931264, objects: max 8192 current 8192
> 2018-06-06 08:14:17.612361 7fffefa45700 10 objectcacher trim finish: max 2147483640 clean 2145931264, objects: max 8192 current 8192
> 2018-06-06 08:14:17.612410 7fffe7a35700 5 client.16794661 put_cap_ref dropped last FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
> 2018-06-06 08:14:17.612438 7fffe7a35700 10 client.16794661 put_inode on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
> 2018-06-06 08:14:17.612446 7fffe7a35700 15 inode.put on 0x55555f138680 0x1000d092e5f.head now 3
>
>
> The 2-3ms trip to the OSD is clearly visible between 17.609s and 17.612s.
> This is also confirmed by running 'objecter_requests' through the admin
> socket of the client.
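[Editor's note: a quick model of how the per-miss round trip compounds. The
assumptions here - fully serial misses, one OSD round trip per cache miss,
~3ms per trip as seen in the log (17.609 -> 17.612), and the 8KiB stride
between successive ll_read offsets noted later in the mail - are my own
reading of the trace, not a statement from the thread.]

```python
# Editor's sketch: reconcile the strace latency with the ll_read trace,
# under the assumption that each cache miss costs one serial OSD round
# trip with no prefetch overlap.

RTT = 0.003            # seconds per 4KiB cache miss (from the log gap)
APP_READ = 64 * 1024   # bytes per application read() syscall
LL_STRIDE = 8 * 1024   # successive ll_read offsets are 8KiB apart

misses_per_read = APP_READ // LL_STRIDE   # 8 ll_reads per 64KiB read()
predicted_ms = misses_per_read * RTT * 1000
print(f"predicted read() latency: {predicted_ms:.0f} ms")

# The strace shows ~23-24ms per 64KiB read(), which is consistent with
# ~8 serial round trips per syscall rather than one prefetched chunk.
```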
> Then immediately the next ll_read starts:
>
> 2018-06-06 08:14:17.612503 7fffe6a33700 3 client.16794661 ll_read 0x5556dadfc1a0 0x1000d092e5f 238173655040~4096
> 2018-06-06 08:14:17.612516 7fffe6a33700 10 client.16794661 get_caps 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need Fr want Fc revoking -
> 2018-06-06 08:14:17.612526 7fffe6a33700 10 client.16794661 _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) 238173655040~4096
> 2018-06-06 08:14:17.612533 7fffe6a33700 10 client.16794661 min_bytes=4194304 max_bytes=268435456 max_periods=64
> 2018-06-06 08:14:17.612539 7fffe6a33700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
> 2018-06-06 08:14:17.612542 7fffe6a33700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 102400~4096
> 2018-06-06 08:14:17.612545 7fffe6a33700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096 left, bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
> 2018-06-06 08:14:17.612548 7fffe6a33700 7 objectcacher bh_read on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing] waiters = {} outstanding reads 0
> 2018-06-06 08:14:17.612587 7fffe6a33700 10 objectcacher readx missed, waiting on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters = {} off 102400
> 2018-06-06 08:14:17.612591 7fffe6a33700 20 objectcacher readx defer 0x55568850eee0
> 2018-06-06 08:14:17.612591 7fffe6a33700 5 client.16794661 get_cap_ref got first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
> 2018-06-06 08:14:17.612598 7fffe6a33700 15 inode.get on 0x55555f138680 0x1000d092e5f.head now 4
> 2018-06-06 08:14:17.615300 7fffefa45700 7 objectcacher bh_read_finish 1000d092e5f.0000ddd1/head tid 29067612 102400~4096 (bl is 4096) returned 0 outstanding reads 1
> 2018-06-06 08:14:17.615321 7fffefa45700 20 objectcacher checking bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters = { 102400->[0x555724927d10, ]}
> 2018-06-06 08:14:17.615324 7fffefa45700 10 objectcacher bh_read_finish read bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
> 2018-06-06 08:14:17.615327 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
> 2018-06-06 08:14:17.615329 7fffefa45700 20 objectcacher finishing waiters 0x555724927d10
> 2018-06-06 08:14:17.615330 7fffefa45700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
> 2018-06-06 08:14:17.615332 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 102400~4096
> 2018-06-06 08:14:17.615349 7fffefa45700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
> 2018-06-06 08:14:17.615351 7fffefa45700 10 objectcacher readx hit bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
> 2018-06-06 08:14:17.615353 7fffefa45700 10 objectcacher readx rmap opos 102400: bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {} +0 frag 0~4096 +0~4096
> 2018-06-06 08:14:17.615356 7fffefa45700 10 objectcacher readx has all buffers
> 2018-06-06 08:14:17.615357 7fffefa45700 10 objectcacher readx adding buffer len 4096 at 0
> 2018-06-06 08:14:17.615358 7fffefa45700 10 objectcacher readx result is 4096
> 2018-06-06 08:14:17.615358 7fffefa45700 20 objectcacher readx done 0x55568850eee0 4096
> 2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim start: bytes: max 2147483640 clean 2145935360, objects: max 8192 current 8192
> 2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim finish: max 2147483640 clean 2145935360, objects: max 8192 current 8192
> 2018-06-06 08:14:17.615432 7fffe6a33700 5 client.16794661 put_cap_ref dropped last FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
> 2018-06-06 08:14:17.615462 7fffe6a33700 10 client.16794661 put_inode on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
> 2018-06-06 08:14:17.615470 7fffe6a33700 15 inode.put on 0x55555f138680 0x1000d092e5f.head now 3
>
>
> We're running with increased client limits in ceph.conf:
>
> client_cache_size = 262144
> client_oc_max_dirty = 536870912
> client_oc_max_objects = 8192
> client_oc_size = 2147483640
> client_readahead_max_periods = 64
> client_readahead_min = 4194304
>
>
> Some things I'm confused about:
>
> This node has exclusive access to the file; no other node is
> reading/writing/accessing it. This seems to be confirmed by the caps: it
> has Fr/Fc and almost everything else. I also don't see any trips to the
> MDS for caps, so no problems there. Yet, for some reason, ceph-fuse goes
> to the OSDs to read 4k chunks without any read-ahead or larger reads of
> any kind, so it suffers the 2-3ms latency for every 4k read. Why would
> this be? I.e. why is read-ahead not working here?
> Why does ceph-fuse get such tiny read requests from the kernel when the
> client application is reading in 64k chunks? On various other clients
> (on other nodes) I generally see the kernel issuing 128k read requests
> to fuse.
> Something else that is strange is that successive ll_read offsets are 8k
> apart, even though the reads are 4k in size (238173646848~4096 followed
> by 238173655040~4096, for example). I don't understand how the 4k reads
> end up 8k apart along the way.
>
> Perhaps some settings I have aren't quite right for caching here? Any
> hints/suggestions would be welcome. I'd also be interested in any
> hints/pointers on the internals to help me better understand how things
> are supposed to work.

Is your cache size really set to 256k? I'm not sure how it leads directly
to the IO patterns you're seeing, but that seems like it might be a
problem. :)
-Greg

> Thanks,
>
> Andras
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com