We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4,
kernel 3.10.0-693.5.2.el7.x86_64). Performance has been very good
generally, but we're currently running into some strange performance issues
with one of our applications. The client in this case is on a higher
latency link - it is about 2.5ms away from all the ceph server nodes (all
ceph server nodes are near each other on 10/40Ggbps local ethernet, only the
client is "away").
The application is reading contiguous data at 64k chunks, the strace (-tt -T
flags) looks something like:
06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) =
65536 <0.024052>
06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) =
65536 <0.023990>
06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) =
65536 <0.024053>
06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) =
65536 <0.024574>
06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) =
65536 <0.023795>
06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) =
65536 <0.023614>
06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) =
65536 <0.024037>
so each 64k read takes about 23-24ms. The client has the file open for
read, the machine is not busy (load of 0.2), neither are the ceph nodes.
The fuse client seems pretty idle also.
Increasing the log level to 20 for 'client' and 'objectcacher' on ceph-fuse,
it looks like ceph-fuse gets ll_read requests of 4k in size, and it looks
like it does an async read from the OSDs in 4k chunks (if I'm interpreting
the logs right). Here is a trace of one ll_read:
2018-06-06 08:14:17.609495 7fffe7a35700 3 client.16794661 ll_read
0x5556dadfc1a0 0x1000d092e5f 238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) have
pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) 238173646848~4096
2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1
94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096 left, bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700 7 objectcacher bh_read on bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed,
waiting on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters
= {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700 5 client.16794661 get_cap_ref got
first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on 0x55555f138680
0x1000d092e5f.head now 4
2018-06-06 08:14:17.612318 7fffefa45700 7 objectcacher bh_read_finish
1000d092e5f.0000ddd1/head tid 29067611 94208~4096 (bl is 4096) returned 0
outstanding reads 1
2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
94208->[0x5557007383a0, ]}
2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher bh_read_finish read
bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
waiters = {}
2018-06-06 08:14:17.612344 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
waiters = {}
2018-06-06 08:14:17.612346 7fffefa45700 20 objectcacher finishing waiters
0x5557007383a0
2018-06-06 08:14:17.612347 7fffefa45700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.612349 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1
94208~4096
2018-06-06 08:14:17.612350 7fffefa45700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
waiters = {}
2018-06-06 08:14:17.612352 7fffefa45700 10 objectcacher readx hit bh bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
waiters = {}
2018-06-06 08:14:17.612354 7fffefa45700 10 objectcacher readx rmap opos
94208: bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
firstbyte=46] waiters = {} +0 frag 0~4096 +0~4096
2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx has all
buffers
2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx adding buffer
len 4096 at 0
2018-06-06 08:14:17.612358 7fffefa45700 10 objectcacher readx result is
4096
2018-06-06 08:14:17.612359 7fffefa45700 20 objectcacher readx done
0x55570211ec00 4096
2018-06-06 08:14:17.612360 7fffefa45700 10 objectcacher trim start: bytes:
max 2147483640 clean 2145931264, objects: max 8192 current 8192
2018-06-06 08:14:17.612361 7fffefa45700 10 objectcacher trim finish: max
2147483640 clean 2145931264, objects: max 8192 current 8192
2018-06-06 08:14:17.612410 7fffe7a35700 5 client.16794661 put_cap_ref
dropped last FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=4
ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612438 7fffe7a35700 10 client.16794661 put_inode on
0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612446 7fffe7a35700 15 inode.put on 0x55555f138680
0x1000d092e5f.head now 3
The 2-3ms trip to the OSD seems clear between 17.609s to 17.612s. This is
also confirmed by running ' objecter_requests' through the admin socket of
the client. Then immediately the next ll_read starts:
2018-06-06 08:14:17.612503 7fffe6a33700 3 client.16794661 ll_read
0x5556dadfc1a0 0x1000d092e5f 238173655040~4096
2018-06-06 08:14:17.612516 7fffe6a33700 10 client.16794661 get_caps
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) have
pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.612526 7fffe6a33700 10 client.16794661 _read_async
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) 238173655040~4096
2018-06-06 08:14:17.612533 7fffe6a33700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.612539 7fffe6a33700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
2018-06-06 08:14:17.612542 7fffe6a33700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1
102400~4096
2018-06-06 08:14:17.612545 7fffe6a33700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096 left, bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.612548 7fffe6a33700 7 objectcacher bh_read on bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
outstanding reads 0
2018-06-06 08:14:17.612587 7fffe6a33700 10 objectcacher readx missed,
waiting on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters
= {} off 102400
2018-06-06 08:14:17.612591 7fffe6a33700 20 objectcacher readx defer
0x55568850eee0
2018-06-06 08:14:17.612591 7fffe6a33700 5 client.16794661 get_cap_ref got
first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612598 7fffe6a33700 15 inode.get on 0x55555f138680
0x1000d092e5f.head now 4
2018-06-06 08:14:17.615300 7fffefa45700 7 objectcacher bh_read_finish
1000d092e5f.0000ddd1/head tid 29067612 102400~4096 (bl is 4096) returned 0
outstanding reads 1
2018-06-06 08:14:17.615321 7fffefa45700 20 objectcacher checking bh bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters = {
102400->[0x555724927d10, ]}
2018-06-06 08:14:17.615324 7fffefa45700 10 objectcacher bh_read_finish read
bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58]
waiters = {}
2018-06-06 08:14:17.615327 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58]
waiters = {}
2018-06-06 08:14:17.615329 7fffefa45700 20 objectcacher finishing waiters
0x555724927d10
2018-06-06 08:14:17.615330 7fffefa45700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
2018-06-06 08:14:17.615332 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1
102400~4096
2018-06-06 08:14:17.615349 7fffefa45700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58]
waiters = {}
2018-06-06 08:14:17.615351 7fffefa45700 10 objectcacher readx hit bh bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58]
waiters = {}
2018-06-06 08:14:17.615353 7fffefa45700 10 objectcacher readx rmap opos
102400: bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean
firstbyte=58] waiters = {} +0 frag 0~4096 +0~4096
2018-06-06 08:14:17.615356 7fffefa45700 10 objectcacher readx has all
buffers
2018-06-06 08:14:17.615357 7fffefa45700 10 objectcacher readx adding buffer
len 4096 at 0
2018-06-06 08:14:17.615358 7fffefa45700 10 objectcacher readx result is
4096
2018-06-06 08:14:17.615358 7fffefa45700 20 objectcacher readx done
0x55568850eee0 4096
2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim start: bytes:
max 2147483640 clean 2145935360, objects: max 8192 current 8192
2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim finish: max
2147483640 clean 2145935360, objects: max 8192 current 8192
2018-06-06 08:14:17.615432 7fffe6a33700 5 client.16794661 put_cap_ref
dropped last FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=4
ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.615462 7fffe6a33700 10 client.16794661 put_inode on
0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.615470 7fffe6a33700 15 inode.put on 0x55555f138680
0x1000d092e5f.head now 3
We're running with increase client limits in ceph.conf:
client_cache_size = 262144
client_oc_max_dirty = 536870912
client_oc_max_objects = 8192
client_oc_size = 2147483640
client_readahead_max_periods = 64
client_readahead_min = 4194304
Some things I'm confused about:
This node has exclusive access to the file, no other node is
reading/writing/accessing the file. This seems to be confirmed by the caps,
it has Fr/Fc and almost everything else. I also don't see any trips to the
MDS for caps, so no problems there. Yet, for some reason, ceph-fuse goes to
the OSD to read 4k chunks without any read-ahead or larger reads of some
kind. So it suffers the 2-3ms latency for every 4k read. Why would this
be? I.e. why is read ahead not working here?
Why does ceph-fuse get such tiny read requests from the kernel when the
client application is reading in 64k chunks? I see on various other clients
(on other nodes) the kernel issuing 128k chunk read requests generally to
fuse.
Something else that is strange is that successive ll_read offsets are 8k
apart, even though the reads are 4k in size (238173646848~4096 followed by
238173655040~4096 for example). I don't understand how the 4k read becomes
somehow 8k along the way.
Perhaps some settings I have aren't quite right about caching here? Any
hints/suggestions would be welcome. I'd also be interested in any
hints/pointers on the internals to help me better understand how things are
supposed to work.