We're using CephFS with Luminous 12.2.5 and the fuse client (on
CentOS 7.4, kernel 3.10.0-693.5.2.el7.x86_64). Performance has been
very good generally, but we're currently running into some strange
performance issues with one of our applications. The client in this
case is on a higher latency link - it is about 2.5ms away from all
the ceph server nodes (all ceph server nodes are near each other on
10/40Ggbps local ethernet, only the client is "away").
The application is reading contiguous data at 64k chunks, the strace
(-tt -T flags) looks something like:
06:37:04.152667 read(3,
".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) = 65536
<0.024052>
06:37:04.178432 read(3,
",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) = 65536
<0.023990>
06:37:04.204087 read(3,
":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) = 65536
<0.024053>
06:37:04.229919 read(3,
"665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) = 65536
<0.024574>
06:37:04.255623 read(3,
":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) = 65536
<0.023795>
06:37:04.280914 read(3,
":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) = 65536
<0.023614>
06:37:04.306022 read(3,
"0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) = 65536
<0.024037>
so each 64k read takes about 23-24ms. The client has the file open
for read, the machine is not busy (load of 0.2), neither are the
ceph nodes. The fuse client seems pretty idle also.
Increasing the log level to 20 for 'client' and 'objectcacher' on
ceph-fuse, it looks like ceph-fuse gets ll_read requests of 4k in
size, and it looks like it does an async read from the OSDs in 4k
chunks (if I'm interpreting the logs right). Here is a trace of one
ll_read:
2018-06-06 08:14:17.609495 7fffe7a35700 3
client.16794661 ll_read 0x5556dadfc1a0 0x1000d092e5f
238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661
get_caps 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need
Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661
_read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680) 238173646848~4096
2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 ->
[0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read
1000d092e5f.0000ddd1 94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss
4096 left, bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0
missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700 7 objectcacher
bh_read on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0
missing] waiters = {} outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx
missed, waiting on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0
(0) v 0 rx] waiters = {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx
defer 0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700 5 client.16794661
get_cap_ref got first FILE_CACHE ref on
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on
0x55555f138680 0x1000d092e5f.head now 4
2018-06-06 08:14:17.612318 7fffefa45700 7 objectcacher
bh_read_finish 1000d092e5f.0000ddd1/head tid 29067611 94208~4096
(bl is 4096) returned 0 outstanding reads 1
2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher
checking bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0
rx] waiters = { 94208->[0x5557007383a0, ]}
2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher
bh_read_finish read bh[ 0x55555fecdd40 94208~4096 0x5556226235c0
(4096) v 0 clean firstbyte=46] waiters = {}
2018-06-06 08:14:17.612344 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
firstbyte=46] waiters = {}
2018-06-06 08:14:17.612346 7fffefa45700 20 objectcacher
finishing waiters 0x5557007383a0
2018-06-06 08:14:17.612347 7fffefa45700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 ->
[0,4096])
2018-06-06 08:14:17.612349 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read
1000d092e5f.0000ddd1 94208~4096
2018-06-06 08:14:17.612350 7fffefa45700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[
0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
firstbyte=46] waiters = {}
2018-06-06 08:14:17.612352 7fffefa45700 10 objectcacher readx
hit bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0
clean firstbyte=46] waiters = {}
2018-06-06 08:14:17.612354 7fffefa45700 10 objectcacher readx
rmap opos 94208: bh[ 0x55555fecdd40 94208~4096 0x5556226235c0
(4096) v 0 clean firstbyte=46] waiters = {} +0 frag 0~4096
+0~4096
2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx
has all buffers
2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx
adding buffer len 4096 at 0
2018-06-06 08:14:17.612358 7fffefa45700 10 objectcacher readx
result is 4096
2018-06-06 08:14:17.612359 7fffefa45700 20 objectcacher readx
done 0x55570211ec00 4096
2018-06-06 08:14:17.612360 7fffefa45700 10 objectcacher trim
start: bytes: max 2147483640 clean 2145931264, objects: max
8192 current 8192
2018-06-06 08:14:17.612361 7fffefa45700 10 objectcacher trim
finish: max 2147483640 clean 2145931264, objects: max 8192
current 8192
2018-06-06 08:14:17.612410 7fffe7a35700 5 client.16794661
put_cap_ref dropped last FILE_CACHE ref on
0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612438 7fffe7a35700 10 client.16794661
put_inode on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612446 7fffe7a35700 15 inode.put on
0x55555f138680 0x1000d092e5f.head now 3
The 2-3ms trip to the OSD seems clear between 17.609s to 17.612s.
This is also confirmed by running ' objecter_requests' through the
admin socket of the client. Then immediately the next ll_read
starts:
2018-06-06 08:14:17.612503 7fffe6a33700 3
client.16794661 ll_read 0x5556dadfc1a0 0x1000d092e5f
238173655040~4096
2018-06-06 08:14:17.612516 7fffe6a33700 10 client.16794661
get_caps 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need
Fr want Fc revoking -
2018-06-06 08:14:17.612526 7fffe6a33700 10 client.16794661
_read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680) 238173655040~4096
2018-06-06 08:14:17.612533 7fffe6a33700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.612539 7fffe6a33700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 ->
[0,4096])
2018-06-06 08:14:17.612542 7fffe6a33700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read
1000d092e5f.0000ddd1 102400~4096
2018-06-06 08:14:17.612545 7fffe6a33700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss
4096 left, bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0
missing] waiters = {}
2018-06-06 08:14:17.612548 7fffe6a33700 7 objectcacher
bh_read on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0
missing] waiters = {} outstanding reads 0
2018-06-06 08:14:17.612587 7fffe6a33700 10 objectcacher readx
missed, waiting on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0
(0) v 0 rx] waiters = {} off 102400
2018-06-06 08:14:17.612591 7fffe6a33700 20 objectcacher readx
defer 0x55568850eee0
2018-06-06 08:14:17.612591 7fffe6a33700 5 client.16794661
get_cap_ref got first FILE_CACHE ref on
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612598 7fffe6a33700 15 inode.get on
0x55555f138680 0x1000d092e5f.head now 4
2018-06-06 08:14:17.615300 7fffefa45700 7 objectcacher
bh_read_finish 1000d092e5f.0000ddd1/head tid 29067612
102400~4096 (bl is 4096) returned 0 outstanding reads 1
2018-06-06 08:14:17.615321 7fffefa45700 20 objectcacher
checking bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v
0 rx] waiters = { 102400->[0x555724927d10, ]}
2018-06-06 08:14:17.615324 7fffefa45700 10 objectcacher
bh_read_finish read bh[ 0x555653fd3b00 102400~4096
0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
2018-06-06 08:14:17.615327 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean
firstbyte=58] waiters = {}
2018-06-06 08:14:17.615329 7fffefa45700 20 objectcacher
finishing waiters 0x555724927d10
2018-06-06 08:14:17.615330 7fffefa45700 10 objectcacher readx
extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 ->
[0,4096])
2018-06-06 08:14:17.615332 7fffefa45700 10
objectcacher.object(1000d092e5f.0000ddd1/head) map_read
1000d092e5f.0000ddd1 102400~4096
2018-06-06 08:14:17.615349 7fffefa45700 20
objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[
0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean
firstbyte=58] waiters = {}
2018-06-06 08:14:17.615351 7fffefa45700 10 objectcacher readx
hit bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0
clean firstbyte=58] waiters = {}
2018-06-06 08:14:17.615353 7fffefa45700 10 objectcacher readx
rmap opos 102400: bh[ 0x555653fd3b00 102400~4096 0x5556226235c0
(4096) v 0 clean firstbyte=58] waiters = {} +0 frag 0~4096
+0~4096
2018-06-06 08:14:17.615356 7fffefa45700 10 objectcacher readx
has all buffers
2018-06-06 08:14:17.615357 7fffefa45700 10 objectcacher readx
adding buffer len 4096 at 0
2018-06-06 08:14:17.615358 7fffefa45700 10 objectcacher readx
result is 4096
2018-06-06 08:14:17.615358 7fffefa45700 20 objectcacher readx
done 0x55568850eee0 4096
2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim
start: bytes: max 2147483640 clean 2145935360, objects: max
8192 current 8192
2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim
finish: max 2147483640 clean 2145935360, objects: max 8192
current 8192
2018-06-06 08:14:17.615432 7fffe6a33700 5 client.16794661
put_cap_ref dropped last FILE_CACHE ref on
0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.615462 7fffe6a33700 10 client.16794661
put_inode on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
mode=100664 size=244712765330/249011634176 mtime=2018-06-05
00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.615470 7fffe6a33700 15 inode.put on
0x55555f138680 0x1000d092e5f.head now 3
We're running with increase client limits in ceph.conf:
client_cache_size = 262144
client_oc_max_dirty = 536870912
client_oc_max_objects = 8192
client_oc_size = 2147483640
client_readahead_max_periods = 64
client_readahead_min = 4194304
Some things I'm confused about:
- This node has exclusive access to the file, no other node is
reading/writing/accessing the file. This seems to be confirmed
by the caps, it has Fr/Fc and almost everything else. I also
don't see any trips to the MDS for caps, so no problems there.
Yet, for some reason, ceph-fuse goes to the OSD to read 4k
chunks without any read-ahead or larger reads of some kind. So
it suffers the 2-3ms latency for every 4k read. Why would this
be? I.e. why is read ahead not working here?
- Why does ceph-fuse get such tiny read requests from the kernel
when the client application is reading in 64k chunks? I see on
various other clients (on other nodes) the kernel issuing 128k
chunk read requests generally to fuse.
- Something else that is strange is that successive ll_read
offsets are 8k apart, even though the reads are 4k in size
(238173646848~4096 followed by 238173655040~4096 for example).
I don't understand how the 4k read becomes somehow 8k along the
way.
Perhaps some settings I have aren't quite right about caching here?
Any hints/suggestions would be welcome. I'd also be interested in
any hints/pointers on the internals to help me better understand how
things are supposed to work.
Thanks,
Andras
|