CephFS/ceph-fuse performance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4, kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good generally, but we're currently running into some strange performance issues with one of our applications.  The client in this case is on a higher latency link - it is about 2.5ms away from all the ceph server nodes (all ceph server nodes are near each other on 10/40Ggbps local ethernet, only the client is "away").

The application is reading contiguous data at 64k chunks, the strace (-tt -T flags) looks something like:

06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) = 65536 <0.024052>
06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) = 65536 <0.023990>
06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) = 65536 <0.024053>
06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) = 65536 <0.024574>
06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) = 65536 <0.023795>
06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) = 65536 <0.023614>
06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) = 65536 <0.024037>

so each 64k read takes about 23-24ms.  The client has the file open for read, the machine is not busy (load of 0.2), neither are the ceph nodes.  The fuse client seems pretty idle also.

Increasing the log level to 20 for 'client' and 'objectcacher' on ceph-fuse, it looks like ceph-fuse gets ll_read requests of 4k in size, and it looks like it does an async read from the OSDs in 4k chunks (if I'm interpreting the logs right).  Here is a trace of one ll_read:

2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read 0x5556dadfc1a0 0x1000d092e5f  238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) 238173646848~4096
2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661  min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096 left, bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {} outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed, waiting on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer 0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661 get_cap_ref got first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on 0x55555f138680 0x1000d092e5f.head now 4
2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher bh_read_finish 1000d092e5f.0000ddd1/head tid 29067611 94208~4096 (bl is 4096) returned 0 outstanding reads 1
2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = { 94208->[0x5557007383a0, ]}
2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher bh_read_finish read bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
2018-06-06 08:14:17.612344 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
2018-06-06 08:14:17.612346 7fffefa45700 20 objectcacher finishing waiters 0x5557007383a0
2018-06-06 08:14:17.612347 7fffefa45700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.612349 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 94208~4096
2018-06-06 08:14:17.612350 7fffefa45700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
2018-06-06 08:14:17.612352 7fffefa45700 10 objectcacher readx hit bh bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {}
2018-06-06 08:14:17.612354 7fffefa45700 10 objectcacher readx rmap opos 94208: bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46] waiters = {} +0 frag 0~4096 +0~4096
2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx has all buffers
2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx  adding buffer len 4096 at 0
2018-06-06 08:14:17.612358 7fffefa45700 10 objectcacher readx  result is 4096
2018-06-06 08:14:17.612359 7fffefa45700 20 objectcacher readx done 0x55570211ec00 4096
2018-06-06 08:14:17.612360 7fffefa45700 10 objectcacher trim  start: bytes: max 2147483640  clean 2145931264, objects: max 8192 current 8192
2018-06-06 08:14:17.612361 7fffefa45700 10 objectcacher trim finish:  max 2147483640  clean 2145931264, objects: max 8192 current 8192
2018-06-06 08:14:17.612410 7fffe7a35700  5 client.16794661 put_cap_ref dropped last FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612438 7fffe7a35700 10 client.16794661 put_inode on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612446 7fffe7a35700 15 inode.put on 0x55555f138680 0x1000d092e5f.head now 3

The 2-3ms trip to the OSD seems clear  between 17.609s to 17.612s.  This is also confirmed by running ' objecter_requests' through the admin socket of the client. Then immediately the next ll_read starts:

2018-06-06 08:14:17.612503 7fffe6a33700  3 client.16794661 ll_read 0x5556dadfc1a0 0x1000d092e5f  238173655040~4096
2018-06-06 08:14:17.612516 7fffe6a33700 10 client.16794661 get_caps 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.612526 7fffe6a33700 10 client.16794661 _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680) 238173655040~4096
2018-06-06 08:14:17.612533 7fffe6a33700 10 client.16794661  min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.612539 7fffe6a33700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
2018-06-06 08:14:17.612542 7fffe6a33700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 102400~4096
2018-06-06 08:14:17.612545 7fffe6a33700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096 left, bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.612548 7fffe6a33700  7 objectcacher bh_read on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing] waiters = {} outstanding reads 0
2018-06-06 08:14:17.612587 7fffe6a33700 10 objectcacher readx missed, waiting on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters = {} off 102400
2018-06-06 08:14:17.612591 7fffe6a33700 20 objectcacher readx defer 0x55568850eee0
2018-06-06 08:14:17.612591 7fffe6a33700  5 client.16794661 get_cap_ref got first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.612598 7fffe6a33700 15 inode.get on 0x55555f138680 0x1000d092e5f.head now 4
2018-06-06 08:14:17.615300 7fffefa45700  7 objectcacher bh_read_finish 1000d092e5f.0000ddd1/head tid 29067612 102400~4096 (bl is 4096) returned 0 outstanding reads 1
2018-06-06 08:14:17.615321 7fffefa45700 20 objectcacher checking bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters = { 102400->[0x555724927d10, ]}
2018-06-06 08:14:17.615324 7fffefa45700 10 objectcacher bh_read_finish read bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
2018-06-06 08:14:17.615327 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
2018-06-06 08:14:17.615329 7fffefa45700 20 objectcacher finishing waiters 0x555724927d10
2018-06-06 08:14:17.615330 7fffefa45700 10 objectcacher readx extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
2018-06-06 08:14:17.615332 7fffefa45700 10 objectcacher.object(1000d092e5f.0000ddd1/head) map_read 1000d092e5f.0000ddd1 102400~4096
2018-06-06 08:14:17.615349 7fffefa45700 20 objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
2018-06-06 08:14:17.615351 7fffefa45700 10 objectcacher readx hit bh bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {}
2018-06-06 08:14:17.615353 7fffefa45700 10 objectcacher readx rmap opos 102400: bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean firstbyte=58] waiters = {} +0 frag 0~4096 +0~4096
2018-06-06 08:14:17.615356 7fffefa45700 10 objectcacher readx has all buffers
2018-06-06 08:14:17.615357 7fffefa45700 10 objectcacher readx  adding buffer len 4096 at 0
2018-06-06 08:14:17.615358 7fffefa45700 10 objectcacher readx  result is 4096
2018-06-06 08:14:17.615358 7fffefa45700 20 objectcacher readx done 0x55568850eee0 4096
2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim  start: bytes: max 2147483640  clean 2145935360, objects: max 8192 current 8192
2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim finish:  max 2147483640  clean 2145935360, objects: max 8192 current 8192
2018-06-06 08:14:17.615432 7fffe6a33700  5 client.16794661 put_cap_ref dropped last FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.615462 7fffe6a33700 10 client.16794661 put_inode on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31 cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664 size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0] parents=0x55555f187680 0x55555f138680)
2018-06-06 08:14:17.615470 7fffe6a33700 15 inode.put on 0x55555f138680 0x1000d092e5f.head now 3

We're running with increase client limits in ceph.conf:
client_cache_size = 262144
client_oc_max_dirty = 536870912
client_oc_max_objects = 8192
client_oc_size = 2147483640
client_readahead_max_periods = 64
client_readahead_min = 4194304

Some things I'm confused about:
  • This node has exclusive access to the file, no other node is reading/writing/accessing the file.  This seems to be confirmed by the caps, it has Fr/Fc and almost everything else.  I also don't see any trips to the MDS for caps, so no problems there.  Yet, for some reason, ceph-fuse goes to the OSD to read 4k chunks without any read-ahead or larger reads of some kind.  So it suffers the 2-3ms latency for every 4k read.  Why would this be?  I.e. why is read ahead not working here?
  • Why does ceph-fuse get such tiny read requests from the kernel when the client application is reading in 64k chunks?  I see on various other clients (on other nodes) the kernel issuing 128k chunk read requests generally to fuse.
  • Something else that is strange is that successive ll_read offsets are 8k apart, even though the reads are 4k in size (238173646848~4096 followed by 238173655040~4096 for example).  I don't understand how the 4k read becomes somehow 8k along the way.

Perhaps some settings I have aren't quite right about caching here?  Any hints/suggestions would be welcome. I'd also be interested in any hints/pointers on the internals to help me better understand how things are supposed to work.

Thanks,

Andras

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux