Hi.

I've got a bunch of "small" files moved onto CephFS as archive/bulk storage, and now I have the backup (to tape) to spool over them. A sample of the single-threaded backup client delivers this very consistent pattern:

$ sudo strace -T -p 7307 2>&1 | grep -A 7 -B 3 open
write(111, "\377\377\377\377", 4) = 4 <0.000011>
openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) = 38 <0.000030>
write(111, "\0\0\0\021197418 2 67201568", 21) = 21 <0.000036>
read(38, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.049733>
write(111, "\0\1\0\0CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0"..., 65540) = 65540 <0.000037>
read(38, " $$$$$$$$$$ $$$$$$\16\33\16 $$$$$$$$\16\33"..., 65536) = 65536 <0.000199>
write(111, "\0\1\0\0 $$$$$$$$$$ $$$$$$\16\33\16 $$$$$$"..., 65540) = 65540 <0.000026>
read(38, "$ \33 \16\33\25 \33\33\33 \33\33\33 \25\0\26\2\16NVDOLOVB"..., 65536) = 65536 <0.000035>
write(111, "\0\1\0\0$ \33 \16\33\25 \33\33\33 \33\33\33 \25\0\26\2\16NVDO"..., 65540) = 65540 <0.000024>

The first read() after openat() takes ~50 ms; subsequent reads of the same file return in microseconds. The pattern is very consistent, thus it is not one PG or one OSD being contended.
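To confirm that the cost really sits in the first read() after each openat() rather than being spread across the file, the strace output can be bucketed per fd. A rough sketch, assuming strace's default `-T` output format (the function name and regexes are mine, not part of any tool; fd reuse after close() is not handled):

```python
import re
from statistics import mean, median

# Match "openat(...) = <fd> <latency>" and "read(<fd>, ...) = <n> <latency>"
# as printed by `strace -T`.
OPEN_RE = re.compile(r'openat\(.*?\)\s*=\s*(\d+)\s*<([\d.]+)>')
READ_RE = re.compile(r'read\((\d+),.*?\)\s*=\s*-?\d+\s*<([\d.]+)>')

def split_latencies(lines):
    """Split read() latencies into first-read-after-open vs. all later reads."""
    first, rest = [], []
    fresh = set()  # fds opened but not yet read from
    for line in lines:
        m = OPEN_RE.search(line)
        if m:
            fresh.add(int(m.group(1)))
            continue
        m = READ_RE.search(line)
        if m:
            fd, lat = int(m.group(1)), float(m.group(2))
            (first if fd in fresh else rest).append(lat)
            fresh.discard(fd)
    return first, rest

# Demo on two syscalls from the trace above (payloads shortened):
sample = [
    'openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) = 38 <0.000030>',
    'read(38, "CLC..."..., 65536) = 65536 <0.049733>',
    'read(38, " $$$..."..., 65536) = 65536 <0.000199>',
]
first, rest = split_latencies(sample)
print(f"first: median={median(first)*1000:.1f}ms  "
      f"later: median={median(rest)*1000:.1f}ms")
```

In practice you would feed the whole of `sudo strace -T -p <pid> 2>&1` through this; on the trace above, first-read medians land in the tens of milliseconds while later reads stay in the microsecond range.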
$ sudo strace -T -p 7307 2>&1 | grep -A 3 open | grep read
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 11968 <0.070917>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 23232 <0.039789>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.053598>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 28240 <0.105046>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.061966>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.050943>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.031217>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 7392 <0.052612>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 288 <0.075930>
read(41, "1316919290-DASPHYNBAAAAAAPe2218b"..., 65536) = 940 <0.040609>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 22400 <0.038423>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 11984 <0.039051>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 9040 <0.054161>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.040654>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 22352 <0.031236>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.123424>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 49984 <0.052249>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 28176 <0.052742>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 288 <0.092039>

Or, to summarize:

$ sudo strace -T -p 23748 2>&1 | grep -A 3 open | grep read | \
  perl -ane'/<(\d+\.\d+)>/; print $1 . "\n";' | head -n 1000 | ministat
    N           Min           Max        Median           Avg        Stddev
x 1000      3.2e-05      2.141551      0.054313   0.065834359   0.091480339

As can be seen, the "initial" read averages 65.8 ms, which - if the file size is, say, 1 MB and the rest of the reads take essentially no time - caps read throughput at roughly 15-20 MB/s. At that pace, the journey through double-digit TB is long, even with 72 OSDs backing the pool.

Spec: Ceph Luminous 12.2.5, BlueStore; 6 OSD nodes, 10 TB HDDs, 4+2 EC pool, 10GbitE.

Locally the drives deliver latencies of approximately 6-8 ms for a random read. Any suggestion on where to find out where the remaining ~50 ms is being spent would be truly helpful. Large files "just work", as read-ahead does a nice job of getting performance up.

-- 
Jesper

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
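P.S. The throughput cap is just the file size divided by the per-file latency. A back-of-the-envelope check (the 1 MB average file size is an assumption, not measured):

```python
# Single-threaded throughput cap when each file pays the full first-read
# latency and the remaining sequential reads are essentially free.
first_read_latency_s = 0.0658   # seconds: the ministat average above
file_size_mb = 1.0              # assumed average file size (hypothetical)

throughput_mb_s = file_size_mb / first_read_latency_s
print(f"cap = {throughput_mb_s:.1f} MB/s")   # ~15 MB/s
```

With 6-8 ms local drive latency, a pure disk seek per file would instead allow well over 100 MB/s single-threaded, which is why the extra ~50 ms per open is the interesting part.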