Probably not related to CephFS. Try to compare the latency you are seeing to the op_r_latency reported by the OSDs. The fast_read option on the pool can also help a lot for this IO pattern. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Fri, Feb 15, 2019 at 10:05 PM <jesper@xxxxxxxx> wrote: > > Hi. > > I've got a bunch of "small" files moved onto CephFS as archive/bulk storage > and now I have the backup (to tape) to spool over them. A sample of the > single-threaded backup client delivers this very consistent pattern: > > $ sudo strace -T -p 7307 2>&1 | grep -A 7 -B 3 open > write(111, "\377\377\377\377", 4) = 4 <0.000011> > openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) = > 38 <0.000030> > write(111, "\0\0\0\021197418 2 67201568", 21) = 21 <0.000036> > read(38, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 <0.049733> > write(111, > "\0\1\0\0CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65540) = 65540 <0.000037> > read(38, " $$$$$$$$$$ $$$$$$\16\33\16 $$$$$$$$\16\33"..., 65536) = 65536 > <0.000199> > write(111, "\0\1\0\0 $$$$$$$$$$ $$$$$$\16\33\16 $$$$$$"..., 65540) = 65540 > <0.000026> > read(38, "$ \33 \16\33\25 \33\33\33 \33\33\33 > \25\0\26\2\16NVDOLOVB"..., 65536) = 65536 <0.000035> > write(111, "\0\1\0\0$ \33 \16\33\25 \33\33\33 \33\33\33 > \25\0\26\2\16NVDO"..., 65540) = 65540 <0.000024> > > The pattern is very consistent, thus it is not one PG or one OSD being > contented. > $ sudo strace -T -p 7307 2>&1 | grep -A 3 open |grep read > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 11968 <0.070917> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 23232 <0.039789> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 65536 <0.053598> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 28240 <0.105046> > read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.061966> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 65536 <0.050943> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 <0.031217> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 7392 <0.052612> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 288 <0.075930> > read(41, "1316919290-DASPHYNBAAAAAAPe2218b"..., 65536) = 940 <0.040609> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 22400 <0.038423> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 11984 <0.039051> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 9040 <0.054161> > read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.040654> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 22352 <0.031236> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 65536 <0.123424> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 49984 <0.052249> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 28176 <0.052742> > read(41, > "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) > = 288 <0.092039> > > Or to sum: > sudo strace -T -p 23748 2>&1 | grep -A 3 open | grep read | perl > -ane'/<(\d+\.\d+)>/; print $1 . "\n";' | head -n 1000 | ministat > > N Min Max Median Avg Stddev > x 1000 3.2e-05 2.141551 0.054313 0.065834359 0.091480339 > > > As can be seen the "initial" read averages at 65.8ms - which - if the > filesize is say 1MB and > the rest of the time is 0 - caps read performance mostly 20MB/s .. at that > pace, the journey > through double digit TB is long even with 72 OSD's backing. > > Spec: Ceph Luminous 12.2.5 - Bluestore > 6 OSD nodes, 10TB HDDs, 4+2 EC pool, 10GbitE > > Locally the drives deliver latencies of approximately 6-8ms for a random > read. Any suggestion > on where to find out where the remaining 50ms is being spend would be > truely helpful. > > Large files "just works" as read-ahead does a nice job in getting > performance up. > > -- > Jesper > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com