Re: cephfs - modifying the ceph.file.layout of existing files

Hi,

pwrite is fast, though it's async and it takes ~12s to flush:

time /usr/sbin/xfs_io -f -c "pwrite 0 4568293535" /ceph/grid/cacheslow/data/aa/007494abc70353384e4788624d19fe268db8ce
wrote 4568293535/4568293535 bytes at offset 0
4.255 GiB, 1115307 ops; 0:00:04.39 (992.045 MiB/sec and 253963.6456 ops/sec)

real    0m4.399s
user    0m0.160s
sys     0m4.183s
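
To include the flush in the timing, something like this should work (just a sketch; it relies on xfs_io's fsync command to fsync(2) the open file after the write):

time /usr/sbin/xfs_io -f -c "pwrite 0 4568293535" -c fsync \
    /ceph/grid/cacheslow/data/aa/007494abc70353384e4788624d19fe268db8ce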


but the copy_range is very slow:
f9sn001 ~ # time /usr/sbin/xfs_io -f -c "copy_range -s 0 -d 0 -l 4568293535 /ceph/grid/cache/data/aa/007494abc70353384e4788624d19fe268db8ce" /ceph/grid/cacheslow/data/aa/007494abc70353384e4788624d19fe268db8ce

real    5m20.007s
user    0m0.001s
sys     0m0.654s

That's about 13 MB/s.  Is there any option to speed this up, apart from running it in parallel on many files (sketched below)?  Are there some OSD settings that could help?  The fast cache is quite loaded, but it can still read 200 MB/s to the client; the slow cache is practically unused.

I am testing it on 5.6.13 kernel with copyfrom mount option and on octopus 15.2.2 with bluefs_preextend_wal_files=false
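
For reference, the parallel approach I have in mind is roughly this (just a sketch, assuming GNU xargs and a filelist.txt with one relative path per line; directory roots as above, 8 copies at a time):

# $1 is the source path on the fast pool, $2 the destination on the slow pool
xargs -a filelist.txt -I{} -P 8 sh -c '
    len=$(stat -c %s "$1") &&
    /usr/sbin/xfs_io -f -c "copy_range -s 0 -d 0 -l $len $1" "$2"
' sh "/ceph/grid/cache/data/{}" "/ceph/grid/cacheslow/data/{}"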

Cheers,
Andrej

On 2020-05-28 14:07, Andrej Filipcic wrote:

Thanks a lot, I will give it a try. I plan to use it in a very controlled environment anyway.

Best regards,
Andrej

On 2020-05-28 12:21, Luis Henriques wrote:
Andrej Filipcic <andrej.filipcic@xxxxxx> writes:

Hi,

I have two directories, cache_fast and cache_slow, and I would like to move the least used files from fast to slow, i.e., user-side tiering. cache_fast is pinned
to the fast_data SSD pool, while cache_slow is pinned to the HDD cephfs_data pool.

$ getfattr -n ceph.dir.layout /ceph/grid/cache_fast
getfattr: Removing leading '/' from absolute path names
# file: ceph/grid/cache_fast
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
pool=fast_data"

$ getfattr -n ceph.dir.layout /ceph/grid/cache_slow
getfattr: Removing leading '/' from absolute path names
# file: ceph/grid/cache_slow
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
pool=cephfs_data"



"mv" from cache_fast dir to cache_slow dir only renames the file in mds, but does not involve migration to a different pool and changing the file layout.

The only option I see at this point is to "cp" the file to a new dir and removing it from the old one, but this would involve client side operations and
can be very slow.

Is there any better way, that would work ceph server side?
I'm afraid I can't give you a full solution for your problem; there's
probably an easier way of achieving what you want, but I just wanted to let you know of a solution that *may* help in very specific scenarios, namely if:

1. you're using a (recent) kernel client to mount the filesystem
2. your files are all bigger than the object_size and, ideally, their
    sizes are multiples of object_size (a quick check is sketched below)
3. stripe_count is *always* 1
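
A quick way to check 2. and 3. on an existing file (the path here is just a placeholder):

   # file size vs. object_size, and the per-file layout (stripe_count, pool)
   stat -c %s /ceph/grid/cache_fast/somefile
   getfattr -n ceph.file.layout /ceph/grid/cache_fast/somefile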

Assuming all the above are true, you can (partially) offload the file
copy to the OSDs by using the copy_file_range(2) syscall.  I don't think
tools such as 'cp' will use this syscall, so you may need to find
alternatives.  Here's a simple example using xfs_io:

   # create a file with 4 objects (4 * 4194304)
   xfs_io -f -c "pwrite 0 16777216" /ceph/grid/cache_fast/oldfile
   # copy the 4-object file; the new file gets the cache_slow layout
   xfs_io -f -c "copy_range -s 0 -d 0 -l 16777216 \
       /ceph/grid/cache_fast/oldfile" /ceph/grid/cache_slow/newfile
   rm /ceph/grid/cache_fast/oldfile

What will happen in this example is that the file data for oldfile will
never be read into the client.  Instead, the client will be sending
'object copy' requests to the OSDs.  Also, because this is effectively a
copy, the new file will have the layout you expect.

Finally, because the copy_file_range syscall in cephfs is disabled by
default, you'll need to have the filesystem mounted with the "copyfrom"
mount option.
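
For example, a kernel client mount could look roughly like this (the monitor address and credentials below are placeholders):

   mount -t ceph 192.168.1.1:6789:/ /ceph \
       -o name=admin,secretfile=/etc/ceph/admin.secret,copyfrom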

So, as I said, not a real solution, but a way to eventually implement
one.

Cheers,




--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail: Andrej.Filipcic@xxxxxx
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-425-7074
-------------------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



