Re: CephFS overwrite/truncate performance hit

Hector Martin <hector@xxxxxxxxxxxxxx> · Tue, 12 Feb 2019 22:10:32 +0900

On 12/02/2019 06:01, Gregory Farnum wrote:
> Right. Truncates and renames require sending messages to the MDS, and
> the MDS committing to RADOS (aka its disk) the change in status, before
> they can be completed. Creating new files will generally use a
> preallocated inode so it's just a network round-trip to the MDS.

I see. Is there a fundamental reason why these kinds of metadata 
operations cannot be buffered in the client, or is this just the current 
way they're implemented?

e.g. on a local FS these kinds of writes can just stick around in the 
block cache unflushed. And of course for CephFS I assume file extension 
also requires updating the file size in the MDS, yet that doesn't block 
while truncation does.

> Going back to your first email, if you do an overwrite that is confined
> to a single stripe unit in RADOS (by default, a stripe unit is the size
> of your objects which is 4MB and it's aligned from 0), it is guaranteed
> to be atomic. CephFS can only tear writes across objects, and only if
> your client fails before the data has been flushed.

Great! I've implemented this in a backwards-compatible way, so that gets 
rid of this bottleneck. It's just a 128-byte flag file (formerly 
variable length, now I just pad it to the full 128 bytes and rewrite it 
in-place). This is good information to know for optimizing things :-)
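
For anyone wanting to do the same, here is a rough sketch of the approach in Python. It is not my actual code: the function names, the NUL padding byte, and the file mode are assumptions made for illustration. The 128-byte size and the 4MB default object size are the figures from this thread. The idea is to open the file without O_TRUNC and overwrite the whole fixed-size record at offset 0, so no truncate ever goes to the MDS and the write stays inside a single object.

```python
import os

FLAG_SIZE = 128                  # fixed record size mentioned in the post
OBJECT_SIZE = 4 * 1024 * 1024    # default object / stripe unit size (4MB)


def within_one_object(offset, length, object_size=OBJECT_SIZE):
    """Return True if [offset, offset + length) does not cross an object boundary."""
    if length <= 0:
        return True
    return offset // object_size == (offset + length - 1) // object_size


def write_flag(path, payload):
    """Overwrite the flag file in place with a padded, fixed-size record."""
    if len(payload) > FLAG_SIZE:
        raise ValueError("payload does not fit in the fixed-size record")
    # Assumption: pad with NUL bytes; the real padding byte is not stated.
    record = payload.ljust(FLAG_SIZE, b"\0")
    if not within_one_object(0, len(record)):
        raise ValueError("record would span more than one object")
    # No O_TRUNC: the file keeps its size, so only a data overwrite happens.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        written = os.pwrite(fd, record, 0)
        if written != FLAG_SIZE:
            raise OSError("short write while updating flag file")
    finally:
        os.close(fd)


def read_flag(path):
    """Read the record back and strip the padding.

    Stripping trailing NULs also accepts an old unpadded file, as long as
    the payload itself never ends in a NUL byte (an assumption here).
    """
    with open(path, "rb") as f:
        return f.read(FLAG_SIZE).rstrip(b"\0")
```

A call such as write_flag("state.flag", b"ready") followed by write_flag("state.flag", b"ok") leaves the file at 128 bytes both times, so the second call is a plain overwrite rather than a truncate plus write.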

--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub