Re: CephFS overwrite/truncate performance hit

On Thu, Feb 7, 2019 at 3:31 AM Hector Martin <hector@xxxxxxxxxxxxxx> wrote:
On 07/02/2019 19:47, Marc Roos wrote:
>
> Is this difference not related to caching? Are you filling up some
> cache/queue at some point? If you do a sync after each write, do you
> still get the same results?

No, the slow operations are slow from the very beginning. It's not about
filling a buffer/cache somewhere. I'm guessing the slow operations
trigger several synchronous writes to the underlying OSDs, while the
fast ones don't. But I'd like to know more about why exactly truncation
operations take such a significant performance hit compared to normal writes.

To give some more numbers:

echo test | dd of=b conv=notrunc

This completes extremely quickly (microseconds). The data obviously
remains in the client cache at this point. This is what I want.

echo test | dd of=b conv=notrunc,fdatasync

This runs quickly until the fdatasync(), then that takes ~12ms, which is
about what I'd expect for a synchronous write to the underlying HDDs. Or
maybe that's two writes?

It's certainly one write, and may be two overlapping ones if you've extended the file and need to persist its new size (via the MDS journal).

echo test | dd of=b

This takes ~10ms in the best case for the open() call (sometimes 30-40ms
or even more), and 6-8ms for the write() call.

echo test | dd of=b conv=fdatasync

This takes ~10ms for the open() call, ~8ms for the write() call, and
~18ms for the fdatasync() call.
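
(If you want to break these per-call timings down yourself rather than relying on dd's summary, something along these lines should work; strace's -T flag appends the time spent in each syscall, and the exact names to trace, e.g. openat vs. open, depend on the libc:)

echo test | strace -T -e trace=open,openat,write,fdatasync,close dd of=b conv=fdatasync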

So it seems like truncating/recreating an existing file introduces
several disk I/Os worth of latency and forces synchronous behavior
somewhere down the stack, while merely creating a new file or writing to
an existing one without truncation does not.

Right. Truncates and renames require sending messages to the MDS, and waiting for the MDS to commit the state change to RADOS (i.e. its disk), before they can complete. Creating new files will generally use a preallocated inode, so it's just a network round-trip to the MDS.
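
(Which also suggests that if you want to stay on the fast, client-cached path, overwriting in place without O_TRUNC is the way to go, which is what your conv=notrunc test does. As a sketch, assuming a bash-like shell, a plain read-write redirection opens the file without truncating it:)

echo test 1<> b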

Going back to your first email, if you do an overwrite that is confined to a single stripe unit in RADOS (by default the stripe unit equals the object size, 4MB, and it's aligned from offset 0), it is guaranteed to be atomic. CephFS can only tear writes across object boundaries, and only if your client fails before the data has been flushed.
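
(If you want to double-check the layout for a given file, CephFS exposes it as a virtual xattr; something like the following, where the pool name and exact fields in the output are just illustrative:)

getfattr -n ceph.file.layout b

which prints something like

ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"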
-Greg

