Re: CephFS overwrite/truncate performance hit

On Thu, Feb 7, 2019 at 3:31 AM Hector Martin <hector@xxxxxxxxxxxxxx> wrote:
On 07/02/2019 19:47, Marc Roos wrote:
>   
> Is this difference not related to caching? Are you filling up some
> cache/queue at some point? If you do a sync after each write, do you
> still get the same results?

No, the slow operations are slow from the very beginning. It's not about
filling a buffer/cache somewhere. I'm guessing the slow operations
trigger several synchronous writes to the underlying OSDs, while the
fast ones don't. But I'd like to know more about why exactly there is
this significant performance hit to truncation operations vs. normal writes.

To give some more numbers:

echo test | dd of=b conv=notrunc

This completes extremely quickly (microseconds). The data obviously
remains in the client cache at this point. This is what I want.

echo test | dd of=b conv=notrunc,fdatasync

This runs quickly until the fdatasync(), then that takes ~12ms, which is
about what I'd expect for a synchronous write to the underlying HDDs. Or
maybe that's two writes?

It's certainly one write, and may be two overlapping ones if you've extended the file and need to persist its new size (via the MDS journal).

echo test | dd of=b

This takes ~10ms in the best case for the open() call (sometimes 30-40
or even more), and 6-8ms for the write() call.

echo test | dd of=b conv=fdatasync

This takes ~10ms for the open() call, ~8ms for the write() call, and
~18ms for the fdatasync() call.

So it seems like truncating/recreating an existing file introduces
several disk I/Os worth of latency and forces synchronous behavior
somewhere down the stack, while merely creating a new file or writing to
an existing one without truncation does not.

Right. Truncates and renames require sending messages to the MDS, and the MDS has to commit the change in status to RADOS (aka its disk) before they can complete. Creating a new file will generally use a preallocated inode, so it's just a network round-trip to the MDS.
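
For anyone who wants to reproduce the per-call numbers above without tracing dd, here is a minimal Python sketch that times open(), write() and fdatasync() separately. The function name timed_overwrite and its parameters are made up for illustration. It assumes Linux (os.fdatasync is not available everywhere) and that O_WRONLY|O_CREAT, plus O_TRUNC unless notrunc is requested, is roughly what dd passes to open(). It only measures wall-clock time on the client and says nothing about what the MDS or OSDs are doing.

```python
import os
import time


def timed_overwrite(path, data=b"test\n", truncate=True, sync=False):
    """Write `data` at offset 0 of `path` and return per-call latencies in seconds.

    truncate=True  approximates `dd of=path`               (open with O_TRUNC)
    truncate=False approximates `dd of=path conv=notrunc`  (no O_TRUNC)
    sync=True additionally calls fdatasync(), like conv=fdatasync.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC

    timings = {}

    start = time.perf_counter()
    fd = os.open(path, flags, 0o644)
    timings["open"] = time.perf_counter() - start

    try:
        start = time.perf_counter()
        os.write(fd, data)
        timings["write"] = time.perf_counter() - start

        if sync:
            start = time.perf_counter()
            os.fdatasync(fd)
            timings["fdatasync"] = time.perf_counter() - start
    finally:
        os.close(fd)

    return timings
```

Calling timed_overwrite("b", truncate=False) and timed_overwrite("b", truncate=True) on a file inside the CephFS mount should show whether the extra latency lands in open(), as described above. Run each variant several times, since single samples are noisy.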

Going back to your first email: if you do an overwrite that is confined to a single stripe unit in RADOS, it is guaranteed to be atomic. By default a stripe unit is the size of your objects, which is 4MB, and it's aligned from offset 0. CephFS can only tear writes across objects, and only if your client fails before the data has been flushed.
-Greg
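
As an illustration of the single-stripe-unit condition in the reply above, the arithmetic can be sketched in a few lines of Python. The names OBJECT_SIZE, objects_touched and is_single_unit_write are hypothetical. The sketch assumes the default layout described in the reply (stripe unit equal to a 4MB object, aligned from offset 0); a file with a custom layout would need its own unit size and possibly a different mapping.

```python
# Assumption: default layout, stripe unit == object size == 4 MiB, aligned from 0.
OBJECT_SIZE = 4 * 1024 * 1024


def objects_touched(offset, length, unit=OBJECT_SIZE):
    """Return the range of object indices covered by a write of `length` bytes at `offset`."""
    if offset < 0 or length <= 0 or unit <= 0:
        raise ValueError("offset must be >= 0, length and unit must be > 0")
    first = offset // unit
    last = (offset + length - 1) // unit  # index of the object holding the last byte
    return range(first, last + 1)


def is_single_unit_write(offset, length, unit=OBJECT_SIZE):
    """True if the write stays inside one unit, i.e. it cannot be torn across objects."""
    return len(objects_touched(offset, length, unit)) == 1
```

For example, a 5-byte overwrite at offset 0 stays inside object 0, while a 2-byte write starting at the last byte of the first 4MB crosses into object 1 and is the kind of write that could be torn.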

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com