Re: How to improve erasure pool's delete performance

On Sun, 3 Apr 2016, huang jun wrote:
> Hi, all
> Recently we tested the delete performance of an erasure-coded pool and
> found that the delete op is pretty slow.
> We traced the code and found that the delete op is converted to a rename
> in FileStore, implemented in FileStore::_collection_move_rename().
> That function calls FileStore::_set_replay_guard() and
> FileStore::_close_replay_guard().
> Between them, these 2 functions do 3 fsyncs and 1 object map sync, which
> we think account for most of the time.
> ===================================
> Environment:
> ceph version: 0.94.5
> linux kernel: 3.18
> Test cluster: 1MON, 1MDS, 4OSD
> ===================================
> We ran some comparison tests (average time per rename op):
> 1. sync omap + fd (default): 0.883818s
> 2. only sync omap: 0.428431s
> 3. only sync fd: 0.400266s
> 4. don't sync: 0.00319648s
> 5. posix_fadvise(POSIX_FADV_DONTNEED) after write, then sync omap + fd: 0.855178s
> 6. fdatasync instead of fsync, and sync omap + fd: 0.432659s
> 
> As we can see, syncing the fd and syncing the object map each take about
> 50% of the total time.
> Comparing 1 with 6, fdatasync takes about 50% less time than fsync,
> which suggests that syncing metadata costs more than syncing data.
> FileStore::_set_replay_guard() and FileStore::_close_replay_guard()
> only set the object's user.cephos.seq xattr and then sync to make it
> durable.
> 
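
As a quick check on the ratios quoted above, the sketch below recomputes
each variant's share of the default time from the posted averages. The
dictionary labels and the helper name are shorthand introduced here, not
anything from the test setup.

```python
# Average rename-op times (seconds) as posted in the comparison list.
TIMINGS = {
    "sync omap + fd (default)": 0.883818,
    "only sync omap": 0.428431,
    "only sync fd": 0.400266,
    "no sync": 0.00319648,
    "fadvise + sync omap + fd": 0.855178,
    "fdatasync + sync omap": 0.432659,
}

BASELINE = TIMINGS["sync omap + fd (default)"]


def share_of_default(label: str) -> float:
    """Return a variant's time as a percentage of the default variant."""
    return 100.0 * TIMINGS[label] / BASELINE


if __name__ == "__main__":
    for label, seconds in TIMINGS.items():
        print(f"{label:28s} {seconds:.6f}s  {share_of_default(label):5.1f}%")
```

On these numbers, syncing only the omap comes to about 48% of the default
and syncing only the fd to about 45%, so "50% each" is a fair rounding.
The fdatasync variant comes to about 49%.
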
> I have some questions:
> 1. Do we record the xattr(user.cephos.seq) to avoid replaying an older
> transaction?

Yes, exactly.

> 2. If we don't sync, we get the best performance. Are there any
> side effects, such as data corruption?

Yes--we may replay incorrectly after a failure.  Unfortunately, not an 
option.
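
To make the failure mode concrete, here is a toy model (not FileStore
code) of a sequence-number guard. Everything in it is an assumption for
illustration: the Store class, the JOURNAL contents, and the crash
behaviour are invented, guard_seq stands in for the user.cephos.seq
xattr, and crash() models the worst case where the rename and a later
write reached disk but an unsynced guard did not. The real check is more
involved than a single integer comparison.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    data: str
    guard_seq: int = 0  # stands in for the user.cephos.seq xattr


class Store:
    """Toy object store; durable_guard=False models skipping the syncs."""

    def __init__(self, durable_guard):
        self.durable_guard = durable_guard
        self.objects = {}

    def write(self, name, data):
        # Overwriting with the same data is safe to repeat on replay.
        self.objects.setdefault(name, Obj(data="")).data = data

    def rename(self, seq, src, dst, replaying=False):
        if replaying:
            target = self.objects.get(dst)
            # Guard check: this op (or a later one) was already applied.
            if target is not None and target.guard_seq >= seq:
                return False
        if src not in self.objects:
            return False
        obj = self.objects.pop(src)
        obj.guard_seq = seq  # record the guard on the renamed object
        self.objects[dst] = obj
        return True

    def crash(self):
        # Assumed worst case: data changes survived, unsynced guards did not.
        if not self.durable_guard:
            for obj in self.objects.values():
                obj.guard_seq = 0


JOURNAL = [
    (1, "write", ("a", "old")),
    (2, "rename", ("a", "b")),
    (3, "write", ("a", "new")),
]


def apply_op(store, op, replaying):
    seq, kind, args = op
    if kind == "write":
        store.write(*args)
    else:
        store.rename(seq, *args, replaying=replaying)


def run(durable_guard):
    store = Store(durable_guard)
    for op in JOURNAL:
        apply_op(store, op, replaying=False)
    store.crash()
    # Assume op 1 was already trimmed, so replay starts at op 2.
    for op in JOURNAL[1:]:
        apply_op(store, op, replaying=True)
    return {name: obj.data for name, obj in sorted(store.objects.items())}


if __name__ == "__main__":
    print("durable guard:", run(True))
    print("guard lost:   ", run(False))
```

With a durable guard the replayed rename is skipped and b keeps the old
contents. When the guard is lost, the replayed rename moves the newer a
over b, so b ends up holding data it never should have had.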

This problem goes away with BlueStore, so we only have to live with it 
for a bit longer.  I don't think it's worth investing any effort into 
addressing this with FileStore--we'll be unlikely to want to merge any 
non-trivial change anyway.

sage
