Re: How to improve erasure pool's delete performance

Yeah, BlueStore is the best solution; the rename operation is almost
entirely done in memory, so it's fast.


2016-04-03 21:07 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Sun, 3 Apr 2016, huang jun wrote:
>> Hi, all
>> Recently, we tested the deletion performance of an erasure-coded pool,
>> and we found that the delete op is pretty slow.
>> We traced the code and found that a Delete op is converted to a Rename
>> in FileStore. It is implemented in FileStore::_collection_move_rename(),
>> which calls FileStore::_set_replay_guard() and
>> FileStore::_close_replay_guard(); between them, these two functions do
>> 3 fsyncs and 1 object map sync, which we think take most of the time.
>> ===================================
>> Environment:
>> ceph version: 0.94.5
>> linux kernel: 3.18
>> Test cluster: 1MON, 1MDS, 4OSD
>> ===================================
>> We ran some comparison tests:
>> 1. sync omap + fd (default): avg rename op took 0.883818s
>> 2. only sync omap: 0.428431s
>> 3. only sync fd: 0.400266s
>> 4. no sync at all: 0.00319648s
>> 5. posix_fadvise(POSIX_FADV_DONTNEED) after write, plus sync omap + fd: 0.855178s
>> 6. fdatasync instead of fsync, plus sync omap + fd: 0.432659s
>>
>> As we can see, syncing the fd and syncing the object map each take about
>> 50% of the total time.
>> Comparing test 1 with test 6, fdatasync takes about 50% less time than
>> fsync, which means syncing metadata costs more than syncing data.
>> FileStore::_set_replay_guard() and FileStore::_close_replay_guard()
>> only set the object's user.cephos.seq xattr and then sync to make it
>> durable.
>>
>> I have some questions:
>> 1. Do we record the xattr (user.cephos.seq) to avoid replaying an older
>> transaction?
>
> Yes, exactly.
>
>> 2. If we don't sync, we get the best performance; are there any
>> side effects, like data corruption?
>
> Yes--we may replay incorrectly after a failure.  Unfortunately, not an
> option.
>
> This problem goes away with BlueStore, so we only have to live with it
> for a bit longer.  I don't think it's worth investing any effort into
> addressing this with FileStore--we'll be unlikely to want to merge any
> non-trivial change anyway.
>
> sage



-- 
thanks
huangjun
--


