Yeah, BlueStore is the best solution. The rename operation is done almost
entirely in memory, so it's fast.

2016-04-03 21:07 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Sun, 3 Apr 2016, huang jun wrote:
>> Hi, all
>> Recently, we tested the deletion performance of an erasure-coded pool
>> and found that the delete op is pretty slow.
>> We traced the code and found that the Delete op is converted to a
>> Rename in FileStore. It is implemented in
>> FileStore::_collection_move_rename(), which calls
>> FileStore::_set_replay_guard() and FileStore::_close_replay_guard().
>> Together these two functions do 3 fsyncs and 1 objectmap sync, which
>> we think is where most of the time goes.
>> ===================================
>> Environment:
>> ceph version: 0.94.5
>> linux kernel: 3.18
>> Test cluster: 1 MON, 1 MDS, 4 OSD
>> ===================================
>> We ran some comparison tests (average time per rename op):
>> 1. sync omap + fd (default): 0.883818s
>> 2. only sync omap: 0.428431s
>> 3. only sync fd: 0.400266s
>> 4. don't sync: 0.00319648s
>> 5. do posix_fadvise(POSIX_FADV_DONTNEED) after write, and sync
>>    omap + fd: 0.855178s
>> 6. use fdatasync to replace fsync, and sync omap + fd: 0.432659s
>>
>> As we can see, syncing the fd and syncing the objectmap each account
>> for about 50% of the total time.
>> Comparing 1 with 6, fdatasync takes about 50% less time than fsync,
>> which suggests that syncing metadata costs more than syncing data.
>> FileStore::_set_replay_guard() and FileStore::_close_replay_guard()
>> only set the object's user.cephos.seq xattr, then sync to make it
>> durable.
>>
>> I have some questions:
>> 1. Do we record the xattr (user.cephos.seq) to avoid replaying an
>> older transaction?
>
> Yes, exactly.
>
>> 2. If we don't sync, we get the best performance. Are there any
>> side effects, such as data corruption?
>
> Yes--we may replay incorrectly after a failure. Unfortunately, not an
> option.
>
> This problem goes away with BlueStore, so we only have to live with it
> for a bit longer. I don't think it's worth investing any effort into
> addressing this with FileStore--we'll be unlikely to want to merge any
> non-trivial change anyway.
>
> sage

--
thanks
huangjun
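
For readers following the thread, the role of the replay guard discussed above can be illustrated with a toy model. This is only a minimal sketch and not FileStore code: the `replay` function, the journal tuple layout, and the `guards` dict (a stand-in for the user.cephos.seq xattr) are hypothetical names chosen for illustration. It assumes a journal that is replayed from the start after a crash, and shows what can go wrong if the guard was never made durable.

```python
def replay(journal, objects, guards):
    """Replay journal entries against a toy in-memory store.

    journal: list of (seq, op, args) tuples, oldest first.
    objects: dict of name -> data, the state found after the crash.
    guards:  dict of name -> highest seq already applied to that object
             (hypothetical stand-in for the user.cephos.seq xattr).
    """
    for seq, op, args in journal:
        if op == "rename":
            src, dst = args
            # Guard check: skip the rename if it was already applied.
            if guards.get(dst, 0) >= seq:
                continue
            if src in objects:
                objects[dst] = objects.pop(src)
            guards[dst] = seq
        elif op == "write":
            name, data = args
            objects[name] = data  # idempotent, safe to re-apply
    return objects


# Journal: rename A -> B, then create a new object under the old name A.
JOURNAL = [
    (1, "rename", ("A", "B")),
    (2, "write", ("A", "new")),
]


def state_after_crash():
    # Both ops reached the store before the crash; the journal was not trimmed.
    return {"A": "new", "B": "old"}


# Guard was synced before the crash: the rename is skipped on replay.
with_guard = replay(JOURNAL, state_after_crash(), {"B": 1})
# Guard was not synced and is lost: the rename is replayed over newer data.
without_guard = replay(JOURNAL, state_after_crash(), {})

print(with_guard)     # {'A': 'new', 'B': 'old'}
print(without_guard)  # {'B': 'new', 'A': 'new'}
```

In the second case B ends up holding the contents of the newer A, which is the kind of incorrect replay that skipping the sync would risk in this simplified model.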