---- Sage Weil编写 ---- > On Tue, 21 Apr 2015, Chen, Xiaoxi wrote: > > Haomai is right in theory, but I am not sure whether all > > user(mon,filestore,kvstore) of submit_transaction API clearly holding > > the expectation that their data is not persistent and may lost in > > failure. So in rocksdb now the sync is default to true even in > > submit_transaction(and this option make the two api exactly the same). > > Maybe we need to rename the api to > > submit_transaction_persistent/nonpersistent to better discribe the > > behavior? > > Let's audit them, then.. I think they are right, but we may as well > confirm! > > Again, FileStore is the odd one out here because it is relying on the > syncfs(2) at commit time for everything. > Yes, so maybe we dont need to expose the option to user, we can decide whether to.sync in code logic. I remember some folks in out team tried to move KVDB to a partition on SSD while leave other filestore data on HDD, in my memory it benifit performance. This deployment is problematic with kv_sync=false. gWill check the data first and then we can evaluate whethe we want to support this kind of deployment. > > And yeah, whether a sync(persistent) transaction success can persist the > > previous non-sync(unpersistent) transaction is the main issue here. > > Yeah. I posted to facebook, no reply yet! > > s > > > > > ---- Sage Weil?? ---- > > > > > > On Tue, 21 Apr 2015, Haomai Wang wrote: > > > On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote: > > > > Hi Sage, > > > > Well, that's > > > > submit_transaction -- submit a transaction , whether block waiting for fdatasync depends on rocksdb-disable-sync. > > > > submit_transaction_sync -- queue transaction and wait until it is stable on disk. > > > > So if we default rocksdb-disable-sync to false, the two API are same. I haven't look at the LevelDB but I suspect it's similar. > > > > > > Eh, I don't think it's the same. By default WriteOption.disableWAL is > > > false in our ceph side, and submit_transaction will use > > > WriteOption.sync=false and submit_transaction_sync will use > > > WriteOption.sync=true. > > > > > > If sync==fase, rocksdb won't sync log file, otherwise it will call > > > fsync/fdatasync to flush log file. > > > > > > Plz correct me if not. :-) > > > > That's what it looks like to me, too. I think the disable sync is a > > separate optimization for bulk data loading that only filestore wants > > (because it calls sync(2); we should probably set the option explicitly in > > FileStore.cc instead of exposing as a ceph option?). > > > > I think the issue with what we have now is that a > > submit_transaction_sync() with an empty transaction may not sync previous > > transactions if the log rolled over? I would really expect that it would, > > though... :/ I'll ask on the rocksdb facebook page. > > > > sage > > > > > > > > > > > > > > > I just re-read the Newstore code, seems the workflow is not as that we want. We issue a bunch of submit_transaction and in the _kv_sync_thread we try to have a checkpoint that ensure previous transaction are persistent, by using submit_transcation_sync to submit an empty transaction. But actually > > > > 1. the submit_transaction is already a synchronized call so the empty transcation in _kv_sync_thread is kind of waste. > > > > 2. An sync transaction cannot ensure the previous transaction is also synced. The API doesn't guarantee this, and from implementation, this two transactions may goes to different WAL files. > > > > > > > > Yes, if we want, we can have a Queue and Thread that collecting the transactions and merge them to a big transaction , some ::fdatasync will be saved here. But this approach looks complex. > > > > > > > > Some optimizations in my mind are: > > > > 1. Batch the cleanup operations in _apply_wal_transaction, we don?t need to synchronized remove the WAL item, we can just put them into kv_sync_thread_Q and let kv_sync_thread to form a batch transaction that deleted a bunch of key. > > > > 2. We don't need the empty transaction in kv_sync_thread, we could call the _txc_kv_finish_kv directly from _txc_submit_kv, since the KV is synchronized. > > > > 3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better descript its work. > > > > > > > > How do you think > > > > > > > > Xiaoxi > > > > -----Original Message----- > > > > From: Sage Weil [mailto:sweil@xxxxxxxxxx] > > > > Sent: Tuesday, April 21, 2015 12:48 AM > > > > To: Chen, Xiaoxi > > > > Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel > > > > Subject: Re: ??: Re: NewStore performance analysis > > > > > > > > On Mon, 20 Apr 2015, Chen, Xiaoxi wrote: > > > >> > > An easy way to measure might be comment out > > > >> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to > > > >> > > db->see if > > > >> > > we can get more QD in fragment part without issuing the DB. > > > >> > > > > >> > I'm not sure I totally understand the interface.. my assumption is > > > >> > that queue_transaction will give rocksdb the txn to commit whenever > > > >> > it finds it convenient (no idea what policy is used there) and > > > >> > queue_transaction_sync will trigger a commit now. If we did have > > > >> > multiple threads doing queue_trandsaction_sync (by, say, calling it > > > >> > directly in _txc_submit_kv) would qa go up? > > > >> > > > > >> I think you might miss something, currently the two interface are > > > >> exactly the SAME unless you set rocksdb-disable-sync=true(which is > > > >> false by default). > > > >> > > > >> When commit, rocksdb will write the content to both memtable(write > > > >> buffer) and WAL. if the transaction doesnt go with sync, it will also > > > >> commit now,but the write to WAL will NOT be sync(by calling fdatasync). > > > >> That means we may lose data if power failure/kernel panic. This is why > > > >> i changed the default rocksdb-disable-sync from true to false in > > > >> previous patch. > > > > > > > > Yeah, I'm confused. :) > > > > > > > > So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore. It's different for filestore, which is doing a syncfs checkpoint. Perhaps we should have newstore set that explicitly instead of passing through a config option. > > > > > > > > In any case, though, I'm confused by > > > > > > > >> if the transaction doesnt go with sync, it will also commit now,but > > > >> the write to WAL will NOT be sync(by calling fdatasync). > > > > > > > > What does it mean to 'commit' but not call fdatasync? What does commit mean in this case? > > > > > > > > And, and I correct in understanding that we have > > > > > > > > queue_transaction -- queue a transaction but don't block waiting for fdatasync queue_transaction_sync -- queue transaction and wait until it is stable on disk > > > > > > > > to work with? > > > > > > > > Thanks! > > > > sage > > > > > > > > > > > > -- > > > Best Regards, > > > > > > Wheat > > > -- > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > >��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f