回复: Re: 回复: Re: 回复: Re: NewStore performance analysis

"Chen, Xiaoxi" <xiaoxi.chen@xxxxxxxxx> · Tue, 21 Apr 2015 23:47:48 +0000

---- Sage Weil编写 ----

> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
> > Haomai is right in theory, but I am not sure whether all 
> > user(mon,filestore,kvstore) of submit_transaction API clearly holding 
> > the expectation that their data is not persistent and may lost in 
> > failure.  So in rocksdb now the sync is default to true even in 
> > submit_transaction(and this option make the two api exactly the same). 
> > Maybe we need to rename the api to 
> > submit_transaction_persistent/nonpersistent to better discribe the 
> > behavior?
> 
> Let's audit them, then.. I think they are right, but we may as well 
> confirm!
> 
> Again, FileStore is the odd one out here because it is relying on the 
> syncfs(2) at commit time for everything.
> 

Yes, so maybe we dont need to expose the option to user, we can decide whether to.sync in code logic.

I remember some folks in out team tried to move KVDB to a partition on SSD while leave other filestore data on HDD, in my memory it benifit performance.  This deployment is problematic with kv_sync=false.  gWill check the data first and then we can evaluate whethe we want to support this kind of deployment.

> > And yeah, whether a sync(persistent) transaction success can persist the 
> > previous non-sync(unpersistent) transaction is the main issue here.
> 
> Yeah. I posted to facebook, no reply yet!
> 
> s
> 
> > 
> > ---- Sage Weil?? ----
> > 
> > 
> > On Tue, 21 Apr 2015, Haomai Wang wrote:
> > > On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
> > > > Hi Sage,
> > > >         Well, that's
> > > >                 submit_transaction -- submit a transaction , whether block waiting for fdatasync depends on rocksdb-disable-sync.
> > > >                 submit_transaction_sync -- queue transaction and wait until it is stable on disk.
> > > >         So if we default rocksdb-disable-sync to false, the two API are same. I haven't look at the LevelDB but I suspect it's similar.
> > >
> > > Eh, I don't think it's the same. By default WriteOption.disableWAL is
> > > false in our ceph side, and submit_transaction will use
> > > WriteOption.sync=false and submit_transaction_sync will use
> > > WriteOption.sync=true.
> > >
> > > If sync==fase, rocksdb won't sync log file, otherwise it will call
> > > fsync/fdatasync to flush log file.
> > >
> > > Plz correct me if not. :-)
> > 
> > That's what it looks like to me, too.  I think the disable sync is a
> > separate optimization for bulk data loading that only filestore wants
> > (because it calls sync(2); we should probably set the option explicitly in
> > FileStore.cc instead of exposing as a ceph option?).
> > 
> > I think the issue with what we have now is that a
> > submit_transaction_sync() with an empty transaction may not sync previous
> > transactions if the log rolled over?  I would really expect that it would,
> > though... :/  I'll ask on the rocksdb facebook page.
> > 
> > sage
> > 
> > 
> > >
> > > >
> > > >         I just re-read the Newstore code, seems the workflow is not as that we want. We issue a bunch of submit_transaction and in the _kv_sync_thread we try to have a checkpoint that ensure previous transaction are persistent, by using submit_transcation_sync to submit an empty transaction.  But actually
> > > >         1. the submit_transaction is already a synchronized call so the empty transcation in _kv_sync_thread is kind of waste.
> > > >         2. An sync transaction cannot ensure the previous transaction is also synced. The API doesn't guarantee this, and from implementation, this two transactions may goes to different WAL files.
> > > >
> > > >         Yes, if we want, we can have a Queue and Thread that collecting the transactions and merge them to a big transaction , some ::fdatasync will be saved here. But this approach looks complex.
> > > >
> > > >         Some optimizations in my mind are:
> > > >         1. Batch the cleanup operations in _apply_wal_transaction, we don?t need to synchronized remove the WAL item, we can just put them into kv_sync_thread_Q and let kv_sync_thread to form a batch transaction that deleted a bunch of key.
> > > >         2. We don't need the empty transaction in kv_sync_thread, we could call the _txc_kv_finish_kv directly from _txc_submit_kv,  since the KV is synchronized.
> > > >                 3.  Then we can rename _kv_sync_thread to _kv_cleanup_thread to better descript its work.
> > > >
> > > >         How do you think
> > > >
> > > >                                                                                                                         Xiaoxi
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > > > Sent: Tuesday, April 21, 2015 12:48 AM
> > > > To: Chen, Xiaoxi
> > > > Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
> > > > Subject: Re: ??: Re: NewStore performance analysis
> > > >
> > > > On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > > >> > > An easy way to measure might be comment out
> > > >> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to
> > > >> > > db->see if
> > > >> > > we can get more QD in fragment part without issuing the DB.
> > > >> >
> > > >> > I'm not sure I totally understand the interface.. my assumption is
> > > >> > that queue_transaction will give rocksdb the txn to commit whenever
> > > >> > it finds it convenient (no idea what policy is used there) and
> > > >> > queue_transaction_sync will trigger a commit now.  If we did have
> > > >> > multiple threads doing queue_trandsaction_sync (by, say, calling it
> > > >> > directly in _txc_submit_kv) would qa go up?
> > > >> >
> > > >> I think you might miss something, currently the two interface are
> > > >> exactly the SAME unless you set rocksdb-disable-sync=true(which is
> > > >> false by default).
> > > >>
> > > >> When commit, rocksdb will write the content to both memtable(write
> > > >> buffer) and WAL. if the transaction doesnt go with sync, it will also
> > > >> commit now,but the write to WAL will NOT be sync(by calling fdatasync).
> > > >> That means we may lose data if power failure/kernel panic. This is why
> > > >> i changed the default rocksdb-disable-sync from true to false in
> > > >> previous patch.
> > > >
> > > > Yeah, I'm confused.  :)
> > > >
> > > > So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore.  It's different for filestore, which is doing a syncfs checkpoint.  Perhaps we should have newstore set that explicitly instead of passing through a config option.
> > > >
> > > > In any case, though, I'm confused by
> > > >
> > > >> if the transaction doesnt go with sync, it will also commit now,but
> > > >> the write to WAL will NOT be sync(by calling fdatasync).
> > > >
> > > > What does it mean to 'commit' but not call fdatasync?  What does commit mean in this case?
> > > >
> > > > And, and I correct in understanding that we have
> > > >
> > > >  queue_transaction -- queue a transaction but don't block waiting for fdatasync  queue_transaction_sync -- queue transaction and wait until it is stable on disk
> > > >
> > > > to work with?
> > > >
> > > > Thanks!
> > > > sage
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > > Wheat
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> > >
> >��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f