RE: Reply: Re: NewStore performance analysis

Hi Sage,
	Well, the semantics are:
		submit_transaction -- submit a transaction; whether it blocks waiting for fdatasync depends on rocksdb-disable-sync.
		submit_transaction_sync -- queue a transaction and wait until it is stable on disk.
	So if we default rocksdb-disable-sync to false, the two APIs are the same. I haven't looked at LevelDB, but I suspect it is similar.
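
	To make the two behaviours concrete, here is a minimal toy model of the semantics described above. ToyKV and all of its members are hypothetical names for illustration only; this is not the Ceph KeyValueDB or RocksDB API, and the fdatasync is simulated with a counter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy model, NOT the Ceph KeyValueDB or RocksDB API.
// It only counts how many simulated fdatasync calls each submit path issues.
struct ToyKV {
    bool disable_sync;             // models the rocksdb-disable-sync option
    int fdatasync_calls = 0;       // simulated fdatasync counter
    std::vector<std::string> wal;  // simulated WAL contents

    explicit ToyKV(bool disable) : disable_sync(disable) {}

    // Appends to the WAL; syncs only when disable_sync is false.
    void submit_transaction(const std::string& txn) {
        wal.push_back(txn);
        if (!disable_sync) ++fdatasync_calls;
    }

    // Appends to the WAL and always syncs.
    void submit_transaction_sync(const std::string& txn) {
        wal.push_back(txn);
        ++fdatasync_calls;
    }
};
```

	With disable_sync set to false, both calls pay for one sync each, so in this model they are indistinguishable; only with disable_sync set to true does submit_transaction skip the sync.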

	I just re-read the NewStore code, and the workflow does not seem to be what we want. We issue a bunch of submit_transaction calls, and in _kv_sync_thread we try to create a checkpoint that ensures the previous transactions are persistent, by using submit_transaction_sync to submit an empty transaction. But actually:
	1. submit_transaction is already a synchronous call (with rocksdb-disable-sync=false), so the empty transaction in _kv_sync_thread is wasted work.
	2. A sync transaction cannot ensure that the previous transactions are also synced. The API doesn't guarantee this, and in the implementation the two transactions may go to different WAL files.

	Yes, if we want, we can have a queue and a thread that collect the transactions and merge them into one big transaction; this would save some ::fdatasync calls. But this approach looks complex.
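
	A rough sketch of that queue-and-merge idea, under stated assumptions: GroupCommitQueue and its members are hypothetical, the dedicated thread that would call flush() in a loop is left out to keep the example deterministic, and the sync is again just a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical group-commit sketch; none of these names come from NewStore or RocksDB.
struct GroupCommitQueue {
    std::vector<std::string> pending;  // transactions waiting to be merged
    std::vector<std::string> durable;  // transactions covered by a sync
    int fdatasync_calls = 0;           // simulated fdatasync counter

    // Called by submitters: only queues the transaction, no sync here.
    void submit(std::string txn) { pending.push_back(std::move(txn)); }

    // Called by the commit thread (not shown): merge everything queued so far
    // into one write and pay for a single simulated fdatasync.
    // Returns the number of transactions in the merged batch.
    std::size_t flush() {
        if (pending.empty()) return 0;
        std::size_t n = pending.size();
        for (auto& t : pending) durable.push_back(std::move(t));
        pending.clear();
        ++fdatasync_calls;
        return n;
    }
};
```

	Five queued transactions followed by one flush() cost one simulated sync instead of five. The complexity mentioned above comes from what the sketch omits: the thread, the locking, and notifying each submitter when its transaction is durable.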

	Some optimizations I have in mind:
	1. Batch the cleanup operations in _apply_wal_transaction. We don't need to remove the WAL item synchronously; we can just put it into kv_sync_thread_Q and let kv_sync_thread form a batch transaction that deletes a bunch of keys.
	2. We don't need the empty transaction in kv_sync_thread; we could call _txc_kv_finish_kv directly from _txc_submit_kv, since the KV submit is synchronous.
	3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better describe its work.
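
	The first optimization above could look roughly like the following. This is only an illustration: ToyStore, queue_wal_cleanup, run_cleanup_batch and the key names are hypothetical, and a std::map stands in for the KV database.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of batched WAL cleanup; not the actual NewStore code.
struct ToyStore {
    std::map<std::string, std::string> kv;  // stands in for the KV database
    std::vector<std::string> cleanup_q;     // WAL keys waiting for deletion
    int transactions = 0;                   // number of KV transactions issued

    // After a WAL item is applied, queue its key instead of deleting it synchronously.
    void queue_wal_cleanup(const std::string& key) { cleanup_q.push_back(key); }

    // One iteration of the cleanup thread: delete all queued keys in a single
    // transaction. Returns how many keys were actually removed.
    std::size_t run_cleanup_batch() {
        if (cleanup_q.empty()) return 0;
        std::size_t removed = 0;
        for (const auto& k : cleanup_q) removed += kv.erase(k);
        cleanup_q.clear();
        ++transactions;
        return removed;
    }
};
```

	Three applied WAL items then cost one delete transaction rather than three, which is where the saving would come from.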

	What do you think?

															Xiaoxi
-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx] 
Sent: Tuesday, April 21, 2015 12:48 AM
To: Chen, Xiaoxi
Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
Subject: Re: Reply: Re: NewStore performance analysis

On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > > An easy way to measure might be to comment out
> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to
> > > see if we can get more QD in the fragment part without issuing
> > > the DB.
> > 
> > I'm not sure I totally understand the interface.. my assumption is 
> > that queue_transaction will give rocksdb the txn to commit whenever 
> > it finds it convenient (no idea what policy is used there) and 
> > queue_transaction_sync will trigger a commit now.  If we did have
> > multiple threads doing queue_transaction_sync (by, say, calling it
> > directly in _txc_submit_kv) would QD go up?
> > 
> I think you might be missing something: currently the two interfaces are
> exactly the SAME unless you set rocksdb-disable-sync=true (it is
> false by default).
>
> On commit, rocksdb writes the content to both the memtable (write
> buffer) and the WAL. If the transaction is not submitted with sync, it
> still commits immediately, but the write to the WAL will NOT be synced
> (no fdatasync call). That means we may lose data on power failure or
> kernel panic. This is why I changed the default rocksdb-disable-sync
> from true to false in a previous patch.

Yeah, I'm confused.  :)

So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore.  It's different for filestore, which is doing a syncfs checkpoint.  Perhaps we should have newstore set that explicitly instead of passing through a config option.

In any case, though, I'm confused by

> if the transaction doesn't go with sync, it will also commit now, but
> the write to WAL will NOT be sync (by calling fdatasync).

What does it mean to 'commit' but not call fdatasync?  What does commit mean in this case?

And, am I correct in understanding that we have

 queue_transaction -- queue a transaction but don't block waiting for fdatasync
 queue_transaction_sync -- queue transaction and wait until it is stable on disk

to work with?

Thanks!
sage




