RE: bluestore locking

<< Replying inline, marked with [Somnath]

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: Thursday, August 11, 2016 9:04 AM
To: Somnath Roy
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: bluestore locking

After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:

These 'forward lookup' structures are protected by coll->lock:

 Bnode::blob_map -> Blob                                    coll->lock
 Onode::blob_map -> Blob                                    coll->lock

These ones are protected by cache->lock:

 Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
 Blob::bc -> Buffer                                             cache->lock

The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:

 Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
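
To make that mapping concrete, here is a minimal standalone model of the ownership and locking relationships above (member names mirror the lists; the lock types, key types, and smart pointers are simplifications for illustration, not the real BlueStore declarations):

  #include <map>
  #include <memory>
  #include <mutex>
  #include <shared_mutex>
  #include <string>
  #include <unordered_map>

  struct Buffer {};                                        // cached data extent

  struct Blob {
    std::map<uint64_t, std::unique_ptr<Buffer>> bc;        // BufferSpace: cache->lock
  };

  struct Onode {
    std::map<int64_t, std::unique_ptr<Blob>> blob_map;     // coll->lock
  };

  struct Bnode {
    std::map<int64_t, std::unique_ptr<Blob>> blob_map;     // coll->lock
  };

  struct BnodeSet {
    std::mutex lock;                                       // BnodeSet::lock
    std::unordered_map<uint64_t, Bnode*> uset;             // stands in for the intrusive set
  };

  struct Cache {
    std::mutex lock;                                       // cache->lock
  };

  struct Collection {
    std::shared_mutex lock;                                // coll->lock (a RWLock in Ceph)
    Cache *cache = nullptr;                                // shared cache shard
    BnodeSet bnode_set;
    std::unordered_map<std::string, std::shared_ptr<Onode>> onode_map;  // OnodeSpace: cache->lock
  };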

Anyway, the point of this is that the cache trim() can do everything it needs with just cache->lock.  That means that during an update, we normally have coll->lock to protect the structures we're touching, and if we are adding onodes to OnodeSpace or BufferSpace we additionally take
cache->lock for the appropriate cache fragment.
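
A rough sketch of that update/trim discipline, reusing the simplified types above (add_onode() and trim() here are illustrative stand-ins, not the actual BlueStore functions):

  // Update path: coll->lock protects the forward-lookup structures; cache->lock
  // is additionally taken only while touching OnodeSpace / BufferSpace.
  void add_onode(Collection& c, const std::string& oid, std::shared_ptr<Onode> o) {
    std::unique_lock<std::shared_mutex> cl(c.lock);        // coll->lock (write)
    {
      std::lock_guard<std::mutex> l(c.cache->lock);        // cache->lock
      c.onode_map[oid] = std::move(o);
    }
    // ... further blob_map/Bnode updates continue under coll->lock only ...
  }

  // Trim path: only cache->lock is needed, and onodes that still have
  // outstanding references are left alone.
  void trim(Collection& c) {
    std::lock_guard<std::mutex> l(c.cache->lock);
    for (auto it = c.onode_map.begin(); it != c.onode_map.end(); ) {
      if (it->second.use_count() == 1)                     // no refs outside the cache
        it = c.onode_map.erase(it);
      else
        ++it;
    }
  }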

We were getting tripped up from the blob_map iteration in _txc_state_proc because we were holding no locks at all (or, previously, a collection lock that may or may not be the right one).  Igor's PR fixes this by making Blob refcounted and keeping a list of these.  The finish_write() function takes cache->lock as needed.  Also, it's worth pointing out that the blobs that we're touching will all exist under an Onode that is in the onodes list, and it turns out that the trim() is already doing the right thing and not trimming Onodes that still have any refs.
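
A small self-contained sketch of that approach as I understand it (the names and types are illustrative, not the PR's actual code): the txc holds refs to the blobs it wrote, so the completion path can walk that list under cache->lock without touching the onode/collection structures.

  #include <list>
  #include <memory>
  #include <mutex>

  struct Cache { std::mutex lock; };                 // cache->lock

  struct Blob {
    void finish_write() { /* move buffers from writing to clean, etc. */ }
  };

  struct TransContext {
    std::list<std::shared_ptr<Blob>> blobs;          // refs keep the blobs alive
  };

  // Roughly what _txc_state_proc can do once the blobs are refcounted: no
  // coll->lock is needed, only cache->lock around the buffer-state changes.
  void txc_finish_writes(TransContext& txc, Cache& cache) {
    std::lock_guard<std::mutex> l(cache.lock);
    for (auto& b : txc.blobs)
      b->finish_write();
  }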

[Somnath] We are not seeing the crash in the current code because it is protected by the first_collection lock in _txc_state_*. But that patch opened up a deadlock like this:

Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished txc
Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock

There is a small window because txcs are added to the o->flush_txns list after _txc_add releases coll->Wlock().
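
For illustration, a minimal standalone reproduction of that lock/wait cycle (not Ceph code; the names are made up): thread 1 takes the collection write lock and then waits for outstanding txcs to drain, while the completion path needs the collection read lock before it can signal.

  #include <condition_variable>
  #include <list>
  #include <mutex>
  #include <shared_mutex>

  struct TransContext {};

  std::shared_mutex coll_lock;                    // coll->lock
  std::mutex flush_lock;
  std::condition_variable flush_cond;
  std::list<TransContext*> flush_txns;            // o->flush_txns

  // Thread 1: update path.  Holds coll->Wlock, then blocks in onode->flush()
  // until every txc on flush_txns has completed.
  void update_path() {
    std::unique_lock<std::shared_mutex> cl(coll_lock);       // coll->Wlock
    std::unique_lock<std::mutex> fl(flush_lock);
    flush_cond.wait(fl, [] { return flush_txns.empty(); });  // onode->flush()
    // never returns: the completion below can't get past coll->Rlock
  }

  // Thread 2: _txc_finish_io -> _txc_state_*.  Needs coll->Rlock before it can
  // finish the txc and signal the flush, but thread 1 holds the write lock, so
  // neither thread makes progress.
  void txc_finish_io(TransContext* txc) {
    std::shared_lock<std::shared_mutex> cl(coll_lock);       // coll->Rlock: blocks
    std::lock_guard<std::mutex> fl(flush_lock);
    flush_txns.remove(txc);
    flush_cond.notify_all();
  }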

BTW, the crash is happening because the iterator got invalidated, not because a blob was deleted (though eventually we may hit that too). Making Blob refcounted will not solve the problem of the blob being erased from the blob_map set while we iterate; we need to protect that with a lock.
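
A minimal standalone illustration of that failure mode (not the actual blob_map code, and collapsed into one thread for brevity; in BlueStore the erase races in from another thread, with the same effect on the iterating side): refcounting keeps the Blob object alive, but the iterator pointing into the container still becomes invalid the moment the entry is erased.

  #include <cstdint>
  #include <map>
  #include <memory>

  struct Blob { int data = 0; };

  int main() {
    std::map<int64_t, std::shared_ptr<Blob>> blob_map;
    blob_map[1] = std::make_shared<Blob>();
    blob_map[2] = std::make_shared<Blob>();

    auto it = blob_map.begin();
    std::shared_ptr<Blob> pinned = it->second;  // the refcount keeps the Blob alive...
    blob_map.erase(it);                         // ...but the iterator is now invalid

    ++it;   // undefined behaviour: this is the kind of crash described above, and
            // only a lock (or equivalent) around iterate-vs-erase prevents it
    return 0;
  }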

Which leaves me a bit confused as to how we originally were crashing, because we were taking the first_collection lock.  My guess is that first_collection was not the right collection and a racing update was updating the Onode.  The refcounting alone sorts this out.  My other fix would have also resolved it by taking the correct collection's lock, I think.  Unless there is another locking problem I'm not seeing..
but I think what is in master now has it covered.

[Somnath] As I mentioned above, after taking the first_collection lock we were not crashing anymore; it was the deadlock we were hitting. So the fix would be to protect blob_map with a lock. If it is coll->lock we hit the deadlock; if it is cache->lock we need to see. The PR I sent introduces a blob_map lock and fixes this.
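
For reference, one possible shape of such a blob_map lock, sketched here as a guess rather than a description of the actual PR (the member name and granularity are assumptions): both the completion-side iteration and any update-side erase take the same mutex, independent of coll->lock, so the deadlock above cannot form.

  #include <cstdint>
  #include <map>
  #include <memory>
  #include <mutex>

  struct Blob { void finish_write() {} };

  struct Onode {
    std::mutex blob_map_lock;                          // hypothetical dedicated lock
    std::map<int64_t, std::shared_ptr<Blob>> blob_map;
  };

  // Completion side (roughly what the iteration in _txc_state_* needs):
  void finish_blobs(Onode& o) {
    std::lock_guard<std::mutex> l(o.blob_map_lock);
    for (auto& p : o.blob_map)
      p.second->finish_write();
  }

  // Update side: erases take the same lock, so they cannot invalidate the
  // iteration above mid-flight, and no coll->lock is involved here.
  void remove_blob(Onode& o, int64_t id) {
    std::lock_guard<std::mutex> l(o.blob_map_lock);
    o.blob_map.erase(id);
  }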

But Mark was hitting an onode corruption, and initially I thought it was because trim() is *not invoked* within coll->lock() from _osr_reap_done(). The fix in the PR seems to be working for him, and if so, trim() is the one deleting that onode and we don't have an explanation for how it happened. Is there any place in the code other than trim() that can delete an onode? I couldn't find any..

Igor, Somnath, does the current strategy make sense?

sage
