On Thu, 11 Aug 2016, Somnath Roy wrote:
> <<inline with [Somnath]
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Thursday, August 11, 2016 9:04 AM
> To: Somnath Roy
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: bluestore locking
>
> After our conversation this morning I went through the locking a bit
> more and I think we're in okay shape.  To summarize:
>
> These 'forward lookup' structures are protected by coll->lock:
>
>   Bnode::blob_map -> Blob                                       coll->lock
>   Onode::blob_map -> Blob                                       coll->lock
>
> These ones are protected by cache->lock:
>
>   Collection::OnodeSpace::onode_map -> Onode (unordered_map)    cache->lock
>   Blob::bc -> Buffer                                            cache->lock
>
> The BnodeSet is a bit different because it is depopulated when the last
> onode ref goes away, but it has its own lock:
>
>   Collection::BnodeSet::uset -> Bnode (intrusive set)           BnodeSet::lock
>
> Anyway, the point of this is that the cache trim() can do everything it
> needs with just cache->lock.  That means that during an update, we
> normally have coll->lock to protect the structures we're touching, and
> if we are adding onodes to OnodeSpace or BufferSpace we additionally
> take cache->lock for the appropriate cache fragment.
>
> We were getting tripped up by the blob_map iteration in
> _txc_state_proc because we were holding no locks at all (or, previously,
> a collection lock that may or may not be the right one).  Igor's PR
> fixes this by making Blob refcounted and keeping a list of them.  The
> finish_write() function takes cache->lock as needed.  Also, it's worth
> pointing out that the blobs we're touching will all exist under an
> Onode that is in the onodes list, and it turns out that trim() is
> already doing the right thing and not trimming Onodes that still have
> any refs.
>
> [Somnath] We are not seeing the crash in the current code because that
> path is protected by the first_collection lock in _txc_state_*.  But
> that patch opened up a deadlock like this:
>
>   Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished txc
>   Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
>
> There is a small window because txcs are added to the o->flush_txns
> list after _txc_add releases coll->Wlock().
>
> BTW, the crash is happening because the iterator got invalidated, not
> because a blob was deleted (though that may eventually hit us too).
> Making Blob refcounted does not solve the problem that the blob has
> been erased from the blob_map set; we need to protect that with a lock.

Well, the blob refcounting certainly fixes the deadlock, because we no
longer need the collection lock in that path.  So the remaining question
is whether there is still a locking issue...

> Which leaves me a bit confused as to how we were originally crashing,
> because we were taking the first_collection lock.  My guess is that
> first_collection was not the right collection and a racing update was
> updating the Onode.  The refcounting alone sorts this out.  My other fix
> would have also resolved it by taking the correct collection's lock, I
> think.  Unless there is another locking problem I'm not seeing.. but I
> think what is in master now has it covered.

And I think there isn't.  We no longer use an iterator (since we have the
list of blobs to fix up), and the blob finish_write stuff is protected by
cache->lock.  So I think this should resolve it.  But.. whatever workload
you were using that triggered this before, perhaps you can run it again?
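
To make the two-thread scenario above concrete, here is a minimal sketch of
the deadlock pattern being described.  The names (Onode, Collection,
update_path, txc_finish_io) are simplified stand-ins rather than the actual
BlueStore code, and std::shared_mutex stands in for the collection rwlock:

  // Minimal illustration of the deadlock pattern (hypothetical names, not
  // the actual BlueStore code).
  #include <condition_variable>
  #include <mutex>
  #include <set>
  #include <shared_mutex>

  struct TransContext {};                  // stand-in for BlueStore::TransContext

  struct Onode {
    std::mutex flush_lock;
    std::condition_variable flush_cond;
    std::set<TransContext*> flush_txns;    // txcs still in flight for this onode

    // Block until every in-flight txc touching this onode has finished.
    void flush() {
      std::unique_lock<std::mutex> l(flush_lock);
      flush_cond.wait(l, [this] { return flush_txns.empty(); });
    }
  };

  struct Collection {
    std::shared_mutex lock;                // the coll->lock discussed above
  };

  // Thread 1: an update path takes the collection write lock and then waits
  // for the onode's outstanding txcs to drain.
  void update_path(Collection& c, Onode& o) {
    std::unique_lock<std::shared_mutex> wl(c.lock);  // coll->Wlock
    o.flush();                             // sleeps while still holding Wlock
  }

  // Thread 2: txc completion needs the collection read lock before it can
  // retire the txc, so it blocks behind Thread 1's write lock and can never
  // wake Thread 1 up: deadlock.
  void txc_finish_io(Collection& c, Onode& o, TransContext* txc) {
    std::shared_lock<std::shared_mutex> rl(c.lock);  // coll->Rlock, blocks here
    {
      std::lock_guard<std::mutex> l(o.flush_lock);
      o.flush_txns.erase(txc);
    }
    o.flush_cond.notify_all();
  }

As noted above, once the blobs are refcounted the completion path no longer
needs the collection lock at all, which is what removes the deadlock.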
> [Somnath] As I mentioned above, after taking the first_collection lock
> we were not crashing anymore; it was the deadlock we were hitting.  So
> the fix would be to protect blob_map with a lock.  If it is coll->lock
> we hit the deadlock; if it is cache->lock we will have to see.  The PR
> I sent introduces a blob_map lock and fixes this.
>
> But Mark was hitting an onode corruption, and initially I thought it
> was because trim() is *not invoked* within coll->lock() from
> _osr_reap_done().  The fix in the PR seems to be working for him, and
> if so, trim() is the one deleting that onode and we don't have an
> explanation for how that happened.  Is there any place in the code
> other than trim() that can delete an onode?  I couldn't find any..

What is the workload here?  And can you retest with master now?

Trim shouldn't need any lock at all (it takes cache->lock on its own).
The open question is whether there are other paths that touch
cache->lock-protected things without taking cache->lock.  I don't think
so, but verifying with the problematic workloads will give us more
confidence...

sage
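
For readers following the thread, here is a rough sketch of the trim()
behaviour being discussed: trim() takes cache->lock internally, so callers
need no external locking, and onodes that still have outside references are
left alone.  The types, the LRU layout, and the use of std::shared_ptr are
invented for illustration and are not the real BlueStore cache code:

  #include <list>
  #include <memory>
  #include <mutex>

  struct Onode {
    // ...onode state elided...
  };
  using OnodeRef = std::shared_ptr<Onode>;

  struct Cache {
    std::mutex lock;                 // the cache->lock discussed above
    std::list<OnodeRef> onode_lru;   // cached onodes, coldest at the back

    // Trim the cache down to max_onodes.  trim() takes cache->lock itself;
    // callers do not need to hold coll->lock or anything else.
    void trim(size_t max_onodes) {
      std::lock_guard<std::mutex> l(lock);
      auto p = onode_lru.end();
      while (onode_lru.size() > max_onodes && p != onode_lru.begin()) {
        --p;
        // use_count() > 1 means something outside the cache (e.g. an
        // in-flight txc) still holds a ref; skip it, as described above.
        if (p->use_count() > 1)
          continue;
        // erase() returns the element after the erased one; the next --p
        // moves on to the next (hotter) candidate.
        p = onode_lru.erase(p);
      }
    }
  };

The point of the sketch is only that a caller such as _osr_reap_done() does
not need to wrap the call in coll->lock; the cache serializes itself.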