RE: bluestore locking

Got it, I didn't know we are no longer iterating over blob_map in the finish-io path. I looked at the latest master and it seems fine.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx] 
Sent: Thursday, August 11, 2016 12:26 PM
To: Somnath Roy
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: bluestore locking

On Thu, 11 Aug 2016, Somnath Roy wrote:
> Sage,
> Probably, I am missing something.
> The blob ref count will protect the blob from being deleted. But it
> will still be erased from blob_map, right?

The only two ways it would get removed from blob_map would be

 - Another thread is holding the collection lock and doing an update.
   This used to be a problem, but isn't anymore, since we don't iterate
   over blob_map; we have a list of BlobRefs.

 - We are tearing down the Onode because it is trimmed.  That won't
   happen here, because any blobs in our txc blob list belong to onodes
   (or, indirectly, bnodes) referenced by the txc onodes list, and
   onodes with outstanding refs are never trimmed.
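
A minimal sketch of why holding refs makes this safe, using
std::shared_ptr as a stand-in for the intrusive BlobRef refcounting
(the names are illustrative, not the actual BlueStore types):

  #include <cassert>
  #include <cstdint>
  #include <map>
  #include <memory>
  #include <string>

  struct Blob { std::string data = "extent info"; };
  using BlobRef = std::shared_ptr<Blob>;   // stand-in for the refcounted ptr

  int main() {
    std::map<int64_t, BlobRef> blob_map;   // the forward-lookup structure
    blob_map[1] = std::make_shared<Blob>();

    BlobRef pinned = blob_map[1];          // the txc takes its own ref

    blob_map.erase(1);                     // a racing update drops the entry
    assert(pinned->data == "extent info"); // ...but the Blob itself lives on
    return 0;
  }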

sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Thursday, August 11, 2016 9:58 AM
> To: Somnath Roy
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: bluestore locking
> 
> On Thu, 11 Aug 2016, Somnath Roy wrote:
> > <<inline with [Somnath]
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Thursday, August 11, 2016 9:04 AM
> > To: Somnath Roy
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: bluestore locking
> >
> > After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:
> >
> > These 'forward lookup' structures are protected by coll->lock:
> >
> >  Bnode::blob_map -> Blob                                    coll->lock
> >  Onode::blob_map -> Blob                                    coll->lock
> >
> > These ones are protected by cache->lock:
> >
> >  Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
> >  Blob::bc -> Buffer                                             cache->lock
> >
> > The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:
> >
> >  Collection::BnodeSet::uset -> Bnode (intrusive set)       BnodeSet::lock
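> >
> > As a rough C++ sketch (field and container types only approximate the
> > real BlueStore code; the lock annotations are the point):
> >
> >   #include <cstdint>
> >   #include <map>
> >   #include <memory>
> >   #include <mutex>
> >   #include <shared_mutex>
> >   #include <string>
> >   #include <unordered_map>
> >
> >   struct Buffer {};
> >
> >   struct Blob {
> >     std::map<uint64_t, std::unique_ptr<Buffer>> bc;     // cache->lock
> >   };
> >
> >   struct Onode {
> >     std::map<int64_t, std::unique_ptr<Blob>> blob_map;  // coll->lock
> >   };
> >
> >   struct Bnode {
> >     std::map<int64_t, std::unique_ptr<Blob>> blob_map;  // coll->lock
> >   };
> >
> >   struct Cache {
> >     std::mutex lock;  // cache->lock: guards onode_map and every Blob::bc
> >   };
> >
> >   struct Collection {
> >     std::shared_mutex lock;    // coll->lock (an RWLock in the real code)
> >     Cache* cache = nullptr;
> >     std::unordered_map<std::string,
> >                        std::shared_ptr<Onode>> onode_map; // cache->lock
> >     std::mutex bnode_set_lock; // BnodeSet::lock, guards the intrusive set
> >   };
> >
> >   int main() { Collection c; (void)c; return 0; }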
> >
> > Anyway, the point of this is that the cache trim() can do everything 
> > it needs with just cache->lock.  That means that during an update, 
> > we normally have coll->lock to protect the structures we're 
> > touching, and if we are adding onodes to OnodeSpace or BufferSpace 
> > we additionally take cache->lock for the appropriate cache fragment.
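> >
> > The ordering, sketched with standard stand-in lock types (not the
> > actual BlueStore code):
> >
> >   #include <mutex>
> >   #include <shared_mutex>
> >
> >   std::shared_mutex coll_lock;  // stand-in for coll->lock
> >   std::mutex cache_lock;        // stand-in for the fragment's cache->lock
> >
> >   void update() {
> >     std::unique_lock wl(coll_lock);    // guards blob_map and friends
> >     // ... modify Onode::blob_map / Bnode::blob_map ...
> >     {
> >       std::lock_guard cl(cache_lock);  // only while touching OnodeSpace
> >       // ... insert into onode_map / BufferSpace ...
> >     }
> >   }
> >
> >   void trim() {
> >     std::lock_guard cl(cache_lock);    // trim() needs cache->lock alone
> >     // ... evict unreferenced Onodes and clean Buffers ...
> >   }
> >
> >   int main() { update(); trim(); return 0; }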
> >
> > We were getting tripped up by the blob_map iteration in
> > _txc_state_proc because we were holding no locks at all (or,
> > previously, a collection lock that may or may not have been the right
> > one).  Igor's PR fixes this by making Blob refcounted and keeping a
> > list of these; the finish_write() function takes cache->lock as
> > needed.  Also, it's worth pointing out that the blobs we're touching
> > will all exist under an Onode that is in the onodes list, and it
> > turns out that trim() is already doing the right thing and not
> > trimming Onodes that still have any refs.
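> >
> > The shape of that fix, as a sketch with std types (not Igor's actual
> > code):
> >
> >   #include <cstdint>
> >   #include <list>
> >   #include <memory>
> >   #include <mutex>
> >
> >   std::mutex cache_lock;              // stand-in for cache->lock
> >
> >   struct Blob {
> >     void finish_write(uint64_t seq) {
> >       std::lock_guard l(cache_lock);  // Blob::bc is cache->lock protected
> >       // ... move buffers for this seq from writing to clean ...
> >     }
> >   };
> >   using BlobRef = std::shared_ptr<Blob>;
> >
> >   struct TransContext {
> >     std::list<BlobRef> blobs_written; // refs taken at write time
> >     uint64_t seq = 0;
> >   };
> >
> >   // Called with no collection lock held: safe, because we walk our
> >   // own ref list instead of any Onode's blob_map.
> >   void txc_finish_io(TransContext& txc) {
> >     for (auto& b : txc.blobs_written)
> >       b->finish_write(txc.seq);
> >   }
> >
> >   int main() {
> >     TransContext t;
> >     t.blobs_written.push_back(std::make_shared<Blob>());
> >     txc_finish_io(t);
> >     return 0;
> >   }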
> >
> > [Somnath] We are not seeing the crash in the current code because the
> > path is protected by the first_collection lock in _txc_state_*. But
> > that patch opened up a deadlock like this:
> >
> > Thread 1 -> coll->Wlock -> onode->flush() -> waits for unfinished txc
> > Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
> >
> > There is a small window, as txcs are added to the o->flush_txns list
> > after _txc_add releases coll->Wlock().
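> >
> > Schematically, with a stand-in lock type (if the two paths really
> > interleave like this, they deadlock):
> >
> >   #include <shared_mutex>
> >
> >   std::shared_mutex coll_lock;       // stand-in for coll->lock
> >
> >   void thread1_write_path() {
> >     std::unique_lock wl(coll_lock);  // coll->Wlock
> >     // onode->flush(): waits for every txc on o->flush_txns, which
> >     // only thread 2 can finish -- and thread 2 is stuck below.
> >   }
> >
> >   void thread2_txc_finish_io() {
> >     std::shared_lock rl(coll_lock);  // coll->Rlock: blocks on the Wlock
> >     // _txc_state_*(): would complete the txc and wake thread 1.
> >   }
> >
> >   // Schematic only: running both functions concurrently as described
> >   // deadlocks by construction.
> >   int main() { return 0; }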
> >
> > BTW, the crash is happening because the iterator got invalidated, not
> > because the blob was deleted (though we may eventually hit that too).
> > Making the blob ref-counted does not solve the problem of the blob
> > being erased from the blob_map set. We need to protect that with a
> > lock.
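> >
> > A tiny single-threaded illustration of the invalidation (refcounting
> > keeps the element alive, but not the iterator valid):
> >
> >   #include <set>
> >
> >   int main() {
> >     std::set<int> blob_map = {1, 2, 3};
> >     auto it = blob_map.find(2);
> >     blob_map.erase(2);  // a racing update erases the element we point at
> >     (void)it;           // using or advancing 'it' now (e.g. ++it) is
> >                         // undefined behavior: the iterator is invalid,
> >                         // whether or not the Blob itself is refcounted
> >     return 0;
> >   }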
> 
> Well, the blob refcounting certainly fixes the deadlock, because we no longer need the collection lock in that path.  So the remaining question is whether there is still a locking issue...
> 
> > Which leaves me a bit confused as to how we originally were 
> > crashing, because we were taking the first_collection lock.  My 
> > guess is that first_collection was not the right collection and a 
> > racing update was updating the Onode.  The refcounting alone sorts 
> > this out.  My other fix would have also resolved it by taking the 
> > correct collection's lock, I think.  Unless there is another locking 
> > problem I'm not seeing.. but I think what is in master now has it covered.
> 
> And I think there isn't.  We no longer use an iterator (since we have 
> the list of blobs to fix up), and the blob finish_write stuff is 
> protected by cache->lock.  So I think this should resolve it.  But.. 
> whatever workload you were using that triggered this before, perhaps 
> you can run it again?
> 
> > [Somnath] As I mentioned above, after taking the first_collection
> > lock we were not crashing anymore; it was the deadlock we were
> > hitting. So the fix would be to protect blob_map with a lock. If it
> > is coll->lock, we hit the deadlock; if it is cache->lock, we need to
> > see. The PR I sent introduces a blob_map lock and fixes this; a
> > sketch of the idea follows.
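> >
> > Roughly (the lock name and placement here are my sketch, not the PR's
> > actual code):
> >
> >   #include <cstdint>
> >   #include <map>
> >   #include <memory>
> >   #include <mutex>
> >
> >   struct Blob {};
> >
> >   struct Onode {
> >     std::mutex blob_map_lock;  // dedicated lock guarding blob_map
> >     std::map<int64_t, std::shared_ptr<Blob>> blob_map;
> >
> >     void erase_blob(int64_t id) {
> >       std::lock_guard l(blob_map_lock);  // writers take it...
> >       blob_map.erase(id);
> >     }
> >
> >     template <typename Fn>
> >     void for_each_blob(Fn fn) {
> >       std::lock_guard l(blob_map_lock);  // ...and so does the finish-io
> >       for (auto& kv : blob_map)          // walk, so no iterator can be
> >         fn(*kv.second);                  // invalidated under us
> >     }
> >   };
> >
> >   int main() {
> >     Onode o;
> >     o.blob_map[0] = std::make_shared<Blob>();
> >     o.for_each_blob([](Blob&) {});
> >     o.erase_blob(0);
> >     return 0;
> >   }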
> >
> > But Mark was hitting an onode corruption, and initially I thought it
> > was because trim() is *not invoked* within coll->lock() from
> > _osr_reap_done(). The fix in the PR seems to be working for him; if
> > so, trim() is the one deleting that onode, and we have no explanation
> > for how that happened. Is there any place in the code other than
> > trim() that can delete an onode? I couldn't find any..
> 
> What is the workload here?  And can you retest with master now?  Trim 
> shouldn't need any caller-held lock at all (it takes cache->lock on 
> its own).  The open question is whether there are other paths that are 
> touching cache->lock-protected things without taking cache->lock.  I 
> don't think so, but verifying with the problematic workloads will give 
> us more confidence...
> 
> sage
> 
> 


