RE: mempool tracking

In general, I'm in agreement with the goal of avoiding unnecessary branches. It's especially annoying to put checks in the mainline code for what ought to be handled by initialization code.

However, in this case I doubt you're really gaining much, and you may be biting off more than you think once libraries, etc. get involved. So I'd recommend the simple run-time checks. Realistically, one more run-time branch in the path of a memory allocation isn't going to be significant.

You may be able to eliminate one of the tests from the main path if you assume that pthread_getspecific() on a virgin key (i.e., a key that has only been zero-initialized at compile time, before the needed call to pthread_key_create()) returns NULL. It's not clear from the documentation whether that's the case (somebody familiar with this might be able to chime in here...).
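
To make that concrete, here's a minimal sketch (hypothetical, not the actual mempool code; registration of the block in any global list is elided) of what the main path could collapse to if that assumption holds -- the single NULL test then covers both "key never created" and "key created but this thread not yet set up":

#include <pthread.h>

struct per_thread_stats { long num_objects = 0, num_bytes = 0; };

static pthread_key_t g_stats_key;     // zero-initialized "virgin" key
static bool g_key_created = false;    // only consulted on the slow path

// Slow path: create the key on first use (with a destructor so the block is
// freed on thread exit), then set up this thread's block.
static per_thread_stats *slow_path_init() {
  if (!g_key_created) {
    pthread_key_create(&g_stats_key,
        [](void *p) { delete static_cast<per_thread_stats *>(p); });
    g_key_created = true;
  }
  auto *s = new per_thread_stats();
  pthread_setspecific(g_stats_key, s);
  return s;
}

// Main path: one branch, *assuming* pthread_getspecific() on the virgin key
// returns NULL -- which POSIX doesn't actually promise, hence the question.
static inline per_thread_stats *get_thread_stats() {
  void *p = pthread_getspecific(g_stats_key);
  return p ? static_cast<per_thread_stats *>(p) : slow_path_init();
}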

Allen Samuels  
R&D Engineering Fellow 

Western Digital®
Email:  allen.samuels@xxxxxxx 
Office:  +1-408-801-7030
Mobile: +1-408-780-6416 

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Tuesday, February 20, 2018 12:19 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: mempool tracking
> 
> On Tue, 20 Feb 2018, Allen Samuels wrote:
> > This approach will certainly convert the existing atomic
> > increment/decrement into normal increment/decrement.
> >
> > If you're willing to tie yourself to pthreads (rather than
> > compiler-based thread_local) you can avoid the leakage of per-thread
> > stat structures through the use of the optional destructor in
> > pthread_key_create. This allows you to convert the existing code without
> > locating/modifying all thread creation/destruction sites.
> 
> I think I'm actually leaning toward forcing an update to the thread
> create/destroy code or entrypoints to avoid having any conditionals in the
> mempool alloc/dealloc paths.  We'll get a segv if we miss one, but it'll
> be less code in the hot paths.  I think everything still goes through
> Thread.h with the exception of rocksdb, and even that would be easy enough
> to change if there were callbacks or something we were worried about
> (pretty sure there aren't).
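>
> Roughly, the entrypoint change could look something like the sketch below
> (illustrative only, not the actual Thread.h code; the two hooks are
> hypothetical names for the setup/teardown described in the quoted mail
> below):
>
> struct Thread { virtual void *entry() = 0; };  // stand-in for common/Thread
>
> void mempool_thread_init();      // hypothetical: allocate this thread's stats, link it in
> void mempool_thread_shutdown();  // hypothetical: unlink/retire it on thread exit
>
> // The one place per-thread setup/teardown happens, so the mempool
> // alloc/dealloc paths themselves never need a branch.
> void *thread_entry_trampoline(void *arg) {
>   mempool_thread_init();
>   void *r = static_cast<Thread *>(arg)->entry();  // the real thread body
>   mempool_thread_shutdown();
>   return r;
> }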
> 
> Oh... but we also have the librados/librbd/libcephfs code to think about.
> There we don't have the same control.  That might be why we ended up
> where we did!
> 
> sage
> 
> 
> >
> > (i.e., in "pick_a_shard()" of the current implementation, you need to test:
> >
> > (1) Has pthread_key_create been invoked once in the process? (You have to
> > do this here, rather than in "main", to handle the case of a global
> > initialization call that allocates something in a mempool.)
> > (2) If not, create the key. [Probably don't have to lock this, because
> > global initialization isn't multi-threaded, but "main" ought to do at least
> > one mempool allocation to ensure that it happens before we start
> > multi-threading.....]
> > (3) If pthread_getspecific(...) is NULL, then allocate the per-thread
> > structure for this new thread and set it up.... (will need to lock for your
> > linked list).
> >
> > That's pretty much it. Oh, don't forget the destructor :)
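> >
> > In code, those three steps might look roughly like the sketch below (the
> > names, the std::mutex standing in for a spinlock, and the stubbed-out
> > destructor are illustrative only):
> >
> > #include <cstdint>
> > #include <mutex>
> > #include <pthread.h>
> >
> > struct per_thread_stats {
> >   int64_t num_objects = 0, num_bytes = 0;
> >   per_thread_stats *next = nullptr;   // global list of all threads' blocks
> > };
> >
> > static per_thread_stats *g_all_threads = nullptr;
> > static std::mutex g_thread_list_lock;
> > static pthread_key_t g_stats_key;
> > static bool g_key_created = false;
> >
> > per_thread_stats *get_per_thread_stats() {
> >   // (1) + (2): create the key once; unlocked, on the assumption that
> >   // global initialization is single-threaded (see the caveat about "main").
> >   if (!g_key_created) {
> >     pthread_key_create(&g_stats_key,
> >         [](void *) { /* destructor: unlink/free or recycle the block */ });
> >     g_key_created = true;
> >   }
> >   // (3): first call on this thread -- allocate, link into the global list
> >   // under the lock, and stash the pointer in the key.
> >   auto *s = static_cast<per_thread_stats *>(pthread_getspecific(g_stats_key));
> >   if (!s) {
> >     s = new per_thread_stats();
> >     {
> >       std::lock_guard<std::mutex> guard(g_thread_list_lock);
> >       s->next = g_all_threads;
> >       g_all_threads = s;
> >     }
> >     pthread_setspecific(g_stats_key, s);
> >   }
> >   return s;
> > }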
> >
> > Provided that the # of shards in the existing scheme is sufficiently high
> > (so as to make collisions rare), you ought to see about the same
> > performance. However, since the existing sharding scheme isn't dynamic
> > (i.e., the same threads always get the same shards, so the same collisions
> > repeat themselves), the new scheme will eliminate the odd persistent
> > collision that would otherwise keep happening.
> >
> > I think it's a good idea and shouldn't be more than a few minutes of coding.
> >
> > Allen Samuels
> > R&D Engineering Fellow
> >
> > Western Digital®
> > Email:  allen.samuels@xxxxxxx
> > Office:  +1-408-801-7030
> > Mobile: +1-408-780-6416
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > > Sent: Tuesday, February 20, 2018 6:55 AM
> > > To: Allen Samuels <Allen.Samuels@xxxxxxx>
> > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: mempool tracking
> > >
> > > Hi Allen,
> > >
> > > I was thinking a bit about the mempool tracking implementation and if
> > > there was a reason we couldn't do something simpler with thread local
> > > variables.  Maybe this came up before, but I'm not remembering.  I'm
> > > thinking something like
> > >
> > > struct per_thread_stats {
> > >   int64_t num_objects, num_bytes;
> > >   ...
> > >   per_thread_stats *next;
> > > };
> > >
> > > thread_local struct per_thread_stats *g_per_thread = nullptr;
> > > struct per_thread_stats *g_all_threads = nullptr;
> > > ceph_spinlock_t g_thread_list_spinlock;
> > >
> > > - The alloc/dealloc calls can just update the g_per_thread struct directly
> > > (don't even need to use atomics, probably?).
> > >
> > > - The method to read stats can walk the g_all_threads linked list - on
> > > thread create/exit we allocate the thread local struct and add it to the
> > > list.
> > >
> > > - spinlock would protect linked list updates.
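> > >
> > > A rough sketch of those three pieces, building on the declarations above
> > > (illustrative; ceph_spin_lock()/ceph_spin_unlock() here are just assumed
> > > helpers for the ceph_spinlock_t, and thread-exit teardown isn't shown):
> > >
> > > // Called on thread create (e.g. from the common/Thread entrypoint).
> > > void mempool_thread_init() {
> > >   g_per_thread = new per_thread_stats();
> > >   ceph_spin_lock(&g_thread_list_spinlock);
> > >   g_per_thread->next = g_all_threads;
> > >   g_all_threads = g_per_thread;
> > >   ceph_spin_unlock(&g_thread_list_spinlock);
> > > }
> > >
> > > // Alloc hot path: plain, non-atomic updates to this thread's struct.
> > > inline void note_alloc(int64_t bytes) {
> > >   g_per_thread->num_objects++;
> > >   g_per_thread->num_bytes += bytes;
> > > }
> > >
> > > // Stats reader: walk the list under the spinlock and sum.
> > > int64_t total_bytes() {
> > >   int64_t total = 0;
> > >   ceph_spin_lock(&g_thread_list_spinlock);
> > >   for (per_thread_stats *p = g_all_threads; p; p = p->next)
> > >     total += p->num_bytes;
> > >   ceph_spin_unlock(&g_thread_list_spinlock);
> > >   return total;
> > > }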
> > >
> > > The main downside I see to this approach is that we kind of need to keep
> > > using the common/Thread wrapper to ensure the init/teardown happens.
> > > Anyone using std::thread would need to do it explicitly.
> > >
> > > (Or, the mempool stuff could have a check for a null value and, if so,
> > > allocate it dynamically. That would leave orphan items around for threads
> > > that got destroyed, though.)
> > >
> > > Anyway, this seems simpler than the hashing we do now.  It would also
> > > degenerate to a simple non-atomic inc/dec for the seastar threads, which
> > > would be nice.
> > >
> > > Am I missing something?
> > >
> > > sage