On Fri, Jan 12, 2018 at 1:37 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> A dream of mine is to kill off the onreadable callback in ObjectStore
> completely.  This is *mostly* a non-issue with BlueStore, with a few
> exceptions (creates/deletes still need a flush before they are visible
> to collection_list()), but because FileStore has no in-memory
> cache/state about its changes it is a long way from being possible.
>
> My thinking recently has been that at some point we'll care more about
> OSD code simplicity than FileStore performance (presumably at some
> point after we move a bunch of users over to BlueStore), and then we
> can implement some kludge in FileStore to hide this.
>
> Today I was looking at a weird ECBackend issue and it looks like we're
> still doing dual acks there for commit and readable... all in order to
> make sure that a read that follows a write will order properly (even
> on replicas).  That means double the messages over the wire to track
> something that can be done with a simple map of in-flight objects in
> memory.  But instead of implementing that just in ECBackend,
> implementing the same thing in FileStore would kill 100 birds with one
> stone.
>
> The naive concept is that when you queue_transaction, we put an item
> in a hash table in the Sequencer indicating a write is in flight for
> all oids referenced by the Transaction.  When we complete, we remove
> it from the hash map.  Probably unordered_map<ghobject_t,int> where
> it's just a counter.  Then, on read ops, we do a wait_for_object(oid)
> that waits for that count to be zero.  This is basically what the
> ECBackend-specific fix would need to do in order to get rid of the
> messages over the wire (and that trade-off is a no-brainer).

Could continuous, back-to-back writes to the same oid starve a waiting
reader here?

> For ReplicatedPG it's less obvious, since we already have per-object
> state in memory that is tracking in-flight requests.  The naive
> FileStore implementation adds a hash map insert and removal, plus some
> lookups for read ops, along with the allocation of the callback and
> the queueing and executing of the callback on I/O apply/completion.
> Not sure that's a win for the simple implementation.
>
> But: what if we make the tracking super lightweight and efficient?
> I'm thinking this, in the OpSequencer:
>
>   std::atomic<int> inflight_by_hash[128];
>
> where we hash the object id, increment the atomic counter in the
> appropriate slot, and then decrement it on completion.  That's
> basically zero overhead for tracking in the write path.  And for the
> read path, we just test the slot and (if it's 0) proceed.  If it's not
> zero, we take the OpSequencer lock and do a more traditional cond wait
> (the write path is already taking the qlock on completion, so it can
> signal as needed).
>
> This is probabilistic, meaning that if you have a lot of IO in flight
> on a sequencer (== PG) when you issue a read, you might wait for an
> unrelated object you don't need to, but with low probability.  We can
> tune the size of the counter vector to balance that probability (and
> long-tail read latencies in mixed read/write workloads) against memory
> usage (and, I guess, L1 misses that might slow down flash read
> workloads a bit).
>
> In return, we completely eliminate the ObjectStore onreadable callback
> infrastructure and all of the grossness in OSD that uses it.
>
> What do you think?
> sage
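
To make my question above concrete, here is roughly how I read the naive
per-oid scheme.  This is only a sketch: InflightTracker, start_write,
finish_write and wait_for_object are invented names, std::string stands
in for ghobject_t, and none of this is the real Sequencer interface.

  // Sketch only: invented names, std::string standing in for ghobject_t.
  #include <condition_variable>
  #include <mutex>
  #include <string>
  #include <unordered_map>

  struct InflightTracker {
    std::mutex lock;
    std::condition_variable cond;
    // oid -> number of writes currently in flight for that oid
    std::unordered_map<std::string, int> inflight;

    // queue_transaction path: called for each oid the Transaction touches.
    void start_write(const std::string& oid) {
      std::lock_guard<std::mutex> l(lock);
      ++inflight[oid];
    }

    // completion path: drop the count and wake readers when it hits zero.
    void finish_write(const std::string& oid) {
      std::lock_guard<std::mutex> l(lock);
      auto p = inflight.find(oid);
      if (p != inflight.end() && --p->second == 0) {
        inflight.erase(p);
        cond.notify_all();
      }
    }

    // read path: block until no write to this oid is in flight.
    void wait_for_object(const std::string& oid) {
      std::unique_lock<std::mutex> l(lock);
      cond.wait(l, [&] { return inflight.find(oid) == inflight.end(); });
    }
  };

With this shape, a steady stream of writes to one oid can keep the count
above zero, so a reader in wait_for_object() may wait indefinitely unless
completions drain faster than new submissions arrive.  That's what
prompted the starvation question.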
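
And here is my reading of the lightweight OpSequencer variant, again just
a sketch: apart from inflight_by_hash and qlock the names are invented,
and std::hash<std::string> stands in for hashing a ghobject_t.

  // Sketch of the hashed-slot variant; not actual OpSequencer code.
  #include <atomic>
  #include <condition_variable>
  #include <cstddef>
  #include <functional>
  #include <mutex>
  #include <string>

  struct OpSequencerSketch {
    static constexpr std::size_t NSLOTS = 128;
    std::atomic<int> inflight_by_hash[NSLOTS];
    std::mutex qlock;                // write path already takes this on completion
    std::condition_variable qcond;

    OpSequencerSketch() {
      for (auto& c : inflight_by_hash)
        c.store(0, std::memory_order_relaxed);
    }

    static std::size_t slot_of(const std::string& oid) {
      return std::hash<std::string>{}(oid) % NSLOTS;
    }

    // write submission: one atomic increment, no allocation, no callback.
    void queue_write(const std::string& oid) {
      inflight_by_hash[slot_of(oid)].fetch_add(1, std::memory_order_acq_rel);
    }

    // write completion: decrement, and signal only when the slot drains.
    void complete_write(const std::string& oid) {
      if (inflight_by_hash[slot_of(oid)].fetch_sub(
              1, std::memory_order_acq_rel) == 1) {
        std::lock_guard<std::mutex> l(qlock);
        qcond.notify_all();
      }
    }

    // read path: fast path is a single load; otherwise cond wait under qlock.
    void wait_for_object(const std::string& oid) {
      std::atomic<int>& slot = inflight_by_hash[slot_of(oid)];
      if (slot.load(std::memory_order_acquire) == 0)
        return;
      std::unique_lock<std::mutex> l(qlock);
      qcond.wait(l, [&] {
        return slot.load(std::memory_order_acquire) == 0;
      });
    }
  };

If that matches what you have in mind, the read fast path really is a
single atomic load and writes pay one increment and one decrement, so the
overhead does look close to zero.  The trade-offs are the false waits you
mention when an unrelated oid hashes into a busy slot, and the same
starvation question, now per slot rather than per oid.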