On 28/03/2018, Mark Nelson wrote:
> I sort of have semi-competing thoughts:
>
> 1) Maybe it makes sense that rocksdb should be able to determine that a
> given key is short lived and shouldn't make it into L0 at all but you still
> want to batch it in with a transaction to the WAL and archive the whole log
> as-is until tombstones for all remaining log entries are encountered.
> Basically the idea that I mentioned in the other reply. This arguably goes
> beyond Ceph and is more about how RocksDB treats short lived data. Our
> design more or less remains the same except that we tell rocksdb that some
> classes of keys are short lived (assuming that functionality could be added
> to rocksdb).
>
> 2) It sure feels like conceptually the pglog should be represented as a
> per-pg ring buffer rather than key/value data. Maybe there are really
> important reasons that it shouldn't be, but I don't currently see them. As
> far as the objectstore is concerned, it seems to me like there are valid
> reasons to provide some kind of log interface and perhaps that should be
> used for pg_log. That sort of opens the door for different object store
> implementations fulfilling that functionality in whatever ways the author
> deems fit.

Of these two competing thoughts, I firmly believe that Thought 2 should
kill and eat Thought 1. Given that SeaStore, or whatever the NVMe-optimized
store ends up being, won't even use RocksDB, we definitely don't want to
depend on RocksDB behavior in the long term.

Also, I'm with you that it makes sense, intuitively, to have an explicit
concept of a 'log' that the ObjectStore is responsible for keeping track of.
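
To make the "per-pg ring buffer" idea concrete, here is a rough sketch of
what such a log could look like in memory. Everything in it (LogEntry,
PGLogRing, append, trim) is a hypothetical name made up for illustration,
not an existing Ceph or ObjectStore interface, and it assumes entries are
appended in increasing version order.

```cpp
// Hypothetical sketch only: none of these names exist in Ceph or RocksDB.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One log record; 'version' stands in for something like a pg log version.
struct LogEntry {
  std::uint64_t version = 0;
  std::string payload;
};

// Fixed-capacity ring buffer. Appending to a full ring overwrites the
// oldest entry in place, so dropping old entries needs no tombstones.
class PGLogRing {
public:
  explicit PGLogRing(std::size_t capacity) : slots_(capacity) {
    assert(capacity > 0);
  }

  // Append an entry, overwriting the oldest one when the ring is full.
  void append(LogEntry e) {
    slots_[(head_ + size_) % slots_.size()] = std::move(e);
    if (size_ < slots_.size()) {
      ++size_;
    } else {
      head_ = (head_ + 1) % slots_.size();  // oldest entry was overwritten
    }
  }

  // Explicit trim: drop every entry with version <= upto.
  // Assumes versions were appended in increasing order.
  void trim(std::uint64_t upto) {
    while (size_ > 0 && slots_[head_].version <= upto) {
      head_ = (head_ + 1) % slots_.size();
      --size_;
    }
  }

  // Oldest retained entry, if any.
  std::optional<LogEntry> oldest() const {
    if (size_ == 0) return std::nullopt;
    return slots_[head_];
  }

  // Most recently appended entry, if any.
  std::optional<LogEntry> newest() const {
    if (size_ == 0) return std::nullopt;
    return slots_[(head_ + size_ - 1) % slots_.size()];
  }

  std::size_t size() const { return size_; }

private:
  std::vector<LogEntry> slots_;
  std::size_t head_ = 0;  // index of the oldest entry
  std::size_t size_ = 0;  // number of live entries
};
```

A real ObjectStore-level log would of course have to be persistent and
crash-safe; the point is only that append and trim are the whole interface,
and each backend could implement them however it sees fit.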
--
Senior Software Engineer
Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9