Hello,

On Sun, 4 Aug 2019 06:34:46 -0500 Mark Nelson wrote:
> On 8/4/19 6:09 AM, Paul Emmerich wrote:
> > On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >> 2. Bluestore caching still broken
> >> When writing data with the fios below, it isn't cached on the OSDs.
> >> Worse, existing cached data that gets overwritten is removed from the
> >> cache, which while of course correct can't be free in terms of
> >> allocation overhead.
> >> Why not do what any sensible person would expect from experience with
> >> any other cache there is: cache writes in case the data gets read
> >> again soon, and in case of overwrites reuse the existing allocations.
> >
> > This is by design.
> > BlueStore only populates its cache on reads, not on writes. The idea
> > is that a reasonable application does not read data it just wrote (and
> > if it does, it's already cached at a higher layer like the page cache
> > or a cache on the hypervisor).
>
> Note that this behavior can be changed by setting
> bluestore_default_buffered_write = true.

Thanks to Mark for his detailed reply.

Given these points, I assume that with HDD-backed (but SSD WAL/DB) OSDs
it's not actually a performance killer? I'll test that of course, but a
gut feeling or ballpark figure would be appreciated by probably more
people than just me.

As for Paul's argument, I'm not buying it, because:

- It's a complete paradigm change compared to filestore. Somebody
  migrating from FS to BS is likely to experience yet another performance
  decrease they didn't expect.
- Arguing for larger caches on the client only increases the cost of Ceph
  further. In that vein, BS currently can't utilize as much memory as FS
  did for caching in a safe manner.
- Use cases like a DB with enough caching to handle the normal working
  set, but doing some hourly crunching on data that exceeds it, come to
  mind.
  One application here also processes written data once an hour, more
  than would fit in the VM pagecache, but it currently comes from the FS
  pagecache.
- Overwrites of already cached data _clearly_ indicate hotness, so that
  data should preferably be cached. That bit in particular is upsetting,
  initial write caching or not.

Regards,

Christian

> FWIW, there's also a CPU usage and lock contention penalty for default
> buffered write when using extremely fast flash storage. A lot of my
> recent work on improving cache performance and intelligence in bluestore
> is to reduce contention in the onode/buffer cache and also significantly
> reduce the impact of default buffered write = true. The
> PriorityCacheManager was a big one to do a better job of autotuning.
> Another big one that recently merged was refactoring bluestore's caches
> to trim on write (better memory behavior; shorter, more frequent trims;
> trims distributed across threads) and to not share a single lock between
> the onode and buffer cache:
>
> https://github.com/ceph/ceph/pull/28597
>
> Ones still coming down the pipe are avoiding double-caching onodes in
> the bluestore onode cache and the rocksdb block cache, and age-binning
> the LRU caches to better redistribute memory between caches based on
> relative age. This is the piece that would hopefully let you cache on
> write while still having the priority of those cached writes quickly
> fall off if they are never read back (the more cache you have, the more
> effective this would be at keeping the onode/omap ratios relatively
> higher).
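To make the two policies being debated concrete, here's a toy model in
plain Python (nothing from the Ceph tree; the class and parameter names
are made up for illustration). By default a write invalidates any cached
copy and only a read populates the cache; flipping the flag makes writes
populate it too, which is roughly what bluestore_default_buffered_write
toggles:

```python
# Toy sketch of BlueStore's buffer-cache policy. NOT Ceph code;
# ToyBufferCache and buffered_write are invented names for illustration.

class ToyBufferCache:
    def __init__(self, buffered_write=False):
        self.buffered_write = buffered_write  # analogous to bluestore_default_buffered_write
        self.cache = {}        # key -> cached data
        self.backend = {}      # simulated slow storage
        self.backend_reads = 0

    def write(self, key, data):
        self.backend[key] = data
        if self.buffered_write:
            self.cache[key] = data     # keep freshly written data hot
        else:
            self.cache.pop(key, None)  # default: an overwrite invalidates

    def read(self, key):
        if key in self.cache:
            return self.cache[key]     # cache hit
        self.backend_reads += 1        # miss: go to "disk"
        data = self.backend[key]
        self.cache[key] = data         # reads always populate the cache
        return data

# Default policy: the first read-back of just-written data misses.
default = ToyBufferCache()
default.write("a", b"v1")
default.read("a")
assert default.backend_reads == 1

# Buffered-write policy: the read-back is served from cache.
buffered = ToyBufferCache(buffered_write=True)
buffered.write("a", b"v1")
buffered.read("a")
assert buffered.backend_reads == 0
```

In a real cluster the option would go in the [osd] section of ceph.conf
(or be set at runtime via the config machinery); check the docs for your
release before relying on it, given the contention costs Mark describes
above.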
> Mark

> > Paul

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Mobile Inc.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com