On 2017/10/2 1:34 AM, Michael Lyle wrote:
> On Sun, Oct 1, 2017 at 10:23 AM, Coly Li <i@xxxxxxx> wrote:
>> Hi Mike,
>>
>> Your data set is too small. Normally bcache users I talk with, they use
>> bcache for distributed storage cluster or commercial data base, their
>> cache device is large and fast. It is possible we see different I/O
>> behaviors because we use different configurations.
>
> A small dataset is sufficient to tell whether the I/O subsystem is
> successfully aggregating sequential writes or not. :P It doesn't
> matter whether the test is 10 minutes or 10 hours... The writeback
> stuff walks the data in order. :P

Hi Mike,

I have been testing your patches 4 and 5 over the past few days. It turns
out that they work better when the cache device is full of dirty data,
and I can say they work perfectly when the cache device is full of dirty
data and the backing cached device is a single spinning hard disk.

This is not because your data set is full; it is because, when your data
set is small, the cache can get close to the full state, so adjacent
dirty data blocks are more likely to be available for writeback.

In the best case for your patches 4 and 5 (cache device full of dirty
data, cached device a single spinning hard disk), writeback can be 2x
faster when there is no front-end I/O. See one of the performance
results (lower is better):
http://blog.coly.li/wp-content/uploads/2017/10/existing_dirty_data_on_cache_single_disk_1T_full_cache.png

As the backing cached device gets faster, your patches 4 and 5 show less
and less advantage.

For the same backing cached device size, as the cache device gets
smaller, your patches 4 and 5 also show less and less advantage.

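As a rough illustration of why the ratio of dirty data to backing device
size matters, here is a toy model, not bcache code. It assumes dirty
blocks are scattered uniformly at random over the backing device, which
real workloads will not do exactly, and counts how many dirty blocks
have a dirty successor. The function name and the block counts are
hypothetical.

```python
import random


def contiguous_fraction(backing_blocks, dirty_blocks, seed=0):
    """Fraction of dirty blocks whose next block on the backing device
    is also dirty, with dirty blocks placed uniformly at random
    (an assumption of this toy model)."""
    rng = random.Random(seed)
    dirty = set(rng.sample(range(backing_blocks), dirty_blocks))
    # A dirty block can be merged with its successor only if the
    # successor is dirty as well.
    adjacent = sum(1 for block in dirty if block + 1 in dirty)
    return adjacent / dirty_blocks


if __name__ == "__main__":
    backing = 100_000  # hypothetical backing device size, in blocks
    for ratio in (0.05, 0.25, 0.50, 0.90):
        frac = contiguous_fraction(backing, int(backing * ratio))
        print(f"dirty/backing = {ratio:.2f}: "
              f"{frac:.2f} of dirty blocks have a dirty successor")
```

Under this assumption the fraction grows roughly in proportion to the
dirty/backing ratio, which fits the trend above: the less dirty data
there is relative to the backing device, the fewer adjacent blocks there
are to reorder.
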
In the following configuration, though, I find the current bcache code
performs slightly better than the reorder method of your patches 4 and 5:
- cached device: an md linear device combining 2x 1.8T hard disks
- cache device: a 1800G NVMe SSD
- fio random write, block size 512K
- dirty data occupies 50% of the cache device (900G of 1800G)

One of the performance results can be found here:
http://blog.coly.li/wp-content/uploads/2017/10/existing_dirty_data_on_ssd_900_1800G_cache_half.png

>
> ***We are measuring whether the cache and I/O scheduler can correctly
> order up-to-64-outstanding writebacks from a chunk of 500 dirty
> extents-- we do not need to do 12 hours of writes first to measure
> this.***
>
> It's important that there be actual contiguous data, though, or the
> difference will be less significant. If you write too much, there
> will be a lot more holes in the data from writeback during the test
> and from writes bypassing the cache.
>

I see, your patches do perform better when the dirty data is contiguous
on the SSD. But we should know how likely this assumption is to hold in
the real world, especially since in some cases your patches make
writeback slower than the current bcache code does.

To test your patches, the following backing devices are used:
- md raid5 device composed of 4 hard disks
- md linear device composed of 2 hard disks
- md raid0 device composed of 4 hard disks
- single 250G SATA SSD
- single 1.8T hard disk

And the following cache devices are used:
- 3.8T NVMe SSD
- 1.8T NVMe SSD partition
- 232G NVMe SSD partition

> Having all the data to writeback be sequential is an
> artificial/synthetic condition that allows the difference to be
> measured more easily. It's about a 2x difference under these
> conditions in my test environment. I expect with real data that is
> not purely sequential it's more like a few percent.

From my tests, it seems your patches 4 and 5 work better in the
following situation:
1)   the backing cached device is slow, and
2.1) the ratio (cache_device_size/cached_device_size) is high, which
     means dirty data on the cache device is more likely to be
     contiguous,
or
2.2) the cache device is almost full of dirty data.

This is what I have observed in these days of testing. I will continue
testing tomorrow to find at what percentage of dirty data on the cache
device the current bcache code starts to perform worse than your
patches. So far I see that when the cache is 50% full and the cache
device is half the size of the cached device, your patches 4 and 5 have
no performance advantage in my testing environment.

All the performance comparison png files are too big. Once I finish the
final benchmark, I will combine them into a pdf file and share a link.

Thanks.

--
Coly Li