Re: [PATCH v2] bcache: allow allocator to invalidate bucket in gc

Hi all,

We found this patch via Google.

We have a setup that uses bcache to cache network-attached storage on a local SSD. Under heavy traffic, IO on the cached device stalls every hour or so for tens of seconds. When we track latency continuously with the "fio" utility, we can see the max IO latency shoot up when a stall happens. Here is the fio output before and after a stall:

latency_test: (groupid=0, jobs=1): err= 0: pid=50416: Fri Mar 15 21:14:18 2024
  read: IOPS=62.3k, BW=486MiB/s (510MB/s)(11.4GiB/24000msec)
    slat (nsec): min=1377, max=98964, avg=4567.31, stdev=1330.69
    clat (nsec): min=367, max=43682, avg=429.77, stdev=234.70
     lat (nsec): min=1866, max=105301, avg=5068.60, stdev=1383.14
    clat percentiles (nsec):
     |  1.00th=[  386],  5.00th=[  406], 10.00th=[  406], 20.00th=[  410],
     | 30.00th=[  414], 40.00th=[  414], 50.00th=[  414], 60.00th=[  418],
     | 70.00th=[  418], 80.00th=[  422], 90.00th=[  426], 95.00th=[  462],
     | 99.00th=[  652], 99.50th=[  708], 99.90th=[ 3088], 99.95th=[ 5600],
     | 99.99th=[11328]
   bw (  KiB/s): min=318192, max=627591, per=99.97%, avg=497939.04, stdev=81923.63, samples=47
   iops        : min=39774, max=78448, avg=62242.15, stdev=10240.39, samples=47
...
 
<IO stall>

latency_test: (groupid=0, jobs=1): err= 0: pid=50416: Fri Mar 15 21:21:23 2024
  read: IOPS=26.0k, BW=203MiB/s (213MB/s)(89.1GiB/448867msec)
    slat (nsec): min=958, max=40745M, avg=15596.66, stdev=13650543.09
    clat (nsec): min=364, max=104599, avg=435.81, stdev=302.81
     lat (nsec): min=1416, max=40745M, avg=16104.06, stdev=13650546.77
    clat percentiles (nsec):
     |  1.00th=[  378],  5.00th=[  390], 10.00th=[  406], 20.00th=[  410],
     | 30.00th=[  414], 40.00th=[  414], 50.00th=[  418], 60.00th=[  418],
     | 70.00th=[  418], 80.00th=[  422], 90.00th=[  426], 95.00th=[  494],
     | 99.00th=[  772], 99.50th=[  916], 99.90th=[ 3856], 99.95th=[ 5920],
     | 99.99th=[10816]
   bw (  KiB/s): min=    1, max=627591, per=100.00%, avg=244393.77, stdev=103534.74, samples=765
   iops        : min=    0, max=78448, avg=30549.06, stdev=12941.82, samples=765

When we track per-second max latency in fio, we see something like this:

<time-ms>, <max-latency-ns>, <direction: 0=read, 1=write>, <block-size>, <offset>
...
777000, 5155548, 0, 0, 0
778000, 105551, 1, 0, 0
802615, 24276019570, 0, 0, 0
802615, 82134, 1, 0, 0
804000, 9944554, 0, 0, 0
805000, 7424638, 1, 0, 0
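
To pick stall samples out of a log like the one above, something along these lines may help. It is only a minimal sketch: the function name `find_stalls` and the 1-second cutoff are our own choices for illustration, and it assumes every data line has the "time, latency, direction, block size, offset" layout shown here.

```python
# Minimal sketch: flag stall samples in an fio latency log.
# Assumes each data line is "time_ms, latency_ns, direction, block_size, offset".

STALL_THRESHOLD_NS = 1_000_000_000  # hypothetical cutoff: 1 second

# fio data-direction codes in latency logs.
DIRECTIONS = {0: "read", 1: "write", 2: "trim"}


def find_stalls(lines, threshold_ns=STALL_THRESHOLD_NS):
    """Return (time_ms, latency_seconds, direction) for samples at or above the threshold."""
    stalls = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        # Skip headers, "..." markers, and blank or malformed lines.
        if len(fields) < 3 or not all(f.isdigit() for f in fields[:3]):
            continue
        time_ms, latency_ns, direction = (int(f) for f in fields[:3])
        if latency_ns >= threshold_ns:
            name = DIRECTIONS.get(direction, "unknown")
            stalls.append((time_ms, latency_ns / 1e9, name))
    return stalls


# The per-second samples quoted in this mail.
sample = """\
777000, 5155548, 0, 0, 0
778000, 105551, 1, 0, 0
802615, 24276019570, 0, 0, 0
802615, 82134, 1, 0, 0
804000, 9944554, 0, 0, 0
805000, 7424638, 1, 0, 0
"""

for time_ms, seconds, direction in find_stalls(sample.splitlines()):
    print(f"t={time_ms} ms: {direction} max latency {seconds:.2f} s")
```

On the samples quoted in this mail, only the read entry at 802615 ms (about 24.28 s) crosses the cutoff.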

The fio command line we used:

fio --time_based --runtime=3600s --ramp_time=2s --ioengine=libaio --name=latency_test --filename=fio --bs=8k --iodepth=1 --size=900G  --readwrite=randrw --verify=0 --filename=fio --write_lat_log=lat --log_avg_msec=1000 --log_max_value=1
 
We saw a similar issue reported in https://www.spinics.net/lists/linux-bcache/msg09578.html, which suggests a problem in garbage collection. When we trigger GC manually via "echo 1 > /sys/fs/bcache/a356bdb0-...-64f794387488/internal/trigger_gc", the stall is always reproduced. That thread points to this patch (https://www.spinics.net/lists/linux-bcache/msg08870.html). We tested it, and with the patch applied the stall no longer happens.
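
For anyone who wants to script the reproduction, the manual trigger can be wrapped roughly as follows. This is only a sketch: the helper name `trigger_gc` and its `sysfs_root` parameter are hypothetical, the cache set UUID has to be supplied by the caller, and writing to the real sysfs attribute requires root.

```python
from pathlib import Path


def trigger_gc(cset_uuid, sysfs_root="/sys/fs/bcache"):
    """Write "1" to <sysfs_root>/<cset_uuid>/internal/trigger_gc and return the path.

    Equivalent to: echo 1 > /sys/fs/bcache/<cset_uuid>/internal/trigger_gc
    """
    path = Path(sysfs_root) / cset_uuid / "internal" / "trigger_gc"
    if not path.exists():
        # Wrong UUID, or bcache is not exposing this attribute on this system.
        raise FileNotFoundError(f"no trigger_gc attribute at {path}")
    path.write_text("1\n")
    return path


# Example call (hypothetical UUID placeholder, run as root):
#   trigger_gc("<cache-set-uuid>")
```

Running this while the fio job above is logging should make it easy to line up the GC start time with the latency spike.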

AFAIK, this patch marks buckets as reclaimable at the beginning of GC, which unblocks the allocator so that it does not need to wait for GC to finish. This periodic stall is a serious issue for us. Could the community take a look at this issue and this patch?

We are running Linux kernel versions 5.10 and 6.1.

Thank you.




