Re: [PATCH v2] bcache: allow allocator to invalidate bucket in gc

> On Mar 17, 2024, at 13:41, Robert Pang <robertpang@xxxxxxxxxx> wrote:
> 
> Hi Coly
> 

Hi Robert,

> Thank you for looking into this issue.
> 
> We tested this patch on 5 machines with local SSDs ranging from
> 375 GB to 9 TB, and ran tests for 10 to 12 hours on each. We observed
> no stalls or other issues. Performance was comparable before and
> after the patch. Hope this information is helpful.

Thanks for the information.

Also, I was told this patch has been deployed and shipped in easystack products for more than a year, and it works well.

The above information makes me feel confident about this patch. I will submit it in the next merge window once my extended testing loop passes.

Coly Li


> 
> 
> On Fri, Mar 15, 2024 at 7:49 PM Coly Li <colyli@xxxxxxx> wrote:
>> 
>> Hi Robert,
>> 
>> Thanks for your email.
>> 
>>> On Mar 16, 2024, at 06:45, Robert Pang <robertpang@xxxxxxxxxx> wrote:
>>> 
>>> Hi all
>>> 
>>> We found this patch via Google.
>>> 
>>> We have a setup that uses bcache to cache network-attached storage on a local SSD drive. Under heavy traffic, IO on the cached device stalls every hour or so for tens of seconds. When we track latency continuously with the "fio" utility, we can see the max IO latency shoot up when a stall happens:
>>> 
>>> latency_test: (groupid=0, jobs=1): err= 0: pid=50416: Fri Mar 15 21:14:18 2024
>>> read: IOPS=62.3k, BW=486MiB/s (510MB/s)(11.4GiB/24000msec)
>>>   slat (nsec): min=1377, max=98964, avg=4567.31, stdev=1330.69
>>>   clat (nsec): min=367, max=43682, avg=429.77, stdev=234.70
>>>    lat (nsec): min=1866, max=105301, avg=5068.60, stdev=1383.14
>>>   clat percentiles (nsec):
>>>    |  1.00th=[  386],  5.00th=[  406], 10.00th=[  406], 20.00th=[  410],
>>>    | 30.00th=[  414], 40.00th=[  414], 50.00th=[  414], 60.00th=[  418],
>>>    | 70.00th=[  418], 80.00th=[  422], 90.00th=[  426], 95.00th=[  462],
>>>    | 99.00th=[  652], 99.50th=[  708], 99.90th=[ 3088], 99.95th=[ 5600],
>>>    | 99.99th=[11328]
>>>  bw (  KiB/s): min=318192, max=627591, per=99.97%, avg=497939.04, stdev=81923.63, samples=47
>>>  iops        : min=39774, max=78448, avg=62242.15, stdev=10240.39, samples=47
>>> ...
>>> 
>>> <IO stall>
>>> 
>>> latency_test: (groupid=0, jobs=1): err= 0: pid=50416: Fri Mar 15 21:21:23 2024
>>> read: IOPS=26.0k, BW=203MiB/s (213MB/s)(89.1GiB/448867msec)
>>>   slat (nsec): min=958, max=40745M, avg=15596.66, stdev=13650543.09
>>>   clat (nsec): min=364, max=104599, avg=435.81, stdev=302.81
>>>    lat (nsec): min=1416, max=40745M, avg=16104.06, stdev=13650546.77
>>>   clat percentiles (nsec):
>>>    |  1.00th=[  378],  5.00th=[  390], 10.00th=[  406], 20.00th=[  410],
>>>    | 30.00th=[  414], 40.00th=[  414], 50.00th=[  418], 60.00th=[  418],
>>>    | 70.00th=[  418], 80.00th=[  422], 90.00th=[  426], 95.00th=[  494],
>>>    | 99.00th=[  772], 99.50th=[  916], 99.90th=[ 3856], 99.95th=[ 5920],
>>>    | 99.99th=[10816]
>>>  bw (  KiB/s): min=    1, max=627591, per=100.00%, avg=244393.77, stdev=103534.74, samples=765
>>>  iops        : min=    0, max=78448, avg=30549.06, stdev=12941.82, samples=765
>>> 
>>> When we track per-second max latency in fio, we see something like this:
>>> 
>>> <time-ms>,<max-latency-ns>,,,
>>> ...
>>> 777000, 5155548, 0, 0, 0
>>> 778000, 105551, 1, 0, 0
>>> 802615, 24276019570, 0, 0, 0
>>> 802615, 82134, 1, 0, 0
>>> 804000, 9944554, 0, 0, 0
>>> 805000, 7424638, 1, 0, 0
>>> 
>>> fio --time_based --runtime=3600s --ramp_time=2s --ioengine=libaio --name=latency_test --filename=fio --bs=8k --iodepth=1 --size=900G  --readwrite=randrw --verify=0 --filename=fio --write_lat_log=lat --log_avg_msec=1000 --log_max_value=1
>>> 
>>> We found a similar issue reported in https://www.spinics.net/lists/linux-bcache/msg09578.html, which suggests a problem in garbage collection. When we trigger GC manually via "echo 1 > /sys/fs/bcache/a356bdb0-...-64f794387488/internal/trigger_gc", the stall is always reproduced. That thread points to this patch (https://www.spinics.net/lists/linux-bcache/msg08870.html); we tested it, and the stall no longer happens.
>>> 
>>> AFAIK, this patch marks buckets reclaimable at the beginning of GC to unblock the allocator, so it does not need to wait for GC to finish. This periodic stall is a serious issue. Could the community look into this issue and this patch if possible?
>>> 
>> 
>> Could you please share more performance information about this patch? How many nodes, and how long a period, does the testing cover so far?
>> 
>> Last time I tested the patch, it looked fine. But I was not confident about how large a scale and for how long the patch had been tested. Any further testing information you can provide will be helpful.
>> 
>> 
>> Coly Li
> 
