> Il giorno 26 mar 2018, alle ore 12:28, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto: > > > > On 26.03.2018 11:01, Paolo Valente wrote: >>> Il giorno 21 mar 2018, alle ore 00:49, Paolo Valente <paolo.valente@xxxxxxxxxx> ha scritto: >>> >>> >>> >>>> Il giorno 20 mar 2018, alle ore 15:41, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto: >>>> >>>> On 20.03.2018 06:00, Paolo Valente wrote: >>>>>> Il giorno 19 mar 2018, alle ore 14:28, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto: >>>>>> >>>>>> On 19.03.2018 09:03, Paolo Valente wrote: >>>>>>>> Il giorno 05 mar 2018, alle ore 04:48, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto: >>>>>>>> >>>>>>>> Rate should never overflow or become zero because it is used as divider. >>>>>>>> This patch accumulates it with saturation. >>>>>>>> >>>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> >>>>>>>> --- >>>>>>>> block/bfq-iosched.c | 8 +++++--- >>>>>>>> 1 file changed, 5 insertions(+), 3 deletions(-) >>>>>>>> >>>>>>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c >>>>>>>> index aeca22d91101..a236c8d541b5 100644 >>>>>>>> --- a/block/bfq-iosched.c >>>>>>>> +++ b/block/bfq-iosched.c >>>>>>>> @@ -2546,7 +2546,8 @@ static void bfq_reset_rate_computation(struct bfq_data *bfqd, >>>>>>>> >>>>>>>> static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq) >>>>>>>> { >>>>>>>> - u32 rate, weight, divisor; >>>>>>>> + u32 weight, divisor; >>>>>>>> + u64 rate; >>>>>>>> >>>>>>>> /* >>>>>>>> * For the convergence property to hold (see comments on >>>>>>>> @@ -2634,9 +2635,10 @@ static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq) >>>>>>>> */ >>>>>>>> bfqd->peak_rate *= divisor-1; >>>>>>>> bfqd->peak_rate /= divisor; >>>>>>>> - rate /= divisor; /* smoothing constant alpha = 1/divisor */ >>>>>>>> + do_div(rate, divisor); /* smoothing constant alpha = 1/divisor */ >>>>>>>> >>>>>>>> - bfqd->peak_rate += rate; >>>>>>>> + /* rate should never overlow or become zero */ >>>>>>> It is bfqd->peak_rate that is used as a divider, and bfqd->peak_rate doesn't risk to be zero even if the variable 'rate' is zero here. >>>>>>> So I guess the reason why you consider the possibility that bfqd->peak_rate becomes zero is because of an overflow when summing 'rate'. But, according to my calculations, this should be impossible with devices with sensible speeds. >>>>>>> These are the reasons why I decided I could make it with a 32-bit variable, without any additional clamping. Did I make any mistake in my evaluation? >>>>>> >>>>>> According to Murphy's law this is inevitable.. >>>>>> >>>>> Yep. Actually Murphy has been even clement this time, by making the >>>>> failure occur to a kernel expert :) >>>>>> I've seen couple division by zero crashes in bfq_wr_duration. >>>>>> Unfortunately logs weren't recorded. >>>>>> >>>>>>> Anyway, even if I made some mistake about the maximum possible value of the device rate, and the latter may be too high for bfqd->peak_rate to contain it, then I guess the right solution would not be to clamp the actual rate to U32_MAX, but to move bfqd->peak_rate to 64 bits. Or am I missing something else? >>>>>>>>> + bfqd->peak_rate = clamp_t(u64, rate + bfqd->peak_rate, 1, U32_MAX); >>>>>> >>>>>> 32-bit should be enough and better for division. >>>>>> My patch makes sure it never overflows/underflows. >>>>>> That's cheaper than full 64-bit/64-bit division. >>>>>> Anyway 64-bit speed could overflow too. =) >>>>>> >>>>> I see your point. Still, if the mistake is not in sizing, then you >>>>> bumped into some odd bug. In this respect, I don't like much the idea >>>>> of sweeping the dust under the carpet. So, let me ask you for a >>>>> little bit more help. With your patch applied, and thus with no risk >>>>> of crashes, what about adding, right before your clamp_t, something >>>>> like: >>>>> if (!bfqd->peak_rate) >>>>> pr_crit(<dump of all the variables involved in updating bfqd->peak_rate>); >>>>> Once the failure shows up (Murphy permitting), we might have hints to >>>>> the bug causing it. >>>>> Apart from that, I have no problem with patches that make bfq more >>>>> robust, even in a sort of black-box way. >>>> >>>> This rate estimator is already works as a black-box with smoothing and >>>> low-pass filter inside. >>> >>> Actually, it is just what you say: a low-pass filter, which works by >>> deflating two non-null quantities (the previous and the current rate >>> rate), by and adding these quantities to each other (to get the new >>> estimated rate). In addition to being larger than 0, both quantities >>> are much lower than U32_MAX, so this computation should never yield >>> zero because of an overflow. >>> >>> Even some occasional zeros read for the sampled rate (because of the >>> suspicion you express below), cannot lead to a null estimated rate. >>> In fact, this is a low-pass filter, meant to cut off occasional >>> spurious values. >>> >>> I'm insisting on this part, just because, maybe, you say that it "acts >>> as a black-box" because there is something more, which I'm overlooking >>> here, and maybe the bug is exactly in the part I'm overlooking. But, >>> if this is not the case, I'll just stop now for this point. >>> >>>> It has limitations thus this is ok to declare >>>> range of expected\sane results. >>>> >>> >>> Correct, but if the odd value you cut off is a random value that >>> follows from a bug (and I see no other possibility, in view of the >>> above arguments), then the result may still be completely wrong. For >>> example: >>> - the device may have a rate of, say, 200 MB/s >>> - the computation of the new estimated rate yields 0 >>> - you limit this result to ~0 MB/s but >0 MB/s; this avoids division >>> errors, but still leaves you with a completely wrong rate, and a >>> consequently inconsistent behavior of the scheduler >>> >>>> It would be nice to show estimated rate in sysfs to let user >>>> diagnose whenever it is wrong for a long period of time. >>> >>> Absolutely. This is actually already shown indirectly, through the >>> budget_max parameter, which is the number of sectors served at peak >>> rate during slice_sync. >>> >>>> Printing message all the time even ratelimited is a wrong idea. >>>> If this should never happens - WARN_ONCE is more than enough. >>>> >>> >>> I'm sorry, I was totally unclear here. My proposal for you is: >>> since this failure does occur in your system, please add, temporarily, the >>> printing of all involved values *all the times*, so that you can track down >>> the bug, and fix the actual problem leading to this division by zero. >>> >>> What do you think? >>> >>>> I suspect crashes might be caused by bumpy timer which was affected by smi/nmi from mce. >>> >>> Yep. The problem is that, for the estimated rate to go to zero, a >>> null value must be sampled for several consecutive times. >>> >>>> I'll try to find evidence. >>> >>> Getting to the bottom of this would be really great. >>> >>> Looking forward to your findings, >> I guess no news yet. Anyway, your bug report has pushed me to think >> over the representation of the rate. And I may have found a problem, >> which might be the cause of the zeros you have seen. >> With a little math, it comes out that, for a sampled rate below about >> 5 KB/s, 0 is stored in the variable 'rate'. The reason is that such >> low values cannot be represented with the maximum precision currently >> available in BFQ for rates. So, if these very low values are sampled >> for a few consecutive times, then the estimated peak rate may actually >> become zero. >> In the system in which you have seen null estimated rates, may the >> virtual or physical storage device be occasionally so slow? > > I've found couple crashes from qemu-kvm virtual machines. > Their disks are slow for sure. Zero rate for a minute wouldn't be a surprise. > Ok, looks like we are converging. Hoping that it is ok for you, I took the liberty to prepare a patch that only lower-bounds the estimated rate to 1, and that, to prevent future problems of this kind, adds comments to report the range of peak rates representable. I have added you as signer of the patch, being this new patch just a subset of yours. If this patch is the right solution to the failure you detected, then the latter should just go away. Thank you, Paolo > <4>[125253.240885] divide error: 0000 [#1] SMP PTI > <4>[125253.241031] Modules linked in: netconsole configfs tcp_diag inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc 8021q garp mrp stp llc drm_kms_helper syscopyarea sysfillrect sysimgblt snd_hda_codec_generic fb_sys_fops snd_hda_intel ttm drm snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore i2c_piix4 ppdev kvm_intel kvm irqbypass input_leds parport_pc fuse lp parport xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor > <4>[125253.242257] async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ata_generic pata_acpi floppy [last unloaded: netconsole] > <4>[125253.242537] CPU: 0 PID: 36766 Comm: push-client Not tainted 4.14.15-5 #1 > <4>[125253.242731] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 > <4>[125253.242928] task: ffff880210ed3500 task.stack: ffffc90002d78000 > <4>[125253.243092] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40 > <4>[125253.243240] RSP: 0018:ffffc90002d7b9e0 EFLAGS: 00010046 > <4>[125253.243357] RAX: 00000000000d59f8 RBX: ffff88023014af00 RCX: 000000000000001e > <4>[125253.243567] RDX: 0000000000000000 RSI: ffff8802302685a0 RDI: 0000000000000000 > <4>[125253.243785] RBP: ffff880230268400 R08: 0000000000000001 R09: 0000000000000001 > <4>[125253.244004] R10: ffff8802305341b0 R11: ffff880230268400 R12: ffff88023066cb60 > <4>[125253.244222] R13: 0000000000000000 R14: ffff8802305341b0 R15: ffffc90002d7ba88 > <4>[125253.244489] FS: 00007f74d59e6700(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 > <4>[125253.244784] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > <4>[125253.245073] CR2: 00007f78e7261258 CR3: 000000021275e000 CR4: 00000000000006f0 > <4>[125253.245406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > <4>[125253.245696] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > <4>[125253.245961] Call Trace: > <4>[125253.246097] bfq_insert_requests+0xda7/0xf60 > <4>[125253.246313] blk_mq_sched_insert_request+0xf9/0x150 > <4>[125253.246520] blk_mq_make_request+0x1da/0x630 > <4>[125253.246692] generic_make_request+0xfb/0x2d0 > <4>[125253.246847] ? submit_bio+0x73/0x150 > <4>[125253.246984] submit_bio+0x73/0x150 > <4>[125253.247229] xfs_submit_ioend+0x80/0x1a0 [xfs] > <4>[125253.247466] xfs_do_writepage+0x4d7/0x710 [xfs] > <4>[125253.247636] ? invalid_page_referenced_vma+0x90/0x90 > <4>[125253.247906] write_cache_pages+0x23a/0x4e0 > <4>[125253.248166] ? xfs_add_to_ioend+0x240/0x240 [xfs] > <4>[125253.248345] ? dequeue_entity+0x2f4/0xa90 > <4>[125253.248596] xfs_vm_writepages+0x94/0xd0 [xfs] > <4>[125253.248793] do_writepages+0x17/0x70 > <4>[125253.249020] __filemap_fdatawrite_range+0xa7/0xe0 > <4>[125253.249269] file_write_and_wait_range+0x4c/0xa0 > <4>[125253.249514] xfs_file_fsync+0x55/0x220 [xfs] > <4>[125253.249719] do_fsync+0x38/0x60 > <4>[125253.249877] SyS_fsync+0xc/0x10 > <4>[125253.250098] entry_SYSCALL_64_fastpath+0x20/0x83 > <4>[125253.250285] RIP: 0033:0x7f74e92a6a7d > <4>[125253.250484] RSP: 002b:00007f74d59dcc90 EFLAGS: 00000293 > <4>[125253.250485] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 0f 48 > <1>[125253.251304] RIP: bfq_wr_duration.isra.15.part.16+0xc/0x40 RSP: ffffc90002d7b9e0 > > And this one definitely have timer issues. > > <4>[69620.113311] hpet1: lost 1123 rtc interrupts > <4>[69725.670214] hpet1: lost 1257 rtc interrupts > <4>[69730.301092] hpet1: lost 295 rtc interrupts > <4>[69785.817074] hpet1: lost 804 rtc interrupts > <4>[70276.099325] hpet1: lost 1138 rtc interrupts > <4>[70338.962431] hpet1: lost 1274 rtc interrupts > <4>[70687.531960] hpet1: lost 314 rtc interrupts > <4>[70708.789681] hpet1: lost 1359 rtc interrupts > <4>[70892.699792] hpet1: lost 775 rtc interrupts > <4>[70985.631630] hpet1: lost 448 rtc interrupts > <4>[71045.952024] hpet1: lost 1109 rtc interrupts > <4>[71236.781185] hpet1: lost 1216 rtc interrupts > <4>[71253.521384] hpet1: lost 1071 rtc interrupts > <4>[71477.087387] hpet1: lost 563 rtc interrupts > <4>[71568.587212] hpet1: lost 357 rtc interrupts > <4>[71837.330897] hpet1: lost 705 rtc interrupts > <4>[72029.244809] hpet1: lost 1286 rtc interrupts > <4>[72251.173477] hpet1: lost 458 rtc interrupts > <4>[72398.366782] hpet1: lost 1172 rtc interrupts > <4>[73218.363935] hpet1: lost 249 rtc interrupts > <4>[73280.472760] hpet1: lost 1225 rtc interrupts > <4>[73458.506121] hpet1: lost 398 rtc interrupts > <4>[73504.331306] hpet1: lost 183 rtc interrupts > <4>[73596.270009] hpet1: lost 386 rtc interrupts > <4>[73858.989240] hpet1: lost 320 rtc interrupts > <4>[73903.326725] hpet1: lost 88 rtc interrupts > <4>[74170.760022] hpet1: lost 622 rtc interrupts > <4>[74517.835154] hpet1: lost 216 rtc interrupts > <4>[74609.630248] hpet1: lost 375 rtc interrupts > <4>[74884.937850] hpet1: lost 1126 rtc interrupts > <4>[74886.291818] hpet1: lost 86 rtc interrupts > <4>[74902.415909] hpet1: lost 1031 rtc interrupts > <4>[75134.991339] hpet1: lost 1140 rtc interrupts > <4>[75139.948051] hpet1: lost 316 rtc interrupts > <4>[75183.709393] hpet1: lost 50 rtc interrupts > <4>[75486.178692] hpet1: lost 115 rtc interrupts > <4>[75634.770692] hpet1: lost 1261 rtc interrupts > <4>[76322.311822] hpet1: lost 18 rtc interrupts > <4>[76941.322201] hpet1: lost 1128 rtc interrupts > <4>[77341.716930] hpet1: lost 884 rtc interrupts > <4>[77347.532830] hpet1: lost 371 rtc interrupts > <4>[77348.620826] hpet1: lost 68 rtc interrupts > <4>[77362.430004] divide error: 0000 [#1] SMP PTI > <4>[77362.430139] Modules linked in: netconsole configfs tcp_diag inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer drm_kms_helper snd syscopyarea sysfillrect sysimgblt fb_sys_fops ttm soundcore drm 8021q garp mrp stp llc ppdev i2c_piix4 kvm_intel input_leds kvm irqbypass parport_pc fuse lp parport xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor > <4>[77362.431347] async_tx xor raid6_pq libcrc32c raid1 raid0 floppy multipath linear ata_generic pata_acpi [last unloaded: netconsole] > <4>[77362.431545] CPU: 0 PID: 39134 Comm: skynet.heartbea Not tainted 4.14.15-5 #1 > <4>[77362.431720] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 > <4>[77362.432013] task: ffff8800964adcc0 task.stack: ffffc900030bc000 > <4>[77362.432185] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40 > <4>[77362.432348] RSP: 0018:ffffc900030bfa90 EFLAGS: 00010046 > <4>[77362.432552] RAX: 00000000000d59f8 RBX: ffff8801348d6900 RCX: 000000000000001e > <4>[77362.432725] RDX: 0000000000000000 RSI: ffff8801348c1da0 RDI: 0000000000000000 > <4>[77362.432992] RBP: ffff8801348c1c00 R08: 0000000000000001 R09: 0000000000000001 > <4>[77362.433214] R10: ffff8800b9eacf30 R11: ffff8801348c1c00 R12: ffff880134decb60 > <4>[77362.433466] R13: 0000000000000000 R14: ffff8800b9eacf30 R15: ffffc900030bfb38 > <4>[77362.433696] FS: 00007fe28c721700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000 > <4>[77362.433948] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > <4>[77362.434125] CR2: 0000000001370ec8 CR3: 0000000135b2c000 CR4: 00000000000006f0 > <4>[77362.434328] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > <4>[77362.434571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > <4>[77362.434828] Call Trace: > <4>[77362.434997] bfq_insert_requests+0xda7/0xf60 > <4>[77362.435150] blk_mq_sched_insert_request+0xf9/0x150 > <4>[77362.435366] blk_mq_make_request+0x1da/0x630 > <4>[77362.435516] generic_make_request+0xfb/0x2d0 > <4>[77362.435701] ? submit_bio+0x73/0x150 > <4>[77362.435827] submit_bio+0x73/0x150 > <4>[77362.436100] xfs_submit_ioend+0x80/0x1a0 [xfs] > <4>[77362.436339] xfs_vm_writepages+0xa8/0xd0 [xfs] > <4>[77362.436549] do_writepages+0x17/0x70 > <4>[77362.436692] __filemap_fdatawrite_range+0xa7/0xe0 > <4>[77362.436884] filemap_write_and_wait_range+0x30/0x80 > <4>[77362.437106] xfs_setattr_size+0x121/0x380 [xfs] > <4>[77362.437313] xfs_vn_setattr+0x6b/0x90 [xfs] > <4>[77362.437472] notify_change+0x2da/0x400 > <4>[77362.437663] do_truncate+0x5b/0x90 > <4>[77362.437842] do_sys_ftruncate.constprop.18+0x10b/0x170 > <4>[77362.438033] ? SyS_write+0x66/0x90 > <4>[77362.438251] entry_SYSCALL_64_fastpath+0x20/0x83 > <4>[77362.438429] RIP: 0033:0x7fe293e8bd87 > <4>[77362.438586] RSP: 002b:00007fe28c71fb28 EFLAGS: 00000246 > <4>[77362.438587] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 0f 48 > > >> Thanks, >> Paolo >>> Paolo >>> >>>> >>>>> Thanks a lot, >>>>> Paolo >>>>>> >>>>>>>> update_thr_responsiveness_params(bfqd); >>>>>>>> >>>>>>>> reset_computation: