On Thu, Nov 30, 2017 at 11:22:23AM -0500, Larkin Lowrey wrote:
> Seems locks are involved, and in a not so nice way.
>
> I had been chasing a problem that I thought was with the SAS
> controllers, where I/O to the drives would stall for extended periods
> of time. The data from iostat would show 100% utilization but zero I/O
> and an avgqu-sz of 1 or 2. Once the I/O resumed, the await value would
> be exactly the length of time the I/O had been stalled. I was sure it
> was the SAS controller or driver, but it turned out to be the raid6
> journal! During the periods of stalled I/O, the mdX_raid6 process
> would go to 100% CPU, and when the I/O resumed that process would drop
> back down to ~50%. When I switched back to resync (i.e. disabled the
> journal), those I/O stalls went away. It appears to me that the bad
> locking behavior is the culprit.
>
> My priority now is to disable the journals on my two remaining arrays,
> which boot up with mdX_raid6 at 100% CPU. Because of their state,
> mdadm is unable to act on them at all. How can I disable the journal
> of these arrays while they are 'inactive'?

I think we should fix the journal bug. Can you describe how you trigger
this? Does anything abnormal happen?

Thanks,
Shaohua
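
One possible way out of the 'inactive' state: stop the half-assembled
array and force-assemble it from the data devices alone, leaving the
journal device out. This is an untested sketch, not a procedure --
whether the kernel will start an array whose journal is missing (it may
come up read-only) depends on the kernel and mdadm versions. The device
names below are the md125 members listed further down in this thread:

  # Stop the stuck, inactive array.
  mdadm --stop /dev/md125

  # Check that the data members agree on the event count before forcing
  # anything; the journal device (/dev/md127p4) is deliberately omitted.
  mdadm --examine /dev/sd[j-q]1 | grep -E 'Events|Device Role'

  # Force-assemble from the data devices only.
  mdadm --assemble --force --run /dev/md125 /dev/sd[j-q]1

  # If the array starts read-only because its journal is absent, switch
  # it back to read-write once satisfied that it is intact.
  mdadm --readwrite /dev/md125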
>
> --Larkin
>
> On 11/27/2017 9:42 PM, Larkin Lowrey wrote:
> > The kernel thread is spending a lot of time in handle_active_stripes,
> > but it is not staying there: I see it call handle_stripe, then
> > __release_stripe, then handle_stripe again. So it is clearly exiting
> > handle_active_stripes but getting dumped right back in. Presumably,
> > handle_active_stripes is not completing the necessary work, and the
> > kernel thread keeps calling it, expecting the pending work to finish.
> >
> > When the system boots up, an mdadm process runs at near 100% CPU
> > with sporadic reads of the journal device. After ~30s or so, the I/O
> > stops, the mdadm process disappears, and the mdX_raid6 kernel thread
> > starts taking 100% CPU.
> >
> > > RIP: 0010:do_release_stripe+0x52/0x410 [raid456]
> > > RSP: 0018:ffffbb7d41dcbcc0 EFLAGS: 00000017
> > > RAX: 0000000002000000 RBX: ffff8f10c819bf20 RCX: 0000000000000000
> > > RDX: 0000000000000002 RSI: ffff8f10c819bf20 RDI: ffff8f114a2ad400
> > > RBP: ffffbb7d41dcbce8 R08: ffff8f113e3dd888 R09: 0000000000000000
> > > R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8f10c819bf30
> > > R13: ffff8f113e3ddb58 R14: ffff8f113e3dd800 R15: 0000000000000003
> > > FS:  0000000000000000(0000) GS:ffff8f115ed40000(0000)
> > > knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007fb7675e1000 CR3: 0000000412cf5000 CR4: 00000000000406e0
> > > Call Trace:
> > >  __release_stripe+0x16/0x20 [raid456]
> > >  handle_active_stripes.isra.60+0x44e/0x5a0 [raid456]
> > >  raid5d+0x42e/0x630 [raid456]
> >
> > --Larkin
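
Since a sysrq dump gives only a single sample, the loop described above
could be confirmed from user space with ftrace's function_graph tracer,
which would show how often raid5d re-enters handle_active_stripes and
how long each handle_stripe call takes. A minimal sketch, assuming
tracefs is mounted at /sys/kernel/debug/tracing, the kernel has
CONFIG_FUNCTION_GRAPH_TRACER, and the spinning thread is named
md124_raid6 (adjust for the affected array):

  cd /sys/kernel/debug/tracing
  echo 0 > tracing_on
  echo function_graph > current_tracer
  # Trace only the raid456 entry points named in the trace above; globs
  # are accepted, so 'handle_active_stripes*' matches the .isra variant.
  echo 'raid5d' > set_ftrace_filter
  echo 'handle_stripe' >> set_ftrace_filter
  echo 'handle_active_stripes*' >> set_ftrace_filter
  # Restrict tracing to the spinning kernel thread.
  pgrep md124_raid6 > set_ftrace_pid
  echo 1 > tracing_on; sleep 5; echo 0 > tracing_on
  head -n 50 trace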
> >
> > On 11/24/2017 2:20 PM, Larkin Lowrey wrote:
> > > After hard-rebooting, this instance (stripe_cache_active: 2)
> > > assembled just fine on boot. The next time I encountered this, the
> > > array was 'inactive' on boot. There was a flurry of I/O initially,
> > > which seemed to indicate journal replay with the array then
> > > becoming 'active', but the I/O ceased without the array ever
> > > becoming active.
> > >
> > > This time...
> > >
> > > stripe_cache_active: 2376
> > >
> > > > md125 : inactive md127p4[9](J) sdk1[2] sdl1[3] sdn1[5] sdo1[6]
> > > >         sdm1[4] sdj1[1] sdq1[8] sdp1[7]
> > > >       31258219068 blocks super 1.2
> > >
> > > > # mdadm -D /dev/md125
> > > > /dev/md125:
> > > >            Version : 1.2
> > > >      Creation Time : Thu Oct 19 10:11:35 2017
> > > >         Raid Level : raid6
> > > >      Used Dev Size : 18446744073709551615
> > > >       Raid Devices : 8
> > > >      Total Devices : 9
> > > >        Persistence : Superblock is persistent
> > > >
> > > >        Update Time : Fri Nov 24 13:41:38 2017
> > > >              State : active, FAILED, Not Started
> > > >     Active Devices : 8
> > > >    Working Devices : 9
> > > >     Failed Devices : 0
> > > >      Spare Devices : 0
> > > >
> > > >             Layout : left-symmetric
> > > >         Chunk Size : 64K
> > > >
> > > > Consistency Policy : journal
> > > >
> > > >               Name : ########:3
> > > >               UUID : de6a2ce0:1a4c510f:d7c89da4:1215a312
> > > >             Events : 156844
> > > >
> > > >     Number   Major   Minor   RaidDevice State
> > > >        -       0        0        0      removed
> > > >        -       0        0        1      removed
> > > >        -       0        0        2      removed
> > > >        -       0        0        3      removed
> > > >        -       0        0        4      removed
> > > >        -       0        0        5      removed
> > > >        -       0        0        6      removed
> > > >        -       0        0        7      removed
> > > >
> > > >        -     259        3        -      spare   /dev/md127p4
> > > >        -       8      225        5      sync    /dev/sdo1
> > > >        -       8      209        4      sync    /dev/sdn1
> > > >        -       8      193        3      sync    /dev/sdm1
> > > >        -       8      177        2      sync    /dev/sdl1
> > > >        -       8      161        1      sync    /dev/sdk1
> > > >        -       8      145        0      sync    /dev/sdj1
> > > >        -      65        1        7      sync    /dev/sdq1
> > > >        -       8      241        6      sync    /dev/sdp1
> > >
> > > --Larkin
> > >
> > > On 11/23/2017 1:22 PM, Larkin Lowrey wrote:
> > > > Sometimes, stopping a raid6 array (with a journal) hangs: the
> > > > mdX_raid6 process pegs at 100% CPU and there is no I/O. It looks
> > > > like it's stuck in an infinite loop.
> > > >
> > > > Kernel: 4.13.13-200.fc26.x86_64
> > > >
> > > > The stack trace (echo l > /proc/sysrq-trigger) is always the same:
> > > >
> > > > > handle_stripe+0x10c/0x2140 [raid456]
> > > > > ? pick_next_task_fair+0x491/0x550
> > > > > handle_active_stripes.isra.60+0x3e5/0x5a0 [raid456]
> > > > > raid5d+0x42e/0x630 [raid456]
> > > > > ? prepare_to_wait_event+0x79/0x160
> > > > > md_thread+0x125/0x170
> > > > > ? md_thread+0x125/0x170
> > > > > ? finish_wait+0x80/0x80
> > > > > kthread+0x125/0x140
> > > > > ? state_show+0x2f0/0x2f0
> > > > > ? kthread_park+0x60/0x60
> > > > > ? do_syscall_64+0x67/0x140
> > > > > ret_from_fork+0x25/0x30
> > > >
> > > > The array is healthy, has a journal, and writes had been idle for
> > > > several minutes prior to running 'mdadm --stop'.
> > > >
> > > > > md124 : active raid6 sdt1[6] sds1[5] sdw1[1] sdx1[2] sdy1[3]
> > > > >         sdu1[7] sdv1[8] sdz1[4] md125p4[9](J)
> > > > >       23442092928 blocks super 1.2 level 6, 64k chunk,
> > > > >       algorithm 2 [8/8] [UUUUUUUU]
> > > >
> > > > stripe_cache_active: 2
> > > > stripe_cache_size: 32768
> > > > array_state: write-pending
> > > > journal_mode: write-through [write-back]
> > > > consistency_policy: journal
> > > >
> > > > --Larkin
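
To help answer the "how do you trigger this" question, the state quoted
throughout this thread (/proc/mdstat plus the md sysfs attributes) can
be logged around the hang with a small watch loop. A hypothetical
helper; MD is whichever array is affected:

  #!/bin/sh
  # Log mdstat and the md sysfs attributes quoted above every few
  # seconds, so the transition into the stuck state is captured.
  MD=md124
  while :; do
      date
      cat /proc/mdstat
      for f in stripe_cache_active stripe_cache_size array_state \
               journal_mode consistency_policy; do
          printf '%s: %s\n' "$f" "$(cat /sys/block/$MD/md/$f)"
      done
      sleep 5
  done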