On Thu, May 28, 2020 at 3:18 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote: > > On 5/27/20 3:36 PM, Michal Soltys wrote: > > On 5/27/20 1:37 AM, Song Liu wrote: > >> [...] > To expand on the above (mdadm was started yesterday): > > - it went on for ~19+ hours - like previously > - the last piece in trace output: > > xs22:/home/msl☠ tail -n 4 trace.out ; echo > 09:08:12 40739 40739 mdadm r5l_recovery_read_page 1852d8b0 > 09:08:12 40739 40739 mdadm r5l_recovery_read_page 1852d8b8 > 09:08:12 40739 40739 mdadm r5l_recovery_read_page 1852d8c0 > 09:08:12 40739 40739 mdadm r5l_recovery_read_page 1852d8c8 > > So it went through ~120gb of journal (the journal device is 256gb). This (processing more than 10GB) should not happen. So there must be something wrong... > Non-assembling raid issue aside (assuming it's unrelated) - why does it > take such enormous amount of time ? That's average of like 1.7mb/s This is slow because we are processing 4kB at a time. But it is slower than my expectation. What are the speeds measured at boot? Something like the following in dmesg: [ 3.851064] raid6: avx2x4 gen() 17328 MB/s [ 3.877404] raid6: avx2x4 xor() 11033 MB/s [ 3.903410] raid6: avx2x2 gen() 14710 MB/s [ 3.929410] raid6: avx2x2 xor() 8921 MB/s [ 3.955408] raid6: avx2x1 gen() 12687 MB/s [ 3.981405] raid6: avx2x1 xor() 8414 MB/s [ 4.007409] raid6: sse2x4 gen() 9105 MB/s [ 4.033403] raid6: sse2x4 xor() 5724 MB/s [ 4.059408] raid6: sse2x2 gen() 7570 MB/s [ 4.085410] raid6: sse2x2 xor() 4880 MB/s [ 4.111409] raid6: sse2x1 gen() 6484 MB/s [ 4.137402] raid6: sse2x1 xor() 4412 MB/s For the next step, could you please try the following the beginning of --assemble run? trace.py -M 10 'r::r5l_recovery_verify_data_checksum()(retval != 0)' Thanks, Song PS: Some key functions here are inlined and thus not easily traceable. We will probably need to patch the kernel for further debugging. > > Around 09:08 today, the recovery line popped in the dmesg as previously: > > [May27 12:58] md: md124 stopped. > [ +0.006779] md/raid:md124: device sdf1 operational as raid disk 0 > [ +0.006929] md/raid:md124: device sdj1 operational as raid disk 3 > [ +0.006903] md/raid:md124: device sdi1 operational as raid disk 2 > [ +0.006914] md/raid:md124: device sdh1 operational as raid disk 1 > [ +0.007426] md/raid:md124: raid level 5 active with 4 out of 4 > devices, algorithm 2 > [May28 09:08] md/raid:md124: recovering 24667 data-only stripes and > 50212 data-parity stripes