Re: problem with recovered array

Did you test writing on the array?  And if you did, did you use a big
enough file and/or bypass/disable the cache?  With the dirty* settings
at their (high) defaults the kernel can cache up to 20% of RAM without
writing it to disk.  RAID5/6 writes are much slower than reads.  If
you have a single RAID5/6 array of spinning disks I would be surprised
if it could sustain anything close to 175MB/sec.  Mine struggles to
sustain even 100MB/sec on writes.
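
For example, either of these takes the page cache out of the picture
(standard GNU dd options; testfile.out is just a placeholder for a file
on the array):
# bypass the page cache entirely:
dd if=/dev/zero of=testfile.out bs=256K count=4000 oflag=direct status=progress
# or keep the cache but make dd flush before reporting the final rate:
dd if=/dev/zero of=testfile.out bs=256K count=4000 conv=fsync status=progress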

But with my 3MB/5MB _bytes settings (they let the cache work, but
prevent massive amounts of data from piling up in it):
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
dd if=/dev/zero bs=256K of=testfile.out count=4000 status=progress
501481472 bytes (501 MB, 478 MiB) copied, 8 s, 62.5 MB/s^C
2183+0 records in
2183+0 records out
572260352 bytes (572 MB, 546 MiB) copied, 8.79945 s, 65.0 MB/s
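
In case it is useful, those limits can be applied like this (same values
as above; note the kernel zeroes the matching *_ratio settings once the
*_bytes ones are written):
sysctl -w vm.dirty_background_bytes=3000000
sysctl -w vm.dirty_bytes=5000000
# to keep them across reboots, put "vm.dirty_background_bytes = 3000000" and
# "vm.dirty_bytes = 5000000" in a file such as /etc/sysctl.d/99-writeback.conf
# (that file name is just an example)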

Using the Fedora defaults:
vm.dirty_background_ratio = 20
vm.dirty_ratio = 20
dd if=/dev/zero bs=256K of=testfile.out count=4000 status=progress
4000+0 records in
4000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.687841 s, 1.5 GB/s

Note that on the 2nd test writing was still happening for the next
5-8 seconds after dd exited, giving the appearance of being much faster
than it really is.  As the test size is increased the reported speed
drops, because more of the data actually has to reach the disks before
dd finishes.  On my system with the Fedora default settings (64G RAM),
with a 16G file I still get 300MB/sec even though the underlying array
can only really do around 70-100MB/sec.
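
If you want to watch that happen, something like this works (the count
here is only meant to produce a 16G file, far larger than the dirty limit):
# shows how much data is still waiting for writeback after dd exits:
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
# or make dd's timing include the flush, so the reported rate is honest:
dd if=/dev/zero bs=256K of=testfile.out count=64000 conv=fsync status=progress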

RAID5/6 write throughput typically drops to about the rate of a single
disk because of all the extra parity work the array has to do.
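
That overhead is usually visible directly: during a large write the
member disks show reads as well as writes (for anything smaller than a
full stripe), while the md device itself shows only writes.  Something
along these lines shows it (md127 matches your df output; the sdX names
are placeholders for your members):
iostat -x md127 sdb sdc sdd sde 2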

Do you happen to know how fragmented the 8GB VM file is?  (filefrag
<filename>).  Each fragment requires a separate seek of several ms.
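
For example (the path is only a guess at where the image lives):
filefrag /var/lib/libvirt/images/guest.img
# add -v to list the individual extents if the count looks large:
filefrag -v /var/lib/libvirt/images/guest.img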

On Wed, Nov 1, 2023 at 8:08 AM <eyal@xxxxxxxxxxxxxx> wrote:
>
> On 01/11/2023 21.30, Roger Heflin wrote:
> > Did you check with iostat/sar?    (iostat -x 2 5).  The md
> > housekeeping/background stuff does not show on the md device itself,
> > but it shows on the underlying disks.
>
> Yes, I have iostat on both the md device and the components and I see sparse activity on both.
>
> > It might also be related to the bitmap keeping track of how far the
> > missing disk is behind.
>
> This may be the case. If the disk is re-added it may help to know this,
> but if a new disk is added then the full disk will be rebuilt anyway.
>
> > Small files are troublesome.    A tiny file takes several reads and writes.
>
> Yes, but the array is so fast that this should not be a problem.
> The rsync source is 175MB/s and the target array is 800MB/s so I do not
> see how the writing can slow the copy.
>
> In one case I (virsh) saved a VM, which created one 8GB file, which took
> many hours to be written.
>
> > I think the bitmap is tracking how many writes need to be done on the
> > missing disk, and if so then until the new disk gets put back will not
> > start cleaning up.
> >
> >
> > On Tue, Oct 31, 2023 at 4:40 PM <eyal@xxxxxxxxxxxxxx> wrote:
> >>
> >> On 31/10/2023 21.24, Roger Heflin wrote:
> >>> If your directory entries are large (lots of small files in a
> >>> directory) then the recovery of the missing data could be just enough
> >>> to push your array too hard.
> >>
> >> Nah, the directory I am copying has nothing really large, and the target directory is created new.
> >>
> >>> find /<mount> -type d -size +1M -ls     will find large directories.
> >>>
> >>> do a ls -l <largedirname> | wc -l and see how many files are in there.
> >>>
> >>> ext3/4 has issues with really big directories.
> >>>
> >>> The perf top showed just about all of the time was being spent in
> >>> ext3/4 threads allocating new blocks/directory entries and such.
> >>
> >> Just in case there is an issue, I will copy another directory as a test.
> >> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
> >> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one was gone after I killed the copy.
> >>
> >> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
> >>
> >>> How much free space does the disk show in df?
> >>
> >> Enough  room:
> >>          /dev/md127       55T   45T  9.8T  83% /data1
> >>
> >> I still suspect an issue with the array after it was recovered.
> >>
> >> A related issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
> >> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
> >>
> >> Is there some background md process? Can it be hurried to completion?
> >>
> >>> On Tue, Oct 31, 2023 at 4:29 AM <eyal@xxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 31/10/2023 14.21, Carlos Carvalho wrote:
> >>>>> Roger Heflin (rogerheflin@xxxxxxxxx) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >>>>>> look at  SAR -d output for all the disks in the raid6.   It may be a
> >>>>>> disk issue (though I suspect not given the 100% cpu show in raid).
> >>>>>>
> >>>>>> Clearly something very expensive/deadlockish is happening because of
> >>>>>> the raid having to rebuild the data from the missing disk, not sure
> >>>>>> what could be wrong with it.
> >>>>>
> >>>>> This is very similar to what I complained about some 3 months ago. For me it
> >>>>> happens with an array in normal state. sar shows no disk activity and no
> >>>>> writes reach the array (reads happen normally), yet the flushd thread uses
> >>>>> 100% cpu.
> >>>>>
> >>>>> For the latest 6.5.* I can reliably reproduce it with
> >>>>> % xzcat linux-6.5.tar.xz | tar x -f -
> >>>>>
> >>>>> This leaves the machine with ~1.5GB of dirty pages (as reported by
> >>>>> /proc/meminfo) that it never manages to write to the array. I've waited for
> >>>>> several hours to no avail. After a reboot the kernel tree had only about 220MB
> >>>>> instead of ~1.5GB...
> >>>>
> >>>> More evidence that the problem relates to the cache not flushed to disk.
> >>>>
> >>>> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> >>>> But it also evicts it from the cache, so nothing accumulates.
> >>>> The result is a slower than otherwise copying but it streams with no pauses.
> >>>>
> >>>> It seems that the array is slow to sync files somehow. Mythtv has no problems because it writes
> >>>> only a few large files. rsync copies a very large number of small files which somehow triggers
> >>>> the problem.
> >>>>
> >>>> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
> >>>>
> >>>> Just my guess.
> >>>>
> >>>> BTW I ran fsck on the fs (on the array) and it found no fault.
> >>>>
> >>>> --
> >>>> Eyal at Home (eyal@xxxxxxxxxxxxxx)
> >>>>
> >>
> >> --
> >> Eyal at Home (eyal@xxxxxxxxxxxxxx)
> >>
>
> --
> Eyal at Home (eyal@xxxxxxxxxxxxxx)
>



