On Wed, Jul 22, 2020 at 2:55 PM Vojtech Myslivec <vojtech@xxxxxxxxxxxx> wrote:
>
> This host serves as a backup server and it runs regular backup tasks.
> When a backup is performed, one (read only) snapshot of one of my
> subvolumes on the btrfs filesystem is created and one snapshot is
> deleted afterwards.

This is likely to be a fairly metadata-centric workload: lots of small
file changes, and metadata (file system) changes. Parity raid
performance in such workloads is often not great. It's just the way it
goes.

But what does iostat tell you about drive utilization during these
backups? And during the problem? Are they balanced? Are they nearly
fully utilized?

>
> Once in several days (irregularly, as I noticed), the `md1_raid6`
> process starts to consume 100 % of one CPU core and during that time,
> creating a snapshot (during the regular backup process) of a btrfs
> subvolume get stucked. User space processes accessing this particular
> subvolume then start to hang in *disk sleep* state. Access to other
> subvolumes seems to be unaffected until another backup process tries to
> create another snapshot (of different subvolume).

A snapshot results in a flush. And a snapshot delete results in the
btrfs-cleaner process, which involves a lot of reads and writes to track
down the extents to be freed. But your call traces seem stuck in
snapshot creation.

Can you provide mdadm -E and -D output respectively? I wonder if the
setup is just not well suited for the workload. The default mdadm 512KiB
chunk may not align well with this workload. Also, a complete dmesg
might be useful.

>
> In most cases, after several "IO" actions like listing files (ls),
> accessing btrfs information (`btrfs filesystem`, `btrfs subvolume`), or
> accessing the device (with `dd` or whatever), the filesystem gets
> magically unstucked and `md1_raid6` process released from its "live
> lock" (or whatever it is cycled in). Snapshots are then created as
> expected and all processes finish their job.
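If sysstat/iostat isn't handy, per-member balance can also be
sanity-checked straight from /proc/diskstats. A minimal sketch (the
sd[a-z] pattern is an assumption that the six md members have plain
SATA device names):

```shell
# Field 3 of /proc/diskstats is the device name, field 13 is total time
# spent doing I/O in milliseconds. Sample this twice, a few seconds
# apart, during a backup; roughly equal per-disk deltas mean the raid6
# members are evenly utilized, one outlier means a slow or sick disk.
awk '$3 ~ /^sd[a-z]$/ { printf "%s %s\n", $3, $13 }' /proc/diskstats
```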
>
> Once in a week approximately, it takes tens of minutes to unstuck these
> processes. During that period, I try to access affected btrfs subvolumes
> in several shell sessions to wake it up.

Could be lock contention on the subvolume.

> However, there are some more "blocked" tasks, like `btrfs` and
> `btrfs-transaction` with call trace also included.
>
>
> Questions
> =========
>
> 1. What should be the cause of this problem?
> 2. What should I do to mitigate this issue?
> 3. Could it be a hardware problem? How can I track this?

Not sure yet. Need more info:

dmesg
mdadm -E
mdadm -D
btrfs filesystem usage /mountpoint
btrfs device stats /mountpoint

> What I have done so far
> =======================
>
> - I keep the system up-to-date, with latest stable kernel provided by
>   Debian packages

5.5 is fairly recent and OK. It should be fine, except you're having a
problem, so... it could be a bug that's fixed already, or a new bug. Or
it could be a suboptimal configuration for the workload, which can be
difficult to figure out.

>
> - I run both `btrfs scrub` and `fsck.btrfs` to exclude btrfs filesystem
>   issue.
>
> - I have read all the physical disks (with dd command) and perform SMART
>   self tests to exclude disks issue (though read/write badblocks were
>   not checked yet).

I wouldn't worry too much about badblocks. More important is
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

But you report using enterprise drives. They will invariably have an SCT
ERC time of ~70 deciseconds, which is well below the kernel's SCSI
command timer, ergo not a problem. But it's fine to double check that.

> - I have also moved all the files out of the affected filesystem, create
>   a new btrfs filesystem (with recent btrfs-progs) and moved files
>   back. This issue, none the less, appeared again.

Exactly the same configuration? Anything different at all?
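That double check can be scripted. A sketch, assuming smartmontools is
installed and the six members are sda through sdf (adjust the glob to
your actual device names):

```shell
# For each md member, print the drive's SCT ERC setting and the kernel's
# SCSI command timer (in seconds). ERC of 70 deciseconds (7 s) under the
# default 30 s command timer is fine; "SCT Error Recovery Control:
# Disabled" is the timeout-mismatch case the wiki page describes.
for d in /dev/sd[a-f]; do
    echo "== $d =="
    smartctl -l scterc "$d"
    cat "/sys/block/${d##*/}/device/timeout"
done
```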
>
> - I have tried to attach strace to cycled md1 process, but
>   unsuccessfully (is it even possible to strace running kernel thread?)

You want to do 'cat /proc/<pid>/stack'

> Some detailed facts
> ===================
>
> OS
> --
>
> - Debian 10 buster (stable release)
> - Linux kernel 5.5 (from Debian backports)
> - btrfs-progs 5.2.1 (from Debian backports)

btrfs-progs 5.2.1 is OK, but I suggest something newer before using
'btrfs check --repair'. Just to be clear, --repair is NOT indicated
right now.

> Hardware
> --------
>
> - 8 core/16 threads amd64 processor (AMD EPYC 7251)
> - 6 SATA HDD disks (Seagate Enterprise Capacity)
> - 2 SSD disks (Intel D3-S4610)

It's not related, but your workload might benefit from the
'compress=zstd:1' mount option: compress everything across the board.
Chances are these backups contain a lot of compressible data, and you
have significant CPU capacity relative to the hardware. This isn't
important to do right now, though. Fix the problem first; optimize
later.

> btrfs
> -----
> - Several subvolumes, tens of snapshots
> - Default mount options: rw,noatime,space_cache,subvolid=5,subvol=/
> - No compression, autodefrag or so
> - I have tried to use quotas in the past but they are disabled for
>   a long time

I don't think this is the only thing going on, but consider
space_cache=v2. You can mount with '-o clear_cache', then umount, then
mount again with '-o space_cache=v2' to convert. The conversion is
persistent (unless the cache is invalidated by a repair, in which case
the default v1 cache is used again). v2 will soon be the default.

>
> Usage
> -----
>
> - Affected RAID6 block device is directly formatted to btrfs
> - This filesystem is used to store backups
> - Backups are performed via rsnapshot
> - rsnapshot is configured to use btrfs snapshots for hourly and daily
>   backups and rsync to copy new backups

How many rsnapshot and rsync tasks are happening concurrently for a
subvolume at the time the subvolume becomes unresponsive?

--
Chris Murphy
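P.S. The /proc/<pid>/stack capture can be looped so it catches the
stall in the act. A sketch (md1_raid6 is the thread name from your
report; the log path is just an example; needs root, stop with Ctrl-C):

```shell
# Append the raid thread's kernel stack to a log every 10 seconds.
# pgrep works for kernel threads too; if the thread isn't found, the
# loop is simply skipped.
pid=$(pgrep -x md1_raid6 || true)
while [ -n "$pid" ]; do
    date >> /tmp/md1_stack.log
    cat "/proc/$pid/stack" >> /tmp/md1_stack.log
    sleep 10
done
```

A few identical stacks in a row during the stall would show exactly
where md1_raid6 is spinning, which is more useful here than strace.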