Help diagnosing lvm ioctl log spam

Hello people, we have been experiencing an issue with lvm2-thin on
_some_ of our production servers: seemingly out of nowhere,
lvm2/device-mapper starts spamming error logs, and I can't seem to
trace down the root cause.

This is what the logs look like:
Oct  9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
failed: Inappropriate ioctl for device
Oct  9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
Inappropriate ioctl for device

It writes this to rsyslog so fast that not even tail can keep up
over ssh. I'm really lost as to the reason, or how to trace this back
to a process. We use lvm-thin to host virtual machines via libvirt:
several thin volumes under a single thin pool, one per virtual
machine. The thin pool is created on top of a VG, which in turn sits
on top of an md-raid1 device.
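For reference, the stack was built roughly along these lines (reconstructed from memory; device names match our setup, but sizes and option values here are illustrative rather than the exact commands we ran):

```shell
# md-raid1 across the two NVMe drives:
mdadm --create /dev/md127 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# PV and VG on top of the array:
pvcreate /dev/md127
vgcreate lightning-nvme /dev/md127

# One big thin pool (auto-named lvol1), then one thin volume per VM
# (the 50G size and vm-example name are illustrative):
lvcreate --type thin-pool -l 95%FREE lightning-nvme
lvcreate --type thin -V 50G --thinpool lvol1 -n vm-example lightning-nvme
```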

/dev/md127:
           Version : 1.2
     Creation Time : Wed May  5 16:56:09 2021
        Raid Level : raid1
        Array Size : 1953382464 (1862.89 GiB 2000.26 GB)
     Used Dev Size : 1953382464 (1862.89 GiB 2000.26 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Oct  9 07:31:44 2024
             State : active
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name :  U5bW8JT7:1  (local to host  U5bW8JT7)
              UUID : 5072120c:b114a0aa:51438cf8:8ebe58b8
            Events : 70525

    Number   Major   Minor   RaidDevice State
       0     259        0        0      active sync   /dev/nvme0n1
       1     259        1        1      active sync   /dev/nvme1n1

I do not know which device-mapper device this
`CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool`
name refers to; I'm unable to find it. However, the -tpool
suffix suggests it's the thin pool, so here is what lvdisplay says
about it:

--- Logical volume ---
  LV Name                lvol1
  VG Name                lightning-nvme
  LV UUID                mOowip-TAFu-M7hq-zHz6-pRVv-UaNO-9FGzeq
  LV Write Access        read/write
  LV Creation host, time U5bW8JT7, 2021-05-05 16:57:13 +0000
  LV Pool metadata       lvol1_tmeta
  LV Pool data           lvol1_tdata
  LV Status              available
  # open                 72
  LV Size                <1.80 TiB
  Allocated pool data    81.30%
  Allocated metadata     49.28%
  Current LE             471040
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
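For what it's worth, my understanding is that the LVM-... string in the log is the device-mapper UUID rather than a device name, which would explain why I can't find a node by that name. Something like this should locate it (I haven't been able to try it while the spam is happening, so treat it as a sketch):

```shell
# List dm devices with their UUIDs; the LVM-...-tpool string from the
# log should appear in the uuid column, not as a device name:
dmsetup info -c -o name,uuid

# Or select it directly by UUID (string copied verbatim from the log):
dmsetup info -c -S 'uuid=LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool'
```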

The start of the device name suggests it's part of the VG, so here's
also what vgdisplay says:

--- Volume group ---
  VG Name               lightning-nvme
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  52521
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                72
  Open LV               70
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <1.82 TiB
  PE Size               4.00 MiB
  Total PE              476899
  Alloc PE / Size       471098 / <1.80 TiB
  Free  PE / Size       5801 / 22.66 GiB
  VG UUID               CP5Gw8-QrWL-qwhB-cJL8-7R1m-c9Q9-KTBtQQ
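If I'm reading it right, the name from the log is just the two UUIDs above glued together: "LVM-" plus the VG UUID plus the LV UUID with the hyphens stripped, plus the "-tpool" suffix. A quick sanity check (UUIDs copied from the vgdisplay and lvdisplay output above):

```python
# Reassemble the dm UUID from the log out of the VG and LV UUIDs,
# copied verbatim from the vgdisplay/lvdisplay output above.
vg_uuid = "CP5Gw8-QrWL-qwhB-cJL8-7R1m-c9Q9-KTBtQQ".replace("-", "")
lv_uuid = "mOowip-TAFu-M7hq-zHz6-pRVv-UaNO-9FGzeq".replace("-", "")
print("LVM-" + vg_uuid + lv_uuid + "-tpool")
# -> LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
```

So the failing device is indeed the lvol1 thin pool in the lightning-nvme VG.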

Any tips on how to trace this back to a process, a cause, a bug, or
anything else are highly appreciated. I literally only have these two
log lines to go by; everything else works as expected until the
server crashes after the logs fill the disk. Rebooting the node stops
the log spam until it starts again, seemingly at random, after
several days.
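For the archives, here is roughly what I plan to run the next time it happens, before rebooting (untested against the failing state, so a sketch only):

```shell
# Identify the process behind the "lvm[8020]" log prefix
# (presumably dmeventd or an lvm monitoring process):
ps -fp 8020

# Check which LVs are being monitored by dmeventd:
lvs -o lv_name,vg_name,seg_monitor

# Watch the failing ioctl loop on the live process:
strace -f -p 8020 -e trace=ioctl
```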

Running Ubuntu 18.04.5 LTS - lvm2 2.02.176-4.1ubuntu3.18.04.3

--
Fabricio Winter



