Interesting suggestions here, thanks everyone!

I inspected the libvirt config and indeed the virtual machines are not
configured to propagate trim (unless the libvirt default is unmap, but I'm
pretty sure it's ignore). However, this is also the case on the newer
servers, which have not seen this issue (yet?), so I wonder if it could be
the ultimate underlying cause (e.g. after a few years of never running
trim, things slowed down by the gradual churn of the LVM volumes as VMs
are deleted and created). I'll be addressing this asap of course, but the
underlying devices do support trim (they are NVMes after all), so I
believe that if anything did end up issuing a trim, it would not cause an
ioctl error, since the operation is supported by the device.
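
For reference, the libvirt-side fix appears to be setting discard='unmap'
on the disk's driver element in the domain XML; the disk type, source and
target below are just placeholders, not our actual config:

    <disk type='block' device='disk'>
      <!-- discard='unmap' passes the guest's trim/discard requests
           through to the host block device -->
      <driver name='qemu' type='raw' discard='unmap'/>
      <source dev='/dev/vg0/vm-disk'/>
      <target dev='vda' bus='virtio'/>
    </disk>

Even with that in place, the guests still need either periodic fstrim runs
or the "-o discard" mount option before any trims are actually issued.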

As for dmeventd, that seems to have been the cause! I cleaned up rsyslog,
tailed the log and gave dmeventd a restart (dmeventd -R), and the errors
immediately stopped. Is dmeventd merely a monitoring/alert daemon, or does
it have any impact on functionality? (Mostly wondering whether restarting
the service like this is a "safe" operation.) And lastly, what could cause
it to get into this state? I can understand a fail condition entering a
fast retry loop, but what could put the disk into that failing condition
in the first place - what kind of operation was it doing that suddenly
stopped working, given that no IO issues were happening on the pool (at
least as far as I can tell - disk reads and writes are working just fine)?
Is this condition something that would warrant a bug report, or at least
the fast retry loop?

Once again, thanks everyone!

-Fabricio Winter

On Wed, Oct 9, 2024 at 9:21 PM Erwin van Londen <erwin@xxxxxxxxxxxxxxxxxx> wrote:
>
> Could it not be that, given the fact that this is a volume carved out of
> a thin pool, the underlying free space threshold is near or at capacity
> and the device mapper is busier doing garbage collection than serving
> actual application IOs? I've seen many nightmares with thin provisioned
> volumes when this is not managed properly.
>
> Have you used the "fstrim" command, or is the filesystem mounted with
> the "-o discard" option?
>
> Check some settings with, for example, "lvs -o
> lv_full_name,lv_health_status,lv_when_full". If it is a space issue,
> make sure that you have enough free space in your volume groups and
> configure "thin_pool_autoextend_threshold" and
> "thin_pool_autoextend_percent". Also turn on "monitoring" if that is not
> enabled by default.
>
> Another thing that could be an issue is that the filesystem has the "-o
> discard" mount option set, device mapper wants to propagate this to the
> underlying hardware (or in your case a hypervisor, which may also send
> it through to the underlying storage array), and that is not supported
> on that hardware. The "Inappropriate ioctl for device" message hints in
> that direction. Have there been movements of volumes to other
> provisioned disks, or changes on the hypervisor?
>
> Cheers
> Erwin
>
> On 10/10/24 00:52, David Teigland wrote:
> > On Wed, Oct 09, 2024 at 06:28:26AM -0300, Fabricio Winter wrote:
> >> Hello people, we have been experiencing an issue with lvm2-thin on
> >> _some_ of our production servers where out of nowhere
> >> lvm2/device-mapper starts spamming error logs and I can't really seem
> >> to trace down the root cause.
> >>
> >> This is what the logs look like:
> >> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
> >> LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
> >> failed: Inappropriate ioctl for device
> >> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
> >> Inappropriate ioctl for device
> >
> > It appears related to dmeventd monitoring the thin pools, and the kernel
> > returning ENOTTY when dmeventd does the DM_DEV_WAIT ioctl. Maybe there's
> > a fast retry loop in dmeventd on that error case rather than quitting.
> > I wonder if there's a way you could kill dmeventd when this happens.
> >
> > Dave
> >
> >
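
P.S. For anyone else who hits this: per-LV monitoring status can be
checked and re-enabled with something like the following (vg/pool is a
placeholder for the actual volume group and thin pool names):

    # show whether dmeventd is monitoring the thin pool
    lvs -o lv_name,seg_monitor vg/pool

    # re-enable monitoring if it reports "not monitored"
    lvchange --monitor y vg/pool

    # or restart dmeventd itself, preserving its monitoring state
    dmeventd -R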