Hi,
On 2024/06/04 15:46, Stuart D Gathman wrote:
> On Tue, 4 Jun 2024, Roger Heflin wrote:
>> My experience is that heavy disk I/O / batch disk I/O systems work
>> better with these values being smallish.
>>
>> I don't see a use case for having large values. It seems to have no
>> real upside and several downsides. Get the buffer size small enough
>> and you will still get pauses to clear the writes, but the pauses
>> will be short enough to not be a problem.
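
(For reference, I assume "these values" are the VM writeback sysctls
discussed earlier in the thread; a minimal sketch of pinning them
small, with purely illustrative numbers:

  # Start background writeback early and block writers well before the
  # dirty page cache can grow into multi-second flush pauses.
  sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
  sysctl -w vm.dirty_bytes=$((256*1024*1024))

Setting the *_bytes variants overrides the *_ratio ones; drop the
settings into /etc/sysctl.d/ to make them persistent.)
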
> Not a normal situation, but I should mention my recent experience.
> One of the disks in an underlying RAID was going bad. It still worked,
> but the disk struggled manfully with multiple retries and recalibrates
> to complete many reads/writes - i.e. it was extremely slow. I was
> running into all kinds of strange boundary conditions because of this.
> E.g. VMs were getting timeouts on their virtio disk devices, leading
> to file system corruption and other issues.
> I was not modifying any LVM volumes, so I did not run into any
> problems with LVM - but that is a boundary condition to keep in mind.
> It doesn't necessarily need to fully work under such conditions, but
> it needs to do something sane.
On SAS or NL-SAS drives?
I've seen this before on SATA drives, and it's probably the single
biggest reason why I have a major dislike for deploying SATA drives in
any kind of high-reliability environment.
Regardless, we do monitor all drives using smartd, and it *usually*
picks up a bad drive before it gets to the above point of pain. With
SATA drives this isn't always the case, though: the drive will simply
keep retrying indefinitely, tying up request slots (tags) over time
(the SATA protocol handles 32 outstanding requests IIRC - the question
being after how long the Linux kernel may re-use a tag it never
received a response on) and getting slower and slower until you power
cycle the drive, after which it's fine again for a while. Never had
that crap with NL-SAS drives.
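
(For what it's worth, both halves of that are observable from the host
side. A rough sketch, assuming /dev/sda and a stock smartmontools
install:

  # NCQ queue depth the kernel is currently using for the disk
  # (31 or 32 on SATA)
  cat /sys/block/sda/device/queue_depth

  # /etc/smartd.conf: scan for all drives, monitor everything, run a
  # short self-test daily at 02:00, mail root on trouble
  DEVICESCAN -a -o on -S on -s (S/../.././02) -m root

The DEVICESCAN line follows the stock smartd.conf(5) example rather
than our production config.)
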
The specific host in question is all NL-SAS. No VM involvement here.
Kind regards,
Jaco