You might try limiting udev's children to, say, 4 rather than the default. The
default seems to be based on some percentage of the total CPUs (and it seems to
change with the udev version), so if the CPU count is large, a large number of
children can be created very quickly. A sketch of the relevant settings is at
the end of this mail.

I have seen the udev children issue on machines with around 70-100 CPUs or more
and huge numbers of LUNs (hundreds to thousands), but it also seemed to get
much worse after a specific OS update (no idea which one that was). udev spawns
the processes so fast that the machine burns excessive CPU, both user and
system time, and depending on your SCSI layer timeouts the resulting system
time may be significant enough to cause SCSI commands to time out.

I have seen larger machines with a lot of LUNs create so many children that the
machine takes excessive time to boot (systemd times out failing to find the
VGs/LVs within 90 seconds and drops to emergency mode, and by the time we get
to the prompt everything is fine). When this happens we find that udev has
collected 80+ minutes of CPU time in the 90 seconds it takes to boot. Turning
the children count down resulted in much faster and more consistent boots.

I have not noticed a udev process getting an error or causing something like
what you are seeing, but if enough processes get spawned to cause excessive
system time, it could trip timeouts in the kernel subsystems that have them.
Running

  ps axuwwS | grep -i udev

will show you the total CPU time udev and all of its exited children collected
during boot. If that number is large relative to the short time it takes to
boot, there is massive contention.

On Wed, Nov 4, 2020 at 8:33 PM Brian Bunker <brian@xxxxxxxxxxxxxxx> wrote:
>
> Hello all,
>
> We have run into situations internally where SCSI devices end up in two
> multipath devices:
>
> 3624a9370c07877102b14369900011fd1 dm-9 PURE ,FlashArray
> size=500G features='0' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>   |- 0:0:0:6 sdeg 128:128 active ready running
>   |- 0:0:1:6 sdeq 129:32 active ready running
>   |- 0:0:2:6 sdfa 129:192 active ready running
>   |- 7:0:0:6 sdx 65:112 active ready running
>   |- 7:0:1:6 sdah 66:16 active ready running
>   |- 7:0:2:6 sdci 69:96 active ready running
>   |- 7:0:3:6 sdn 8:208 active ready running
>   |- 8:0:0:6 sdbb 67:80 active ready running
>   |- 8:0:1:6 sdbt 68:112 active ready running
>   |- 8:0:2:6 sdcx 70:80 active ready running
>   |- 8:0:3:6 sdar 66:176 active ready running
>   |- 9:0:0:6 sddj 71:16 active ready running
>   |- 9:0:1:6 sdcq 69:224 active ready running
>   |- 9:0:2:6 sddt 71:176 active ready running
>   |- 9:0:3:6 sdbo 68:32 active ready running
>   `- 0:0:3:6 sdg 8:96 active ready running
>
> And this one:
> SPURE_FlashArray_C07877102B14369900011FD1 dm-13 PURE ,FlashArray
> size=500G features='0' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>   `- 0:0:3:6 sdg 8:96 active ready running
>
> What this comes down to seems to lie with the scsi_id application included
> with udev. This runs out of a udev rule when SCSI devices are discovered.
> What happens seems to be a failure of the page 0x83 INQUIRY to get the
> serial number of the device for multipath to use.
>
> When the page 0x83 INQUIRY fails, instead of getting the expected
> 3624a9370c07877102b14369900011fd1, it falls back to a page 0x80 INQUIRY
> where it stitches together 'S' + vendor name + model name + LUN serial
> number, and we end up with SPURE_FlashArray_C07877102B14369900011FD1.
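For what it's worth, both lookups can be exercised by hand to see which name
comes back; the binary path below varies by distribution and the options assume
a stock udev scsi_id, so adjust for your version:

  /usr/lib/udev/scsi_id --whitelisted --page=0x83 --device=/dev/sdg
  /usr/lib/udev/scsi_id --whitelisted --page=0x80 --device=/dev/sdg

If the 0x83 lookup fails at the wrong moment, the 0x80 form is what produces
the S + vendor + model + serial style of name described above.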
>
> A quick look shows that the LUN serial number in both cases is the same,
> C07877102B14369900011FD1. We end up with two multipath devices since the
> name multipath uses is not the same. I can create the situation manually by
> rescanning at just the right time, forcing the page 0x83 INQUIRY to fail
> while manually running scsi_id. I assume this is what happens as the devices
> arrive and the udev rule runs based on the name. I am not exactly sure which
> part of the chain to blame for the creation of the two multipath devices.
> Multipath may just be the victim of udev here, but I expect this is not
> behavior multipath would ever want, since it is certain corruption when it
> happens.
>
> Any ideas or insight on what could keep us out of this situation would be
> appreciated.
>
> Thanks,
> Brian
>
> Brian Bunker
> SW Eng
> brian@xxxxxxxxxxxxxxx
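P.S. A sketch of capping the udev children count mentioned at the top; the
option names assume a reasonably recent systemd-udevd, so check the udevd man
page on your release before relying on them:

  In /etc/udev/udev.conf (persistent):
    children_max=4

  Or on the kernel command line (add rd.udev.children_max=4 as well if the
  slowdown happens while still in the initramfs):
    udev.children_max=4

After a reboot, the same ps axuwwS | grep -i udev check should show far less
accumulated CPU time.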