Re: [PATCHv4 0/6] scsi: use xarray for devices and targets

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Wed, 3 Jun 2020 14:23:55 -0400

On 2020-06-03 8:53 a.m., Christoph Hellwig wrote:
On Tue, Jun 02, 2020 at 01:33:05PM +0200, Hannes Reinecke wrote:
Hi all,

based on the ideas from Doug Gilbert here's now my take on using
xarrays for devices and targets.
It revolves around two ideas:

- The scsi target 'channel' and 'id' numbers are never ever used
   to the full 32 bit range; channels are well below 10, and no
   driver is using more than 16 bits for the id. So we can reduce
   the type of 'channel' and 'id' to 16 bits, and use the 32 bit
   value 'channel << 16 | id' as the index into the target xarray.
- Nearly every target only ever uses the first two levels of the
   4-level SCSI LUN structure, which means that we can use the
   linearized SCSI LUN id as an index into the xarray.
   If we ever come across targets utilizing more that 2 levels of
   the LUN structure we'll allocate the first unused index and have
   to resort to a less efficient lookup instead of direct indexing.

With these changes we can implement an efficient lookup mechanism,
devolving into direct lookup for most cases. It also allows us to
detect duplicate entries or accidental overwrites of existing elements
by using xa_cmpxchg().
And iteration over targets and devices should be as efficient as the
current, list-based, approach.

As usual, comments and reviews are welcome.

I see absolutely no argument for what the point of this series.  It adds
more code, and I don't really see any indications for it fixing bugs,
speeding up workloads, or reducing memory usage.

Lets take memory usage first. The legacy design (part of which may have
been a later add-on) has three collections where two are needed:
   1) all targets in a host
   2) all sdev_s in a target
   3) all sdev_s in a host

So the third one is redundant and now removed (together with the
complexity of making sure those 3 collections are always in sync, seen
from the users' viewpoint). Each doubly linked collection on 64
bit machines uses 16 bytes (2 eight byte pointers). So that is a
32 byte reduction in each sdev object. The proposed solution adds 0
bytes because it uses the LUN as an index which is already there.
Similar but smaller win in scsi_target objects.

There are also some locks and mutexes in the three level object
tree (host-target-sdev[LU]) that can probably be dispensed with
as xarrays come with their own locks. That has not been done yet
making both my earlier proposal and this one "overlocked". And
locks and mutexes take up space in objects and slow things down.

The speeding up will come in big machine startup and shutdown and
its reaction time to disruptions (e.g. cable disconnected to a disk
array) IMO. xarray and explicit parent pointers give us a faster
way to navigate up and down the object tree. With this patchset we
have an O(ln(n)) lookup in the downward direction where currently we
only have O(n). Very little use is made of the "lookup" functions in
the API because users could see that it was just an iteration
(i.e. O(n)). Hopefully transports will take advantage of faster
lookups and perhaps implement their own xarrays. Even the upward
navigation can be complicated by transports inserting levels between
the host and the target. This is what the SCSI mid-layer object tree
looks like moving upwards from a SAS SSD, connected to an
SAS expander, moving up to its host (a HBA):
    scsi_device, ptr=ffff99d23f513960
    scsi_target, ptr=ffff99d241595c28
    sas_rphy, ptr=ffff99d242519c00
    sas_port, ptr=ffff99d24251ec00
    sas_expander_device, ptr=ffff99d23f4c6438
    sas_port, ptr=ffff99d23f4c7400
    Scsi_Host, ptr=ffff99d2425261f8

There already is a scsi_device::host redundant pointer to bypass the
oft-called and slow-walking dev_to_shost(). I'm proposing another
redundant scsi_target::parent_shost pointer that will bypass seven
dev_to_shost() invocations.

Currently all iterations are done under the host_lock as that is
required for doubly linked list safety. xarray uses rcu read locks
on all non-modifying operations including iterations and if we can
safely rely on them, that will increase the available parallelism
within one host.

Finally the SCSI fast path will usually require the presence the
corresponding sdev object, preferably cached. So making it smaller
will help.

Doug Gilbert

P.S. I sidestepped the "bugs" issue. Surely we will add some but
it is hard to believe when you wade into the complexity of the
currently linked collections and their myriad of locks, that there
aren't subtle bugs in the existing code. I have been working with
xarrays for about 1 year and finding locking issues is easier
with xarrays compared to "roll your own" linked list locking, IMO.