Re: [RFC v2 1/6] scsi: xarray hctl

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Tue, 26 May 2020 15:10:08 -0400

On 2020-05-26 1:53 p.m., Hannes Reinecke wrote:
On 5/26/20 4:27 PM, Matthew Wilcox wrote:
On Tue, May 26, 2020 at 08:21:26AM +0200, Hannes Reinecke wrote:
On 5/25/20 7:40 PM, Matthew Wilcox wrote:
You aren't the first person to ask about having a 64-bit lookup on
32-bit machines.  Indeed, I remember Hannes asking for a 256 bit lookup
at LinuxCon in Prague.  I have always been reluctant to add such a thing
because the XArray uses quite a naive data type underneath.  It works well with
dense arrays but becomes very wasteful of memory for sparse arrays.

My understanding of SCSI-world is that most devices have a single
LUN 0.  Most devices that have multiple LUNs number them sequentially.
Some devices have either an internal structure or essentially pick random
LUNs for the devices they expose.

Not quite. You are correct that most devices have a single LUN 0
(think of libata :-), but those with several LUNs typically
enumerate them. In most cases the enumeration starts at 0 (or 1,
if LUN 0 is used for a management LUN), and reaches up to 256.
Some arrays use a different LUN layout, which means that the top
two bit of the LUN number are set, and possibly some intermediate
numbers, too. But the LUNs themselves are numbered consecutively, too;
it's just at a certain offset.
I've never seen anyone picking LUN numbers at random.

Ah, OK.  I think for these arrays you'd be better off accepting the cost
of an extra 4 bytes in the struct scsi_device rather than the cost of
storing the scsi_device at the LUN.

Let's just work an example where you have a 64-bit LUN with 4 ranges,
each of 64 entries (this is almost a best-case scenario for the XArray).
[0,63], 2^62+[0,63], 2^63+[0,63], 2^63+2^62+[0,63].

If we store them sequentially in an allocating XArray, we take up 256 *
4 bytes = 1kB extra space in the scsi_device.  The XArray will allocate
four nodes plus one node to hold the four nodes, which is 5 * 576 bytes
(2780 bytes) for a total of 3804 bytes.

Storing them in at their LUN will allocate a top level node which covers
bits 60-66, then four nodes, each covering bits of 54-59, another four
nodes covering bits 48-53, four nodes for 42-47, ...  I make it 41 nodes,
coming to 23616 bytes.  And the pointer chase to get to each LUN is
ten deep.  It'll mostly be cached, but still ...

Which is my worry, too.
In the end we're having a massively large array space (128bit if we take the 
numbers as they stand today), of which only a _tiny_ fraction is actually 
allocated.
We can try to reduce the array space by eg. restricting channels and targets to 
be 16 bits, and the LUN to be 32 bits.
But then we'll be having a hard time arguing; "Hey, we have a cool new feature, 
which has a really nice interface, but we can't support the same set of devices 
as we have now...".
That surely will go down well.

Maybe one should look at using an rbtree directly; _that_ could be made to work 
with an 128bit index, and lookup should be fast, too.
Let's see.

But still, the original question still stands: what would be the most
efficient way using xarrays here?
We have a four-level hierarchy Host:Channel:Target:LUN
and we need to lookup devices (and, occasinally, targets) per host.
At this time, 'channel' and 'target' are unsigned integer, and
LUNs are 64 bit.

It certainly seems sensible to me to have a per-host allocating XArray
to store the targets that belong to that host.  I imagine you also want
a per-target XArray for the LUNs that belong to that target.  Do you
also want a per-host XArray to store the LUNs so you can iterate all
LUNs per host as a single lookup rather than indirecting through the
target Xarray?  That's a design decision for you to make.

Seeing that I'm trying to use the existing HCTL number as index it's quite hard 
to have a per-host lookup structure, as this would inevitably
require a separate LUN index somewhere.
Which would mean to have a two-level xarray structure, where the first xarray 
hold the scsi targets, and the second one the LUNs per target.
Not super-nice, but doable.

Still thinking about the rbtree idea, though.

You already have one, and it handles its own locking. It is called xarray :-)

Circling back to where I started: xarrays that allocate their own indexes.

Apart from the LUN ***, the other indexes tend to be monotonic increasing, most
of the time, correct Hannes? So thinking about an xarray in a shost object
with pointers to starget objects. At run time if say a target disappears,
often another appears either with the same target_id or a larger target_id,
perhaps larger than any previously issued id. So with careful insertions
(xa_alloc() will allow us to precisely position replacements) we can maintain
that monotonic ascending order. As I showed in patch 6/6 we can easily
track the number of elements in each xarray, lets call it 'n'. For
n < 12 (say) we don't care and just do O(n) iterator searches. For n >= 12
we check if the xarray is in ascending order, if it isn't, we sort it by
the index we want (e.g. channel_id,target_id). For elements that we
change positions of, in the xarray, we need to change the variable I called 
'sh_idx' (i.e. the xarray index) in the corresponding starget object.

Why do this, to do a binary search when n >= 12, of course. Lookup time
is then O(ln(n)) with only around ln(n) cache-unfriendly visits to starget
objects.

And that sort can be much better than O(n.ln(n)) since at the point of an
insertion, if the xarray was known to be sorted before the insertion,
then we only need as single iteration through the xarray to place the
new element in the correct position. After that it would be handy to
xa_swap() function that gave us the new xarray index. Making sure the
xarray indexes don't become too sparse might also be addressed in the
sort.

Question to Matthew:
Will xa_find(&xarray, &idx, full_range, 0)
return NULL _and_ place the index of the lowest "hole" in that xarray
(or 1 beyond the upper limit of the range) in 'idx' ?

Note to others: the last argument in a typical xa_find() call
is either XA_PRESENT or some other mark the app is using, not 0.

Doug Gilbert

*** the SCSI REPORT LUNS command doesn't say anything about the _ordering_
    of LUNs or even if it is stable from one invocation to the next.
    There is nothing to stop the scsi_scan.c code from sorting the LUN
    inventory before applying it to the ML object tree.