Re: [PATCH v2 1/3] rust: block: introduce `kernel::block::mq` module

Andreas Hindborg <nmi@xxxxxxxxxxxx> · Sat, 01 Jun 2024 08:35:30 +0200

Benno Lossin <benno.lossin@xxxxxxxxx> writes:

[...]

>>>> +    /// Notify the block layer that the request has been completed without errors.
>>>> +    ///
>>>> +    /// This function will return `Err` if `this` is not the only `ARef`
>>>> +    /// referencing the request.
>>>> +    pub fn end_ok(this: ARef<Self>) -> Result<(), ARef<Self>> {
>>>
>>> I am not yet fully convinced that this is the way we should go. I think
>>> I would have to see a more complex usage of `Request` with that id <->
>>> request mapping that you mentioned. Because with how rnull uses this
>>> API, it could also have a `URef<Self>` parameter (URef := unique ARef).
>> 
>> I considered a `UniqueARef` but it would just move the error handing to
>> `ARef::into_unique` and then make `end_ok` infallible.
>> 
>> There are four states for a request that we need to track:
>> 
>> A) Request is owned by block layer (refcount 0)
>> B) Request is owned by driver but with zero `ARef`s in existence
>>    (refcount 1)
>> C) Request is owned by driver with exactly one `ARef` in existence
>>    (refcount 2)
>> D) Request is owned by driver with more than one `ARef` in existence
>>    (refcount > 2)
>> 
>> It is in the doc comments for `RequestDataWrapper` as well.
>> 
>> We need A and B to ensure we fail tag to request conversions for
>> requests that are not owned by the driver.
>> 
>> We need C and D to ensure that it is safe to end the request and hand back
>> ownership to the block layer.
>> 
>> I will ping you when I hook up the NVMe driver with this.
>
> Thanks. I think that since the C side doesn't use ref-counting, the
> lifecycle of a request is probably rather simple. Therefore we should
> try to also avoid refcounting in Rust and see if we can eg tie ending
> requests to the associated `TagSet` (ie require `&mut` on the tagset)
> and tie accessing requests to shared access to the `TagSet`. Then we
> would be able to avoid the refcount. But I will first have to take a
> look at the nvme driver to gauge the plausibility.

C side _does_ use ref-counting in the `ref` field of the C `struct
request`. I am not able to reuse that field for the state tracking I
need. Other users such as iostat will take references on the request and
we will not be able to tell if there are no more Rust refs to the
request from that field. We need a separate one.

Anyways I think we should go with the current implementation for now. We
can always change it, nothing is locked in stone.

[...]

>>>> +/// Store the result of `op(target.load())` in target, returning new value of
>>>> +/// taret.
>>>> +fn atomic_relaxed_op_return(target: &AtomicU64, op: impl Fn(u64) -> u64) -> u64 {
>>>> +    let mut old = target.load(Ordering::Relaxed);
>>>> +    loop {
>>>> +        match target.compare_exchange_weak(old, op(old), Ordering::Relaxed, Ordering::Relaxed) {
>>>> +            Ok(_) => break,
>>>> +            Err(x) => {
>>>> +                old = x;
>>>> +            }
>>>> +        }
>>>> +    }
>>>
>>> This seems like a reimplementation of `AtomicU64::fetch_update` to me.
>> 
>> It looks like it! Except this function is returning the updated value,
>> `fetch_update` is returning the old value.
>> 
>> Would you rather that I rewrite in terms of the library function?
>
> If you can just use the fetch_update function, then that would be better
> than (almost) reimplementing it. But if you really need to get the new
> value, then I guess it can't really be helped. (or do you think you can
> just apply `op` to the old value returned by `fetch_update`?)

I can implement `atomic_relaxed_op_return` in terms of `fetch_update` 👍

[...]

>>>> +                let place = place.cast::<bindings::blk_mq_tag_set>();
>>>> +
>>>> +                // SAFETY: try_ffi_init promises that `place` is writable, and
>>>> +                // zeroes is a valid bit pattern for this structure.
>>>> +                core::ptr::write_bytes(place, 0, 1);
>>>> +
>>>> +                /// For a raw pointer to a struct, write a struct field without
>>>> +                /// creating a reference to the field
>>>> +                macro_rules! write_ptr_field {
>>>> +                    ($target:ident, $field:ident, $value:expr) => {
>>>> +                        ::core::ptr::write(::core::ptr::addr_of_mut!((*$target).$field), $value)
>>>> +                    };
>>>> +                }
>>>> +
>>>> +                // SAFETY: try_ffi_init promises that `place` is writable
>>>> +                    write_ptr_field!(place, ops, OperationsVTable::<T>::build());
>>>> +                    write_ptr_field!(place, nr_hw_queues , nr_hw_queues);
>>>> +                    write_ptr_field!(place, timeout , 0); // 0 means default which is 30 * HZ in C
>>>> +                    write_ptr_field!(place, numa_node , bindings::NUMA_NO_NODE);
>>>> +                    write_ptr_field!(place, queue_depth , num_tags);
>>>> +                    write_ptr_field!(place, cmd_size , core::mem::size_of::<RequestDataWrapper>().try_into()?);
>>>> +                    write_ptr_field!(place, flags , bindings::BLK_MQ_F_SHOULD_MERGE);
>>>> +                    write_ptr_field!(place, driver_data , core::ptr::null_mut::<core::ffi::c_void>());
>>>> +                    write_ptr_field!(place, nr_maps , num_maps);
>>>
>>> I think that there is some way for pinned-init to do a better job here.
>>> I feel like we ought to be able to just write:
>>>
>>>     Opaque::init(
>>>         try_init!(bindings::blk_mq_tag_set {
>>>             ops: OperationsVTable::<T>::build(),
>>>             nr_hw_queues,
>>>             timeout: 0, // 0 means default, which is 30Hz
>>>             numa_node: bindings::NUMA_NO_NODE,
>>>             queue_depth: num_tags,
>>>             cmd_size: size_of::<RequestDataWrapper>().try_into()?,
>>>             flags: bindings::BLK_MQ_F_SHOULD_MERGE,
>>>             driver_data: null_mut(),
>>>             nr_maps: num_maps,
>>>             ..Zeroable::zeroed()
>>>         }? Error)
>>>         .chain(|tag_set| to_result(bindings::blk_mq_alloc_tag_set(tag_set)))
>>>     )
>>>
>>> But we don't have `Opaque::init` (shouldn't be difficult) and
>>> `bindings::blk_mq_tag_set` doesn't implement `Zeroable`. We would need
>>> bindgen to put `derive(Zeroable)` on certain structs...
>>>
>>> Another option would be to just list the fields explicitly, since there
>>> aren't that many. What do you think?
>> 
>> Both options sound good. Ofc the first one sounds more user friendly
>> while the latter one sounds easier to implement. Getting rid of the
>> unsafe blocks here would be really nice.
>
> I think since it is not that expensive in this case, you should go for
> the second approach.
> We can fix it later, when we get the proper bindgen support.

Cool, I will send a follow up patch with this 👍

Best regards,
Andreas