The cache management is mainly derived by 2 types of events: the device hints on one hand, and read & write operations on the other. On host management mode, the host may choose not to follow those hints, but we will. Both regions and subregions can be either active or inactive. We will monitor the lun’s regions/subregions state using a designated worker, which will gets scheduled on each event. The region state is derived from the state of its subregions: it becomes active once its first subregion becomes active, and inactive with its last. The device may send to the host an activation signal for one of its inactive subregions, provided that the region is active. For an inactive region such activation trial can only be a result of a read request. Next we try to allocate the required pages to that subregion and update its L2P mapping by sending HPB_READ_BUFFER command. Upon successful completion the subregion becoms active. A Third option in which we update the subregion's L2P cache, is when the device may send an update signal to the host for an active subregion. This is because the host is not entirely aware of the device’s in-house activities (GC, etc.). We distinguish such notifications to an active subregion as an “update” event instead of “activate” event to an inactive subregion, but both derived from the same device’s activation recommendation. In that case, the subregion change its state to inactive, but does not release its pages, nor the region changes its active count. Upon successful completion of HPB_READ_BUFFER the subregion returns to be active, otherwise become inactive. A subregion can become inactive if its writes count (dirty) exceeds a given threshold. In order not to hang on to “cold” subregions, we shall inactivate a subregion that has no READ access for a predefined amount of time. HPB_READ_BUFFER is used to update the subregion's L2P cache. Upon its successful completion the subregion is considered to be active. In the allocation length field of the HPB_READ_BUFFER command, we fill entries_per_subregion x 8bytes, regardless of the actual subregion size. For LUNs that are not subregion aligned, the actual subregion size of the last subregion might be smaller. However, as there can be no valid LBA that exceeds the valid part of that subregion, and which shall not be accessed, it suffices to maintain a constant allocation length. One last implementation point is calling scsi_execute with command anon to SCSI still works for older kernels. This is because this flow trust its users to fill the CDB unconditionally, and the SCSI command setup is bypassed altogether. This will no longer work, as all of this code was removed last year in commit a1ce35fa4985 ("block: remove dead elevator code"). We will most likely need to change this in our next versions. Region eviction refer to the act of activating a region on the expense of inactivating another region. This can only occur when active regions count is at its max, and the system might benefit from activating another region. This act is a stranger to our basic concept that region activation is derived from its subregion activation, but entails some possible overall gain. On region eviction, the leaving region is deactivated, and the entering region is becoming active by activating the subregion that cause the eviction trial. The region eviction work is triggered by the same conditions of subregion activation – if a subregion belongs to an inactive region and while all other subregion activation restrictions applies. Evicting an active region is an extreme operation, so the threshold for both entering and exiting region are quite high. The entering region reads should cross the EVICTION_READS threshold, and the exiting region should have less than half of that amount of reads. We are aware that there can be many contender subregions trying to evict someone simultaneously. Thus we are risking by cyclic-eviction: evicting a region that just entered by evicting another region. That’s fine, and we don’t want to interfere this process – just let it play, as “stronger” regions enter, the eviction process is bound to converge. last but not least - the entering subrigion should be clean, so we can benefit HPB_READing it, at least in the short term. Signed-off-by: Avri Altman <avri.altman@xxxxxxx> --- drivers/scsi/device_handler/scsi_dh_ufshpb.c | 521 ++++++++++++++++++++++++++- 1 file changed, 520 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/device_handler/scsi_dh_ufshpb.c b/drivers/scsi/device_handler/scsi_dh_ufshpb.c index 9b9b338..c94a616 100644 --- a/drivers/scsi/device_handler/scsi_dh_ufshpb.c +++ b/drivers/scsi/device_handler/scsi_dh_ufshpb.c @@ -7,6 +7,7 @@ * Copyright (C) 2020 Western Digital Corporation or its affiliates. */ +#include <asm/unaligned.h> #include <linux/slab.h> #include <linux/module.h> #include <scsi/scsi.h> @@ -17,6 +18,13 @@ #define UFSHPB_NAME "ufshpb" +#define UFSHPB_WRITE_BUFFER (0xfa) +#define WRITE_BUFFER_TIMEOUT (3 * HZ) +#define WRITE_BUFFER_RETRIES (3) +#define UFSHPB_READ_BUFFER (0xf9) +#define READ_BUFFER_TIMEOUT (3 * HZ) +#define READ_BUFFER_RETRIES (3) + #define HPB_ENTRY_SIZE (8) /* subregion activation threshold - % of subregion size */ @@ -34,6 +42,32 @@ /* region eviction threshold: 3 x subregions_per_region x MIN_READS */ #define EVICTION_READS (3 * ACTIVATION_READS) +#define to_subregion() (container_of(work, struct ufshpb_subregion, hpb_work)) + +enum ufshpb_state { + HPB_STATE_INACTIVE, + HPB_STATE_ACTIVE, +}; + +/** + * ufshpb_subregion_events + * @SUBREGION_EVENT_ACTIVATE - device signal to a non-active subregion + * @SUBREGION_EVENT_UPDATE - device signal to an active subregion + * @SUBREGION_EVENT_READ - a new read request to a non-active subregion + * @SUBREGION_EVENT_DIRTY_THRESHOLD - a dirty threshold crossed + * @SUBREGION_EVENT_TIMER - timer since last read expired + */ +enum ufshpb_subregion_events { + SUBREGION_EVENT_ACTIVATE, + SUBREGION_EVENT_UPDATE, + SUBREGION_EVENT_READ, + SUBREGION_EVENT_DIRTY_THRESHOLD, + SUBREGION_EVENT_TIMER, + + /* keep last */ + SUBREGION_EVENT_MAX +}; + /** * struct map_ctx - a mapping context * @pages - array of pages @@ -55,6 +89,9 @@ struct ufshpb_map_ctx { * @read_timeout_expiries - to rewind the read_timer for "clean" subregion * @region - region index * @subregion - subregion index + * @state - one of @ufshpb_state + * @event - one of @ufshpb_subregion_events + * @hpb_work - worker to implement state machine */ struct ufshpb_subregion { struct ufshpb_map_ctx *mctx; @@ -66,6 +103,11 @@ struct ufshpb_subregion { atomic_t read_timeout_expiries; unsigned int region; unsigned int subregion; + + unsigned state:1; + unsigned long event; + struct mutex state_lock; + struct work_struct hpb_work; }; /** @@ -76,6 +118,7 @@ struct ufshpb_subregion { * @writes - sum over subregions @writes * @region - region index * @active_subregions - actual active subregions + * @evicted - to indicated if this region is currently being evicted */ struct ufshpb_region { struct ufshpb_subregion *subregion_tbl; @@ -85,6 +128,7 @@ struct ufshpb_region { unsigned int region; atomic_t active_subregions; + atomic_t evicted; }; /** @@ -93,6 +137,7 @@ struct ufshpb_region { * @lh_map_ctx - list head of mapping context * @map_list_lock - to protect mapping context list operations * @region_tbl - regions/subregions table + * @pinned_map - to mark pinned regions * @sdev - scsi device for that lun * @regions_per_lun * @subregions_per_lun - lun size is not guaranteed to be region aligned @@ -105,6 +150,7 @@ struct ufshpb_dh_lun { struct list_head lh_map_ctx; spinlock_t map_list_lock; struct ufshpb_region *region_tbl; + unsigned long *pinned_map; struct scsi_device *sdev; unsigned int regions_per_lun; @@ -113,6 +159,10 @@ struct ufshpb_dh_lun { unsigned int max_active_regions; atomic_t active_regions; + + struct mutex eviction_lock; + + struct workqueue_struct *wq; }; struct ufshpb_dh_data { @@ -134,6 +184,440 @@ static unsigned long region_size; static unsigned long subregion_size; static unsigned int subregions_per_region; +static inline void ufshpb_set_write_buf_cmd(unsigned char *cmd, + unsigned int region) +{ + cmd[0] = UFSHPB_WRITE_BUFFER; + cmd[1] = 0x01; + put_unaligned_be16(region, &cmd[2]); +} + +static int ufshpb_submit_write_buf_cmd(struct scsi_device *sdev, + unsigned int region) +{ + unsigned char cmd[10] = {}; + struct scsi_sense_hdr sshdr = {}; + u64 flags = REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | + REQ_FAILFAST_DRIVER; + int timeout = WRITE_BUFFER_TIMEOUT; + int cmd_retries = WRITE_BUFFER_RETRIES; + int ret = 0; + + ufshpb_set_write_buf_cmd(cmd, region); + + ret = scsi_execute(sdev, cmd, DMA_NONE, NULL, 0, NULL, &sshdr, + timeout, cmd_retries, flags, 0, NULL); + + /* HPB spec does not define any error handling */ + sdev_printk(KERN_INFO, sdev, "%s: WRITE_BUFFER %s result %d\n", + UFSHPB_NAME, ret ? "failed" : "succeeded", ret); + + return ret; +} + +static void ufshpb_region_inactivate(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, bool reset) +{ + if (!reset) + ufshpb_submit_write_buf_cmd(hpb->sdev, r->region); + + atomic_dec(&hpb->active_regions); +} + +static void __subregion_inactivate(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, + struct ufshpb_subregion *s) +{ + if (!s->mctx) + return; + + if (s->mctx->pages) { + free_pages((unsigned long)s->mctx->pages, order); + s->mctx->pages = NULL; + } + + spin_lock(&hpb->map_list_lock); + list_add(&s->mctx->list, &hpb->lh_map_ctx); + spin_unlock(&hpb->map_list_lock); + s->mctx = NULL; +} + +static void ufshpb_subregion_inactivate(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, + struct ufshpb_subregion *s, bool reset) +{ + lockdep_assert_held(&s->state_lock); + + if (s->state == HPB_STATE_INACTIVE) + return; + + if (test_bit(r->region, hpb->pinned_map)) + return; + + __subregion_inactivate(hpb, r, s); + + s->state = HPB_STATE_INACTIVE; + + if (atomic_dec_and_test(&r->active_subregions)) + ufshpb_region_inactivate(hpb, r, reset); +} + +static void ufshpb_set_read_buf_cmd(unsigned char *cmd, unsigned int region, + unsigned int subregion, + unsigned int alloc_len) +{ + cmd[0] = UFSHPB_READ_BUFFER; + cmd[1] = 0x01; + put_unaligned_be16(region, &cmd[2]); + put_unaligned_be16(subregion, &cmd[4]); + + cmd[6] = alloc_len >> 16; + cmd[7] = (alloc_len >> 8) & 0xff; + cmd[8] = alloc_len & 0xff; + cmd[9] = 0x00; +} + +static bool ufshpb_read_buf_cmd_need_retry(struct ufshpb_dh_lun *hpb, + struct scsi_device *sdev, + struct scsi_sense_hdr *sshdr, + int result) +{ + if (atomic_read(&hpb->active_regions) >= hpb->max_active_regions) + return false; + + /* UFS3.1 spec says: "... + * If the device is performing internal maintenance work, for example + * Garbage collection, the device may terminate the HPB READ BUFFER + * command by sending RESPONSE UPIU with CHECK CONDITION status, with + * the SENSE KEY set to ILLEGAL REQUEST, and the additional sense code + * set OPERATION IN PROGRESS. In this case, the host can retry the HPB + * READ BUFFER command..." + */ + if (scsi_sense_valid(sshdr) && + status_byte(result) == CHECK_CONDITION && + sshdr->sense_key == ILLEGAL_REQUEST && sshdr->asc == 0x00 + && sshdr->ascq == 0x16) + return true; + + return false; +} + +static int ufshpb_submit_read_buf_cmd(struct ufshpb_dh_lun *hpb, + struct ufshpb_subregion *s) +{ + struct scsi_device *sdev = hpb->sdev; + unsigned char cmd[10] = {}; + struct scsi_sense_hdr sshdr = {}; + char *buf = s->mctx->pages; + unsigned int alloc_len = entries_per_subregion * HPB_ENTRY_SIZE; + int retries = READ_BUFFER_RETRIES; + int ret; + + ufshpb_set_read_buf_cmd(cmd, s->region, s->subregion, alloc_len); + + do { + u64 flags = REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | + REQ_FAILFAST_DRIVER; + unsigned int buflen = alloc_len; + + memset(&sshdr, 0, sizeof(sshdr)); + memset(buf, 0, buflen); + ret = scsi_execute(sdev, cmd, DMA_FROM_DEVICE, buf, buflen, + NULL, &sshdr, READ_BUFFER_TIMEOUT, + READ_BUFFER_RETRIES, flags, 0, NULL); + + if (!ret) + return ret; + + } while (ufshpb_read_buf_cmd_need_retry(hpb, sdev, &sshdr, ret) && + retries--); + + return ret; +} + +static int ufshpb_subregion_alloc_pages(struct ufshpb_dh_lun *hpb, + struct ufshpb_subregion *s) +{ + struct ufshpb_map_ctx *mctx; + + spin_lock(&hpb->map_list_lock); + mctx = list_first_entry_or_null(&hpb->lh_map_ctx, + struct ufshpb_map_ctx, list); + if (!mctx) { + spin_unlock(&hpb->map_list_lock); + return -EINVAL; + } + + list_del_init(&mctx->list); + spin_unlock(&hpb->map_list_lock); + + s->mctx = mctx; + mctx->pages = (char *)__get_free_pages(GFP_KERNEL, order); + if (!mctx->pages) + return -ENOMEM; + + return 0; +} + +static int __subregion_activate(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, + struct ufshpb_subregion *s, bool alloc_pages) +{ + int ret; + + if (alloc_pages) { + ret = ufshpb_subregion_alloc_pages(hpb, s); + if (ret) + goto out; + } + + if (!s->mctx || !s->mctx->pages) { + ret = -EINVAL; + goto out; + } + + ret = ufshpb_submit_read_buf_cmd(hpb, s); + if (ret) + goto out; + + atomic64_sub(atomic64_read(&s->writes), &r->writes); + atomic64_set(&s->writes, 0); + +out: + if (ret) + __subregion_inactivate(hpb, r, s); + + return ret; +} + +static inline bool ufshpb_can_send_read_buffer(struct ufshpb_subregion *s) +{ + unsigned int activation_threshold = READS_READ_BUFFER_SENT * + entries_per_subregion / 100; + + return atomic64_read(&s->reads) - atomic64_read(&s->reads_last_trial) + >= activation_threshold; +} + +static int ufshpb_subregion_activate(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, + struct ufshpb_subregion *s, bool init) +{ + int ret; + + lockdep_assert_held(&s->state_lock); + + if (s->state == HPB_STATE_ACTIVE || atomic_read(&r->evicted)) + return 0; + + /* do not flood the driver with HPB_READ_BUFFER commands */ + if (!init && !ufshpb_can_send_read_buffer(s)) + return 0; + + /* if this is the first subregion to become active */ + if (atomic_inc_return(&r->active_subregions) == 1) { + /* check that max_active_regions is not exceeded */ + if (atomic_inc_return(&hpb->active_regions) > + hpb->max_active_regions) { + atomic_dec(&hpb->active_regions); + atomic_dec(&r->active_subregions); + return 0; + } + } + + ret = __subregion_activate(hpb, r, s, true); + atomic64_set(&s->reads_last_trial, atomic64_read(&s->reads)); + if (ret) { + if (atomic_dec_and_test(&r->active_subregions)) + atomic_dec(&hpb->active_regions); + return ret; + } + + s->state = HPB_STATE_ACTIVE; + + return 0; +} + +static struct ufshpb_region *__find_region_to_evict(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r) +{ + struct ufshpb_region *regions = hpb->region_tbl; + struct ufshpb_region *exiting_region = NULL; + unsigned int eviction_threshold; + unsigned int exit_threshold; + unsigned int lowest_reads; + int i; + + eviction_threshold = EVICTION_READS * entries_per_region / 100; + if (atomic64_read(&r->reads) < eviction_threshold) + goto out; + + exit_threshold = eviction_threshold >> 1; + lowest_reads = exit_threshold; + for (i = 0; i < hpb->regions_per_lun; i++) { + struct ufshpb_region *rr = regions + i; + unsigned int tmp_reads; + + if (r == rr) + continue; + + if (test_bit(rr->region, hpb->pinned_map)) + continue; + + if (!atomic_read(&rr->active_subregions)) + continue; + + tmp_reads = atomic64_read(&rr->reads); + if (tmp_reads < lowest_reads) { + lowest_reads = tmp_reads; + exiting_region = rr; + } + } + +out: + return exiting_region; +} + +static int __region_evict(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *exiting_region, bool reset) +{ + struct ufshpb_subregion *subregions = exiting_region->subregion_tbl; + int ret = 0; + int j; + + /* mark that region as being evicted */ + atomic_set(&exiting_region->evicted, 1); + + for (j = 0; j < subregions_per_region; j++) { + struct ufshpb_subregion *s = subregions + j; + + if (s->state != HPB_STATE_ACTIVE) + continue; + + mutex_lock(&s->state_lock); + ufshpb_subregion_inactivate(hpb, exiting_region, s, reset); + mutex_unlock(&s->state_lock); + } + + atomic_set(&exiting_region->evicted, 0); + + return ret; +} + +static int ufshpb_region_try_evict(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, + struct ufshpb_subregion *s) +{ + struct ufshpb_region *exiting_region = NULL; + + lockdep_assert_held(&s->state_lock); + + if (s->state == HPB_STATE_ACTIVE || + atomic64_read(&s->writes) > SET_AS_DIRTY) + return -EINVAL; + + exiting_region = __find_region_to_evict(hpb, r); + if (!exiting_region) + return -EINVAL; + + /* + * before inactivating the exiting region check again - some regions + * may become free, so just activate the entring subregion + */ + if (atomic_read(&hpb->active_regions) >= hpb->max_active_regions) + __region_evict(hpb, exiting_region, false); + + /* activate the entering subregion */ + ufshpb_subregion_activate(hpb, r, s, false); + + return 0; +} + +static void ufshpb_subregion_update(struct ufshpb_dh_lun *hpb, + struct ufshpb_region *r, + struct ufshpb_subregion *s, bool reset) +{ + lockdep_assert_held(&s->state_lock); + + if (s->state != HPB_STATE_ACTIVE) + return; + + /* bail out if the region is currently being evicted */ + if (atomic_read(&r->evicted)) + return; + + /* do not flood the driver with HPB_READ_BUFFER commands */ + if (!reset && !ufshpb_can_send_read_buffer(s)) + return; + + s->state = HPB_STATE_INACTIVE; + + if (__subregion_activate(hpb, r, s, false)) { + ufshpb_subregion_inactivate(hpb, r, s, false); + return; + } + + s->state = HPB_STATE_ACTIVE; +} + +/* + * ufshpb_dispatch - ufshpb state machine + * make the various decisions based on region/subregion state & events + */ +static void ufshpb_dispatch(struct ufshpb_dh_lun *hpb, struct ufshpb_region *r, + struct ufshpb_subregion *s) +{ + unsigned long e; + int i; + + lockdep_assert_held(&s->state_lock); + + e = find_first_bit(&s->event, SUBREGION_EVENT_MAX); + + switch (e) { + case SUBREGION_EVENT_READ: + if (atomic_read(&r->active_subregions) < 1 && + atomic_read(&hpb->active_regions) >= + hpb->max_active_regions) { + /* potential eviction may take place */ + mutex_lock(&hpb->eviction_lock); + ufshpb_region_try_evict(hpb, r, s); + mutex_unlock(&hpb->eviction_lock); + break; + } + /* fall through */ + case SUBREGION_EVENT_ACTIVATE: + ufshpb_subregion_activate(hpb, r, s, false); + break; + + case SUBREGION_EVENT_UPDATE: + ufshpb_subregion_update(hpb, r, s, false); + break; + + case SUBREGION_EVENT_DIRTY_THRESHOLD: + case SUBREGION_EVENT_TIMER: + ufshpb_subregion_inactivate(hpb, r, s, false); + break; + + default: + break; + } + + for (i = 0; i < SUBREGION_EVENT_MAX; i++) + clear_bit(i, &s->event); +} + +static void ufshpb_work_handler(struct work_struct *work) +{ + struct ufshpb_subregion *s = to_subregion(); + + mutex_lock(&s->state_lock); + + ufshpb_dispatch(s->hpb, s->r, s); + + mutex_unlock(&s->state_lock); +} + static int ufshpb_mempool_init(struct ufshpb_dh_lun *hpb) { unsigned int max_active_subregions = hpb->max_active_regions * @@ -192,6 +676,8 @@ static int ufshpb_region_tbl_init(struct ufshpb_dh_lun *hpb) s->r = r; s->region = i; s->subregion = j; + INIT_WORK(&s->hpb_work, ufshpb_work_handler); + mutex_init(&s->state_lock); } r->subregion_tbl = subregions; @@ -200,6 +686,7 @@ static int ufshpb_region_tbl_init(struct ufshpb_dh_lun *hpb) } hpb->region_tbl = regions; + mutex_init(&hpb->eviction_lock); return 0; @@ -223,6 +710,8 @@ static int ufshpb_lun_configuration(struct scsi_device *sdev, const struct ufshpb_lun_config *lun_conf = h->ufshpb_lun_conf; struct ufshpb_dh_lun *hpb = NULL; int ret; + struct workqueue_struct *wq; + char wq_name[sizeof("ufshpb_wq_00")] = {}; if (h->hpb) return -EBUSY; @@ -236,6 +725,13 @@ static int ufshpb_lun_configuration(struct scsi_device *sdev, hpb->subregions_per_lun = DIV_ROUND_UP(lun_conf->size, subregion_size); hpb->last_subregion_size = lun_conf->size % subregion_size; + hpb->pinned_map = kcalloc(BITS_TO_LONGS(hpb->regions_per_lun), + sizeof(unsigned long), GFP_KERNEL); + if (!hpb->pinned_map) { + ret = -ENOMEM; + goto out_free; + } + ret = ufshpb_mempool_init(hpb); if (ret) goto out_free; @@ -244,6 +740,15 @@ static int ufshpb_lun_configuration(struct scsi_device *sdev, if (ret) goto out_free; + snprintf(wq_name, ARRAY_SIZE(wq_name), "ufshpb_wq_%d", sdev->id); + wq = alloc_workqueue(wq_name, WQ_HIGHPRI, WQ_MAX_ACTIVE); + if (!wq) { + ret = -ENOMEM; + goto out_free; + } + + hpb->wq = wq; + INIT_LIST_HEAD(&hpb->list); list_add(&hpb->list, &lh_hpb_lunp); @@ -251,8 +756,10 @@ static int ufshpb_lun_configuration(struct scsi_device *sdev, h->hpb = hpb; out_free: - if (ret) + if (ret) { + kfree(hpb->pinned_map); kfree(hpb); + } return ret; } @@ -336,6 +843,14 @@ static void ufshpb_region_tbl_remove(struct ufshpb_dh_lun *hpb) for (i = 0 ; i < hpb->regions_per_lun; i++) { struct ufshpb_region *r = regions + i; + int j; + + for (j = 0; j < subregions_per_region; j++) { + struct ufshpb_subregion *s = + r->subregion_tbl + j; + + cancel_work_sync(&s->hpb_work); + } kfree(r->subregion_tbl); } @@ -355,6 +870,10 @@ static void ufshpb_detach(struct scsi_device *sdev) { struct ufshpb_dh_data *h = sdev->handler_data; + if (h->hpb && h->hpb->wq) { + flush_workqueue(h->hpb->wq); + destroy_workqueue(h->hpb->wq); + } ufshpb_mempool_remove(h->hpb); ufshpb_region_tbl_remove(h->hpb); kfree(sdev->handler_data); -- 2.7.4