On 04/30/2017 06:29 AM, Xiubo Li wrote: > [...] >>> +static struct page *tcmu_try_get_block_page(struct tcmu_dev *udev, >>> uint32_t dbi) >>> +{ >>> + struct page *page; >>> + int ret; >>> + >>> + mutex_lock(&udev->cmdr_lock); >>> + page = tcmu_get_block_page(udev, dbi); >>> + if (likely(page)) { >>> + mutex_unlock(&udev->cmdr_lock); >>> + return page; >>> + } >>> + >>> + /* >>> + * Normally it shouldn't be here: >>> + * Only when the userspace has touched the blocks which >>> + * are out of the tcmu_cmd's data iov[], and will return >>> + * one zeroed page. >> >> Is it a userspace bug when this happens? Do you know when it is >> occcuring? > Since the UIO will map the whole ring buffer to the user space at the > beginning, and the userspace is allowed and legal to access any block > within the limits of the mapped ring area. > > But actually when this happens, it normally will be one bug of the > userspace. Without this checking the kernel will output many page fault > dump traces. > > Maybe here outputing some warning message is a good idea, and will be > easy to debug for userspace. Yeah. > > > [...] >>> @@ -1388,6 +1509,81 @@ static ssize_t tcmu_cmd_time_out_store(struct >>> config_item *item, const char *pag >>> .tb_dev_attrib_attrs = NULL, >>> }; >>> +static int unmap_thread_fn(void *data) >>> +{ >>> + struct tcmu_dev *udev; >>> + loff_t off; >>> + uint32_t start, end, block; >>> + struct page *page; >>> + int i; >>> + >>> + while (1) { >>> + DEFINE_WAIT(__wait); >>> + >>> + prepare_to_wait(&unmap_wait, &__wait, TASK_INTERRUPTIBLE); >>> + schedule(); >>> + finish_wait(&unmap_wait, &__wait); >>> + >>> + mutex_lock(&root_udev_mutex); >>> + list_for_each_entry(udev, &root_udev, node) { >>> + mutex_lock(&udev->cmdr_lock); >>> + >>> + /* Try to complete the finished commands first */ >>> + tcmu_handle_completions(udev); >>> + >>> + /* Skip the udevs waiting the global pool or in idle */ >>> + if (udev->waiting_global || !udev->dbi_thresh) { >>> + mutex_unlock(&udev->cmdr_lock); >>> + continue; >>> + } >>> + >>> + end = udev->dbi_max + 1; >>> + block = find_last_bit(udev->data_bitmap, end); >>> + if (block == udev->dbi_max) { >>> + /* >>> + * The last bit is dbi_max, so there is >>> + * no need to shrink any blocks. >>> + */ >>> + mutex_unlock(&udev->cmdr_lock); >>> + continue; >>> + } else if (block == end) { >>> + /* The current udev will goto idle state */ >>> + udev->dbi_thresh = start = 0; >>> + udev->dbi_max = 0; >>> + } else { >>> + udev->dbi_thresh = start = block + 1; >>> + udev->dbi_max = block; >>> + } >>> + >>> + /* Here will truncate the data area from off */ >>> + off = udev->data_off + start * DATA_BLOCK_SIZE; >>> + unmap_mapping_range(udev->inode->i_mapping, off, 0, 1); >>> + >>> + /* Release the block pages */ >>> + for (i = start; i < end; i++) { >>> + page = radix_tree_delete(&udev->data_blocks, i); >>> + if (page) { >>> + __free_page(page); >>> + atomic_dec(&global_db_count); >>> + } >>> + } >>> + mutex_unlock(&udev->cmdr_lock); >>> + } >>> + >>> + /* >>> + * Try to wake up the udevs who are waiting >>> + * for the global data pool. >>> + */ >>> + list_for_each_entry(udev, &root_udev, node) { >>> + if (udev->waiting_global) >>> + wake_up(&udev->wait_cmdr); >>> + } >> >> To avoid starvation, I think you want a second list/fifo that holds the >> watiers. In tcmu_get_empty_block if the list is not empty, record how >> many pages we needed and then add the device to the list and wait in >> tcmu_queue_cmd_ring. >> >> Above if we freed enough pages for the device at head then wake up the >> device. >> >> I think you also need a wake_up call in the completion path in case the >> initial call could not free enough pages. It could probably check if the >> completion was going to free enough pages for a waiter and then call >> wake. >> > Yes, I meant to introduce this later after this series to not let the > patches too > complex to review. > > If you agree I will do this later, or in V7 series ? Yeah, I am ok with adding it after the initial patches go in.