At very high diagchar logging rates (a HSPA+ ftp download with many instantaneous diag log response messages enabled) on quad-core phones, it was observed that after about 2 to 20 minutes (depends on logging rate) - the userspace read to /dev/diag would block, causing a freeze effect to the userspace program - and further attempts to log would randomly freeze at an earlier stage then a phone restart was required to do any meaningful logging. You can test this with "diag_mdlog", use a diag.cfg with high logging rate log_response_messages like 0x4222, 0x4179 and other fast logs. The patched kernel was tested on a "Google Nexus 4 with Android 4.2.2" (mako) phone - and had no more freezes even after 13 Hours of logging diag for over-night ftp tests, an older sister-kernel (essentially the same - but with many more pr_debug() to identify the causes) was also tested sucessfully on two other "Google Nexus 4" phones, one of them logged successfully through about 10 hours of fast ftp downloads - no freezes too. This fix consists of 4 main parts: 1. There were cases that diag_device_write() completed the "enque" and sets the data_ready flag during an ongoing diagchar_read() which would then clear the flag on completion - resulting in a lost data_ready notification and diagchar_read() would block by waiting for the lost wakeup. The extra flag (array) "data_ready_while_busy" was introduced to handle that situation. 2. It was observed that zero length buffers in diag_hsic_read_complete_callback() was causing problems too - sending zero length entries to the hsic_buf_tbl queue could cause it to never be properly dequeued and never "diagmem_free()". This is because diagfwd.c's diag_device_write() would successfully "enqueue" it but then diagchar_core.c's diagchar_read() would omit it because "if (driver->hsic_buf_tbl[i].length > 0)". So it won't be "dequeued" and also diagmem_free would not be called on it. This is fixed by not allowing zero-length entries into diag_device_write() - and then calling diagmem_free() on it instead. 3. Consecutive writes (try 10+ conf writes before reading) to diagchar from userspace would block, although normal steps and ioctl were done. This was because the diagchar_write() was blocked in wait_event_interruptible() but the diag_hsic_write_complete_callback() didn't wake it up. I simply added a wake_up_interruptible() in that callback and now the consecutive writes complete sucessfully. 4. At very high logging rates (on the Nexus 4 quad-core phone), it was observed that "count_hsic_write_pool" faced read/write race conditions (to increment and decrement) and became incorrect - even negative - and "diag_read_hsic_work_fn" would use it to "determine the current number of available buffers for writing after reading from the HSIC has completed" - so the code was modified to determine the number of entries available by iterating through "hsic_buf_tbl" instead. Also, added a "driver->num_clients--" to diagchar_close() to update num_clints on close() from userspace. Signed-off-by: Kasidit Yusuf <kasidit@xxxxxxxxxxx> --- drivers/char/diag/diagchar.h | 1 + drivers/char/diag/diagchar_core.c | 21 ++++++++++- drivers/char/diag/diagfwd.c | 16 ++++++++ drivers/char/diag/diagfwd_hsic.c | 79 ++++++++++++++++++++++++++++++++------- 4 files changed, 103 insertions(+), 14 deletions(-) diff --git a/drivers/char/diag/diagchar.h b/drivers/char/diag/diagchar.h index 8d87cf9..4281912 100644 --- a/drivers/char/diag/diagchar.h +++ b/drivers/char/diag/diagchar.h @@ -142,6 +142,7 @@ struct diagchar_dev { wait_queue_head_t wait_q; struct diag_client_map *client_map; int *data_ready; + int *data_ready_while_busy; int num_clients; int polling_reg_flag; struct diag_write_device *buf_tbl; diff --git a/drivers/char/diag/diagchar_core.c b/drivers/char/diag/diagchar_core.c index 2c572ed..9132fcb 100644 --- a/drivers/char/diag/diagchar_core.c +++ b/drivers/char/diag/diagchar_core.c @@ -182,6 +182,13 @@ static int diagchar_open(struct inode *inode, struct file *file) goto fail; else driver->data_ready = temp; + temp = krealloc(driver->data_ready_while_busy + , (driver->num_clients) * sizeof(int), + GFP_KERNEL); + if (!temp) + goto fail; + else + driver->data_ready_while_busy = temp; diag_add_client(i, file); } else { mutex_unlock(&driver->diagchar_mutex); @@ -221,6 +228,8 @@ static int diagchar_close(struct inode *inode, struct file *file) struct diagchar_priv *diagpriv_data = file->private_data; pr_debug("diag: process exit %s\n", current->comm); + driver->num_clients--; + if (!(file->private_data)) { pr_alert("diag: Invalid file pointer"); return -ENOMEM; @@ -804,6 +813,7 @@ static int diagchar_read(struct file *file, char __user *buf, size_t count, wait_event_interruptible(driver->wait_q, driver->data_ready[index]); + driver->data_ready_while_busy[index] = 0; mutex_lock(&driver->diagchar_mutex); if ((driver->data_ready[index] & USER_SPACE_LOG_TYPE) && (driver-> @@ -981,7 +991,16 @@ drop_hsic: /* copy number of data fields */ COPY_USER_SPACE_OR_EXIT(buf+4, num_data, 4); ret -= 4; - driver->data_ready[index] ^= USER_SPACE_LOG_TYPE; + + /* At very high logging rates, it was observed that + * sometimes diag_device_write() completed the "enqueue" + * and sets the data_ready flag during an ongoing read + * in the lines above - so we should not clear the flag + * in these cases + */ + if (driver->data_ready_while_busy[index] == 0) + driver->data_ready[index] ^= USER_SPACE_LOG_TYPE; + if (driver->ch) queue_work(driver->diag_wq, &(driver->diag_read_smd_work)); diff --git a/drivers/char/diag/diagfwd.c b/drivers/char/diag/diagfwd.c index 4419309..7f12a3f 100644 --- a/drivers/char/diag/diagfwd.c +++ b/drivers/char/diag/diagfwd.c @@ -332,6 +332,12 @@ int diag_device_write(void *buf, int proc_num, struct diag_request *write_ptr) driver->logging_process_id) break; if (i < driver->num_clients) { + /* Set a "data_ready_while_busy" flag to + * prevent diagchar_read() from clearing new + * data_ready sets that came during its ongoing read. + */ + driver->data_ready_while_busy[i] |= USER_SPACE_LOG_TYPE; + driver->data_ready[i] |= USER_SPACE_LOG_TYPE; pr_debug("diag: wake up logging process\n"); wake_up_interruptible(&driver->wait_q); @@ -2056,6 +2062,14 @@ void diagfwd_init(void) , GFP_KERNEL)) == NULL) goto err; kmemleak_not_leak(driver->data_ready); + if (driver->data_ready_while_busy == NULL) { + driver->data_ready_while_busy = kzalloc(driver->num_clients + * sizeof(int) + , GFP_KERNEL); + if (driver->data_ready_while_busy == NULL) + goto err; + } + kmemleak_not_leak(driver->data_ready_while_busy); if (driver->table == NULL && (driver->table = kzalloc(diag_max_reg* sizeof(struct diag_master_table), @@ -2166,6 +2180,7 @@ err: kfree(driver->client_map); kfree(driver->buf_tbl); kfree(driver->data_ready); + kfree(driver->data_ready_while_busy); kfree(driver->table); kfree(driver->pkt_buf); kfree(driver->write_ptr_1); @@ -2217,6 +2232,7 @@ void diagfwd_exit(void) kfree(driver->client_map); kfree(driver->buf_tbl); kfree(driver->data_ready); + kfree(driver->data_ready_while_busy); kfree(driver->table); kfree(driver->pkt_buf); kfree(driver->write_ptr_1); diff --git a/drivers/char/diag/diagfwd_hsic.c b/drivers/char/diag/diagfwd_hsic.c index bd8c52b..94d8b9d 100644 --- a/drivers/char/diag/diagfwd_hsic.c +++ b/drivers/char/diag/diagfwd_hsic.c @@ -45,6 +45,7 @@ static void diag_read_hsic_work_fn(struct work_struct *work) int num_reads_submitted = 0; int err = 0; int write_ptrs_available; + int i; if (!driver->hsic_ch) { pr_err("DIAG in %s: driver->hsic_ch == 0\n", __func__); @@ -55,12 +56,30 @@ static void diag_read_hsic_work_fn(struct work_struct *work) * Determine the current number of available buffers for writing after * reading from the HSIC has completed. */ - if (driver->logging_mode == MEMORY_DEVICE_MODE) - write_ptrs_available = driver->poolsize_hsic_write - - driver->num_hsic_buf_tbl_entries; - else + if (driver->logging_mode == MEMORY_DEVICE_MODE) { + /* + * At very high logging rates on multi-core phones, + * it was observed that "count_hsic_write_pool" faced read/write + * race conditions (to increment and decrement) and became + * incorrect - so don't use it to determine the number of + * entries available - iterate through "hsic_buf_tbl" instead. + */ + write_ptrs_available = 0; + i = (driver->poolsize_hsic_write) - 1; + for (; i >= 0; i--) + if (driver->hsic_buf_tbl[i].length == 0) + write_ptrs_available++; + } else { write_ptrs_available = driver->poolsize_hsic_write - driver->count_hsic_write_pool; + } + + if (driver->logging_mode == MEMORY_DEVICE_MODE && + (write_ptrs_available <= 0 || + write_ptrs_available > driver->poolsize_hsic_write)) { + pr_debug("diag: warning - probe write_ptrs_available: %d" + , write_ptrs_available); + } /* * Queue up a read on the HSIC for all available buffers in the @@ -138,17 +157,50 @@ static void diag_hsic_read_complete_callback(void *ctxt, char *buf, if (!buf) { pr_err("diag: Out of diagmem for HSIC\n"); } else { - /* - * Send data in buf to be written on the - * appropriate device, e.g. USB MDM channel + + /* Contradicting with the original comment above to send + * zero length entries, sending 0 length entries to the + * hsic_buf_tbl queue could cause it to never be + * properly dequeued and never "diagmem_free()". + * + * This is because diagfwd.c's diag_device_write() would + * successfully "enqueue" it but then diagchar_core.c's + * diagchar_read() would omit it - + * because "if (driver->hsic_buf_tbl[i].length > 0)"... + * So it won't be "dequeued" and also diagmem_free would + * not be called on it. + * + * Also, other diag_device_write() calls with non-0 + * length entries would also overwrite this 0-length + * entry because of the + * "if (driver->hsic_buf_tbl[i].length == 0)" check. + * + * Threrefore, if the actual_size is > 0, send it, + * but if the size is 0 then call diagmem_free + * on it and don't send it. */ - driver->write_len_mdm = actual_size; - err = diag_device_write((void *)buf, HSIC_DATA, NULL); - /* If an error, return buffer to the pool */ - if (err) { + + if (actual_size > 0) { + /* + * Send data in buf to be written on the + * appropriate device, e.g. USB MDM channel + */ + driver->write_len_mdm = actual_size; + err = diag_device_write((void *)buf, + HSIC_DATA, NULL); + /* If an error, return buffer to the pool */ + if (err) { + diagmem_free(driver, buf, + POOL_TYPE_HSIC); + pr_err("diag: In %s, error calling diag_device_write, err: %d\n", + __func__, err); + } + } else { + /* if we reach here this means actual_size is 0 + * then free and omit it. + */ + err = 0; diagmem_free(driver, buf, POOL_TYPE_HSIC); - pr_err("diag: In %s, error calling diag_device_write, err: %d\n", - __func__, err); } } } else { @@ -179,6 +231,7 @@ static void diag_hsic_write_complete_callback(void *ctxt, char *buf, { /* The write of the data to the HSIC bridge is complete */ driver->in_busy_hsic_write = 0; + wake_up_interruptible(&driver->wait_q); if (!driver->hsic_ch) { pr_err("DIAG in %s: driver->hsic_ch == 0\n", __func__); -- 1.8.1.5 -- To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html