On Thu, May 19, 2022 at 10:27:27AM +1000, Dave Chinner wrote: > On Wed, May 18, 2022 at 11:55:13AM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > While running xfs/297 and generic/642, I noticed a crash in > > xfs_attri_item_relog when it tries to copy the attr name to the new > > xattri log item. I think what happened here was that we called > > ->iop_commit on the old attri item (which nulls out the pointers) as > > part of a log force at the same time that a chained attr operation was > > ongoing. The system was busy enough that at some later point, the defer > > ops operation decided it was necessary to relog the attri log item, but > > as we've detached the name buffer from the old attri log item, we can't > > copy it to the new one, and kaboom. > > > > I think there's a broader refcounting problem with LARP mode -- the > > setxattr code can return to userspace before the CIL actually formats > > and commits the log item, which results in a UAF bug. Therefore, the > > xattr log item needs to be able to retain a reference to the name and > > value buffers until the log items have completely cleared the log. > > Furthermore, each time we create an intent log item, we allocate new > > memory and (re)copy the contents; sharing here would be very useful. > > > > Solve the UAF and the unnecessary memory allocations by having the log > > code create a single refcounted buffer to contain the name and value > > contents. This buffer can be passed from old to new during a relog > > operation, and the logging code can (optionally) attach it to the > > xfs_attr_item for reuse when LARP mode is enabled. > > > > This also fixes a problem where the xfs_attri_log_item objects weren't > > being freed back to the same cache where they came from. > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_attr.h | 8 + > > fs/xfs/xfs_attr_item.c | 279 +++++++++++++++++++++++++++------------------- > > fs/xfs/xfs_attr_item.h | 13 +- > > 3 files changed, 182 insertions(+), 118 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h > > index 1af7abe29eef..17746dcc2268 100644 > > --- a/fs/xfs/libxfs/xfs_attr.h > > +++ b/fs/xfs/libxfs/xfs_attr.h > > @@ -501,6 +501,8 @@ enum xfs_delattr_state { > > { XFS_DAS_NODE_REMOVE_ATTR, "XFS_DAS_NODE_REMOVE_ATTR" }, \ > > { XFS_DAS_DONE, "XFS_DAS_DONE" } > > > > +struct xfs_attri_log_nameval; > > Ok, so this structure is: > > struct xfs_attri_log_nameval { > refcount_t anvl_refcount; > unsigned int anvl_name_len; > unsigned int anvl_value_len; > > /* name and value follow the end of this struct */ > }; > > If we are going to have a custom structure for this, let's actually > do it properly and not use magic pointers for the name/value > buffers. These are effectively iovecs, and we already do this > properly with log iovecs in shadow buffers. i.e: > > struct xfs_attri_log_nameval { > refcount_t refcount; > struct xfs_log_iovec name; > struct xfs_log_iovec value; > > /* name and value follow the end of this struct */ > }; > > When we set up the structure: > > nv.name.i_addr = (void *)&nv + 1; > nv.name.i_len = name_len; > memcpy(nv.name.i_addr, name, name_len); > if (value_len) { > nv.value.i_addr = nv.name.i_addr + nv.name.i_len; > nv.value.i_len = value_len; > memcpy(nv.value.i_addr, value, value_len); > } Yes, that makes a lot more sense. I'll convert the whole thing to use log iovecs and non-prefixed names. > And from that point onwards we only ever use the i_addr/i_len pairs > to refer to the name/values. We don't need pointer magic anywhere > but the setup code. > > (Also, "anvil"? Too clever by half and unnecessily verbose. It's a > just {name, value} pair, call it "nv"). Heh. :) > > /* > > * Defines for xfs_attr_item.xattri_flags > > */ > > @@ -512,6 +514,12 @@ enum xfs_delattr_state { > > struct xfs_attr_item { > > struct xfs_da_args *xattri_da_args; > > > > + /* > > + * Shared buffer containing the attr name and value so that the logging > > + * code can share large memory buffers between log items. > > + */ > > + struct xfs_attri_log_nameval *xattri_nameval; > > + > > /* > > * Used by xfs_attr_set to hold a leaf buffer across a transaction roll > > */ > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c > > index 4976b1ddc09f..7d4469e8a4fc 100644 > > --- a/fs/xfs/xfs_attr_item.c > > +++ b/fs/xfs/xfs_attr_item.c > > @@ -39,12 +39,91 @@ static inline struct xfs_attri_log_item *ATTRI_ITEM(struct xfs_log_item *lip) > > return container_of(lip, struct xfs_attri_log_item, attri_item); > > } > > > > +/* > > + * Shared xattr name/value buffers for logged extended attribute operations > > + * > > + * When logging updates to extended attributes, we can create quite a few > > + * attribute log intent items for a single xattr update. To avoid cycling the > > + * memory allocator and memcpy overhead, the name (and value, for setxattr) > > + * are kept in a refcounted object that is shared across all related log items > > + * and the upper-level deferred work state structure. The shared buffer has > > + * a control structure, followed by the name, and then the value. > > + */ > > + > > +static inline struct xfs_attri_log_nameval * > > +xfs_attri_log_nameval_get( > > + struct xfs_attri_log_nameval *anvl) > > +{ > > + BUG_ON(!refcount_inc_not_zero(&anvl->anvl_refcount)); > > + return anvl; > > No BUG_ON() error handling. > > ASSERT or WARN, return NULL on failure. Handle failure in the > caller. Fixed. > > +} > > + > > +static inline void > > +xfs_attri_log_nameval_put( > > + struct xfs_attri_log_nameval *anvl) > > +{ > > + if (refcount_dec_and_test(&anvl->anvl_refcount)) > > + kvfree(anvl); > > +} > > + > > +static inline void * > > +xfs_attri_log_namebuf( > > + struct xfs_attri_log_item *attrip) > > +{ > > + return attrip->attri_nameval + 1; > > I don't know what this points to just looking at this. Trying to > work out if it points to an address in the attri_nameval region or > whether it points to something in the attrip structure makes my head > hurt. I'd much prefer this be written as: > > return attrip->attri_nameval->name.i_addr; > > Because it explicitly encodes exactly what buffer we are returning > here. And with this, we probably don't even need the wrapper > function now. Yep, this is now removed. > > +} > > + > > +static inline void * > > +xfs_attri_log_valbuf( > > + struct xfs_attri_log_item *attrip) > > +{ > > + struct xfs_attri_log_nameval *anvl = attrip->attri_nameval; > > + > > + if (anvl->anvl_value_len == 0) > > + return NULL; > > + > > + return (char *)xfs_attri_log_namebuf(attrip) + anvl->anvl_name_len; > > +} > > And this becomes: > > return attrip->attri_nameval->value.i_addr; > > Because it is initialised to NULL if there is no value. > > I think the need for this wrapper goes away, too. Also removed. > > + > > +static inline struct xfs_attri_log_nameval * > > +xfs_attri_log_nameval_alloc( > > + const void *name, > > + unsigned int name_len, > > + const void *value, > > + unsigned int value_len) > > +{ > > + struct xfs_attri_log_nameval *anvl; > > + void *p; > > + > > + /* > > + * This could be over 64kB in length, so we have to use kvmalloc() for > > + * this. But kvmalloc() utterly sucks, so we use our own version. > > + */ > > + anvl = xlog_kvmalloc(sizeof(struct xfs_attri_log_nameval) + > > + name_len + value_len); > > + if (!anvl) > > + return anvl; > > + > > + anvl->anvl_name_len = name_len; > > + anvl->anvl_value_len = value_len; > > I'm really starting to dislike all the 4 and 5 letter variable name > prefixs in the new attr code. They are all just repeating the > variable name and so are largely redundant and makes the code > very verbose for no good reason. This: > > anvl->name_len = name_len; > anvl->value_len = value_len; > > is better because it is shorter and conveys exactly the same > information to the reader. <nod> > > + p = anvl + 1; > > + memcpy(p, name, name_len); > > + if (value_len) { > > + p += name_len; > > + memcpy(p, value, value_len); > > + } > > + refcount_set(&anvl->anvl_refcount, 1); > > + return anvl; > > +} > > + > > STATIC void > > xfs_attri_item_free( > > struct xfs_attri_log_item *attrip) > > { > > kmem_free(attrip->attri_item.li_lv_shadow); > > - kvfree(attrip); > > + if (attrip->attri_nameval) > > + xfs_attri_log_nameval_put(attrip->attri_nameval); > > Handle the NULL inside xfs_attri_log_nameval_put(). Fixed. > .... > > > @@ -108,22 +189,22 @@ xfs_attri_item_format( > > * the log recovery. > > */ > > > > - ASSERT(attrip->attri_name_len > 0); > > + ASSERT(anvl->anvl_name_len > 0); > > attrip->attri_format.alfi_size++; > > > > - if (attrip->attri_value_len > 0) > > + if (anvl->anvl_value_len > 0) > > attrip->attri_format.alfi_size++; > > > > xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTRI_FORMAT, > > &attrip->attri_format, > > sizeof(struct xfs_attri_log_format)); > > xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME, > > - attrip->attri_name, > > - attrip->attri_name_len); > > - if (attrip->attri_value_len > 0) > > + xfs_attri_log_namebuf(attrip), > > + anvl->anvl_name_len); > > Yeah, these wrappers to retrieve the actual buffer need to die. > > FWIW, If we've already got the name encoded in a log iovec, add a > xlog_copy_from_iovec() method that handles empty iovecs and just > replace all this code with: > > xlog_copy_from_iovec(lv, &vecp, nv->name); > xlog_copy_from_iovec(lv, &vecp, nv->value); > > This is the first step towards sharing the NV buffer deep into the > CIL commit code so the shadow buffer doesn't need to allocate > another 64kB and copy that 64kB every time we log a new intent. > > Not necessary for this patch, but I really want start in that > direction to get away from magic buffers so that the future > optimisations we already know we need to make are easier to do. I agree, structured, uh, structs seems like a better course to take in the long run. > > + if (anvl->anvl_value_len > 0) > > xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_VALUE, > > - attrip->attri_value, > > - attrip->attri_value_len); > > + xfs_attri_log_valbuf(attrip), > > + anvl->anvl_value_len); > > } > > > > /* > > @@ -158,41 +239,17 @@ xfs_attri_item_release( > > STATIC struct xfs_attri_log_item * > > xfs_attri_init( > > struct xfs_mount *mp, > > - uint32_t name_len, > > - uint32_t value_len) > > - > > + struct xfs_attri_log_nameval *anvl) > > { > > struct xfs_attri_log_item *attrip; > > - uint32_t buffer_size = name_len + value_len; > > > > - if (buffer_size) { > > - /* > > - * This could be over 64kB in length, so we have to use > > - * kvmalloc() for this. But kvmalloc() utterly sucks, so we > > - * use own version. > > - */ > > - attrip = xlog_kvmalloc(sizeof(struct xfs_attri_log_item) + > > - buffer_size); > > - } else { > > - attrip = kmem_cache_alloc(xfs_attri_cache, > > - GFP_NOFS | __GFP_NOFAIL); > > - } > > - memset(attrip, 0, sizeof(struct xfs_attri_log_item)); > > + attrip = kmem_cache_zalloc(xfs_attri_cache, GFP_NOFS | __GFP_NOFAIL); > > > > - attrip->attri_name_len = name_len; > > - if (name_len) > > - attrip->attri_name = ((char *)attrip) + > > - sizeof(struct xfs_attri_log_item); > > - else > > - attrip->attri_name = NULL; > > - > > - attrip->attri_value_len = value_len; > > - if (value_len) > > - attrip->attri_value = ((char *)attrip) + > > - sizeof(struct xfs_attri_log_item) + > > - name_len; > > - else > > - attrip->attri_value = NULL; > > + /* > > + * Grab an extra reference to the name/value buffer for this log item. > > + * The caller retains its own reference! > > + */ > > + attrip->attri_nameval = xfs_attri_log_nameval_get(anvl); > > Handle _get failure here, or better, pass in an already referenced > nv because the caller should always have a reference to begin > with. Caller can probably handle allocation failure, too, because > this should be called before we've dirtied the transaction, right? _nameval_get merely bumps the refcount, so the caller should already have a valid reference to begin with. So I think the _get function can become: if (!refcount_inc_not_zero(anvl..)) return NULL; return val; and then this callsite can add a simple ASSERT to ensure that we never pass around a zero-refcount object (in theory the refcount code will also fail loudly): attrip->attri_nameval = xfs_attri_log_nameval_get(anvl); ASSERT(attrip->attri_nameval); > ..... > > > @@ -385,16 +435,33 @@ xfs_attr_create_intent( > > * Each attr item only performs one attribute operation at a time, so > > * this is a list of one > > */ > > - list_for_each_entry(attr, items, xattri_list) { > > - attrip = xfs_attri_init(mp, attr->xattri_da_args->namelen, > > - attr->xattri_da_args->valuelen); > > - if (attrip == NULL) > > - return NULL; > > + attr = list_first_entry_or_null(items, struct xfs_attr_item, > > + xattri_list); > > > > - xfs_trans_add_item(tp, &attrip->attri_item); > > - xfs_attr_log_item(tp, attrip, attr); > > + /* > > + * Create a buffer to store the attribute name and value. This buffer > > + * will be shared between the higher level deferred xattr work state > > + * and the lower level xattr log items. > > + */ > > + if (!attr->xattri_nameval) { > > + struct xfs_da_args *args = attr->xattri_da_args; > > + > > + /* > > + * Transfer our reference to the name/value buffer to the > > + * deferred work state structure. > > + */ > > + attr->xattri_nameval = xfs_attri_log_nameval_alloc(args->name, > > + args->namelen, args->value, args->valuelen); > > + } > > + if (!attr->xattri_nameval) { > > + xlog_force_shutdown(mp->m_log, SHUTDOWN_LOG_IO_ERROR); > > + return NULL; > > } > > Why shutdown on ENOMEM? I would have expects that we return to the > caller so it can cancel the transaction. That way we only shut down > if the transaction is dirty or the caller context can't handle > errors.... ->create_intent can return NULL if they don't need to log any intent items to restart a piece of deferred work. xfs_defer_create_intent*() currently have no means to convey an errno up to their callers, but that could be a preparatory cleanup. > > > > + attrip = xfs_attri_init(mp, attr->xattri_nameval); > > + xfs_trans_add_item(tp, &attrip->attri_item); > > + xfs_attr_log_item(tp, attrip, attr); > > + > > return &attrip->attri_item; > > } > > > > @@ -404,6 +471,8 @@ xfs_attr_free_item( > > { > > if (attr->xattri_da_state) > > xfs_da_state_free(attr->xattri_da_state); > > + if (attr->xattri_nameval) > > + xfs_attri_log_nameval_put(attr->xattri_nameval); > > Handle the null inside xfs_attri_log_nameval_put(). Fixed. > > @@ -567,10 +615,17 @@ xfs_attri_item_recover( > > attr->xattri_op_flags = attrp->alfi_op_flags & > > XFS_ATTR_OP_FLAGS_TYPE_MASK; > > > > + /* > > + * We're reconstructing the deferred work state structure from the > > + * recovered log item. Grab a reference to the name/value buffer and > > + * attach it to the new work state. > > + */ > > + attr->xattri_nameval = xfs_attri_log_nameval_get(anvl); > > and handle NULL here. Done. --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx