On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote: > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote: > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote: > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote: > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote: > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote: > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote: > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote: > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote: > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote: > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote: > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote: > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote: > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote: ... > > > > What _seems_ beneficial about that approach is we get (potentially > > > > external) persistent backing and memory reclaim ability with the > > > > traditional memory allocation model. > > > > > > > > ISTM that if we used a regular file, we'd need to deal with the > > > > traditional file interface somehow or another (file read/pagecache > > > > lookup -> record ??). > > > > > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so > > > all we need is a (struct file *). > > > > > > > We could repurpose some existing mechanism like the directory code or > > > > quota inode mechanism to use xfs buffers for that purpose, but I think > > > > that would require us to always use an internal inode. Allowing > > > > userspace to pass an fd/file passes that consideration on to the user, > > > > which might be more flexible. We could always warn about additional > > > > limitations if that fd happens to be based on the target fs. > > > > > > <nod> A second advantage of the struct file/kernel_{read,write} approach > > > is that we if we ever decide to let userspace pass in a fd, it's trivial > > > to feed that struct file to the kernel io routines instead of a memfd > > > one. > > > > > > > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do > > something like this anyways. Could/should xfs_scrub be responsible for > > allocating a memfd and passing along the fd? Another advantage of doing > > that is whatever logic we may need to clean up old repair files or > > whatever is pushed to userspace. > > There are two ways we could do this -- one is to have the kernel manage > the memfd creation internally (like my patches do now); the other is for > xfs_scrub to pass in creat(O_TMPFILE). > > When repair fputs the file (or fdputs the fd if we switch to using > that), the kernel will perform the usual deletion of the zero-linkcount > zero-refcount file. We get all the "cleanup" for free by closing the > file. > Ok. FWIW, the latter approach where xfs_scrub creates a file and passes the fd along to the kernel seems preferable to me, but perhaps others have different opinions. We could accept a pathname from the user to create the file or otherwise attempt to allocate an memfd by default and pass that along. > One other potential complication is that a couple of the repair > functions need two memfds. The extended attribute repair creates a > fixed-record array for attr keys and an xblob to hold names and values; > each structure gets its own memfd. The refcount repair creates two > fixed-record arrays, one for refcount records and another to act as a > stack of rmaps to compute reference counts. > Hmm, I guess there's nothing stopping scrub from passing in two fds. Maybe it would make more sense for the userspace option to be a path basename or directory where scrub is allowed to create whatever scratch files it needs. That aside, is there any reason the repair mechanism couldn't emulate multiple files with a single fd via a magic offset delimeter or something? E.g., "file 1" starts at offset 0, "file 2" starts at offset 1TB, etc. (1TB is probably overkill, but you get the idea..). Brian > (In theory the xbitmap could also be converted to use the fixed record > array, but in practice they haven't (yet) become large enough to warrant > it, and there's currently no way to insert or delete records from the > middle of the array.) > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it > > > > > > swappable or something? > > > > > > > > > > It's supposed to be. The quick test I ran (allocate a memfd, write 1GB > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into > > > > > the swap file. > > > > > > > > > > > > > Ok. > > > > > > > > > > If so, that sounds a reasonable option provided the swap space > > > > > > requirement can be made clear to users > > > > > > > > > > We can document it. I don't think it's any worse than xfs_repair being > > > > > able to use up all the memory + swap... and since we're probably only > > > > > going to be repairing one thing at a time, most likely scrub won't need > > > > > as much memory. > > > > > > > > > > > > > Right, but as noted below, my concerns with the xfs_repair comparison > > > > are that 1.) the kernel generally has more of a limit on anonymous > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.) > > > > it's not clear how effectively running the system out of memory via the > > > > kernel will behave from a failure perspective. > > > > > > > > IOW, xfs_repair can run the system out of memory but for the most part > > > > that ends up being a simple problem for the system: OOM kill the bloated > > > > xfs_repair process. For an online repair in a similar situation, I have > > > > no idea what's going to happen. > > > > > > Back in the days of the huge linked lists the oom killer would target > > > other proceses because it doesn't know that the online repair thread is > > > sitting on a ton of pinned kernel memory... > > > > > > > Makes sense, kind of what I'd expect... > > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM > > > > we'd still be at risk of other subsystems running into memory > > > > allocation problems, filling up swap, the OOM killer going after > > > > unrelated processes, etc. What if, for example, the OOM killer starts > > > > picking off processes in service to a running online repair that > > > > immediately consumes freed up memory until the system is borked? > > > > > > Yeah. One thing we /could/ do is register an oom notifier that would > > > urge any running repair threads to bail out if they can. It seems to me > > > that the oom killer blocks on the oom_notify_list chain, so our handler > > > could wait until at least one thread exits before returning. > > > > > > > Ok, something like that could be useful. I agree that we probably don't > > need to go that far until the mechanism is nailed down and testing shows > > that OOM is a problem. > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" + > "400M of RAM and no swap" scenario. Granted, pretty much every other > xfs utility also blows out on that so I'm not sure how hard I really > need to try... > > > > > I don't know how likely that is or if it really ends up much different > > > > from the analogous xfs_repair situation. My only point right now is > > > > that failure scenario is something we should explore for any solution > > > > we ultimately consider because it may be an unexpected use case of the > > > > underlying mechanism. > > > > > > Ideally, online repair would always be the victim since we know we have > > > a reasonable fallback. At least for memfd, however, I think the only > > > clues we have to decide the question "is this memfd getting in the way > > > of other threads?" is either seeing ENOMEM, short writes, or getting > > > kicked by an oom notification. Maybe that'll be enough? > > > > > > > Hm, yeah. It may be challenging to track memfd usage as such. If > > userspace has access to the fd on an OOM notification or whatever, it > > might be able to do more accurate analysis based on an fstat() or > > something. > > > > Related question... is the online repair sequence currently > > interruptible, if xfs_scrub receives a fatal signal while pulling in > > entries during an allocbt scan for example? > > It's interruptible (fatal signals only) during the scan phase, but once > it starts logging metadata updates it will run all the way to > completion. > > > > > (To the contrary, just using a cached file seems a natural fit from > > > > that perspective.) > > > > > > Same here. > > > > > > > > > and the failure characteristics aren't more severe than for userspace. > > > > > > An online repair that puts the broader system at risk of OOM as > > > > > > opposed to predictably failing gracefully may not be the most useful > > > > > > tool. > > > > > > > > > > Agreed. One huge downside of memfd seems to be the lack of a mechanism > > > > > for the vm to push back on us if we successfully write all we need to > > > > > the memfd but then other processes need some memory. Obviously, if the > > > > > memfd write itself comes up short or fails then we dump the memfd and > > > > > error back to userspace. We might simply have to free array memory > > > > > while we iterate the records to minimize the time spent at peak memory > > > > > usage. > > > > > > > > > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach > > > > may simplify things because we could allocate it up front and know right > > > > away whether we just don't have enough memory available to repair. > > > > > > Hmm. Apparently we actually /can/ call fallocate on memfd to grab all > > > the pages at once, provided we have some guesstimate beforehand of how > > > much space we think we'll need. > > > > > > So long as my earlier statement about the memory requirements being no > > > more than the size of the btree leaves is actually true (I haven't > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() * > > > blocksize) worth of space in the memfd file. Maybe we ask for 1.5x > > > that and if we don't get it, we kill the memfd and exit. > > > > > > > Indeed. It would be nice if we could do all of the file management bits > > in userspace. > > Agreed, though no file management would be even better. :) > > --D > > > Brian > > > > > --D > > > > > > > > > > > Brian > > > > > > > > > --D > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > +done: > > > > > > > > > > > > > > > + /* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */ > > > > > > > > > > > > > > > + xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG); > > > > > > > > > > > > > > > + return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo, > > > > > > > > > > > > > > > + XFS_AG_RESV_NONE); > > > > > > > > > > > > > > > +} > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644 > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp( > > > > > > > > > > > > > > > diff = b1->bno - b2->bno; > > > > > > > > > > > > > > > return diff; > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */ > > > > > > > > > > > > > > > +bool > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty( > > > > > > > > > > > > > > > + struct xfs_perag *pag) > > > > > > > > > > > > > > > +{ > > > > > > > > > > > > > > > + spin_lock(&pag->pagb_lock); > > > > > > > > > > > > > > > + if (pag->pagb_tree.rb_node) { > > > > > > > > > > > > > > > > > > > > > > > > > > > > RB_EMPTY_ROOT()? > > > > > > > > > > > > > > > > > > > > > > > > > > Good suggestion, thank you! > > > > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + spin_unlock(&pag->pagb_lock); > > > > > > > > > > > > > > > + return false; > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > + spin_unlock(&pag->pagb_lock); > > > > > > > > > > > > > > > + return true; > > > > > > > > > > > > > > > +} > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644 > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list) > > > > > > > > > > > > > > > list_sort(NULL, list, xfs_extent_busy_ag_cmp); > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag); > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > #endif /* __XFS_EXTENT_BUSY_H__ */ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > -- > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > -- > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > -- > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > -- > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html