On Wed, Sep 06, 2017 at 08:49:28AM -0400, Brian Foster wrote:
> On Wed, Sep 06, 2017 at 10:12:01PM +1000, Dave Chinner wrote:
> > On Wed, Sep 06, 2017 at 07:43:05AM -0400, Brian Foster wrote:
> > > That said, while the implementation improvement makes sense, I'm
> > > still not necessarily convinced that this has a place in the
> > > upstream realtime feature. I'll grant you that I'm not terribly
> > > familiar with the historical realtime use case.. Dave, do you see
> > > value in such a heuristic as it relates to the realtime feature
> > > (not this tiering setup)? Is there necessarily a mapping between
> > > a large file size and a file that should be tagged realtime?
> > 
> > I don't see it much differently to the inode32 allocator policy.
> > That separates metadata from data based on the type of allocation
> > that is going to take place. inode32 decides on the AG for the
> > inode data on the first data allocation (via the ag rotor), so
> > there's already precedent for this sort of "locality selection at
> > initial allocation" policy in the XFS allocation algorithms.
> > 
> > Some workloads run really well on inode32 because the metadata
> > ends up tightly packed and you can keep lots of disks busy with a
> > dm concat because data IO is effectively distributed over all AGs.
> > We've never done that automatically with the rt device before, but
> > if it allows hybrid setups to be constructed easily then I can see
> > it being beneficial to those same sorts of workloads....
> > 
> > And, FWIW, auto rtdev selection might also work quite nicely with
> > write once large file workloads (i.e. archives) on SMR drives -
> > data device for the PMR region for metadata and small or temporary
> > files, rt device w/ appropriate extent size for large files in the
> > SMR region...
> 
> Ok, that sounds reasonable enough to me. Thanks.
> > > E.g., I suppose somebody who is
> > > using traditional realtime (i.e., no SSD) and has a mix of
> > > legitimate realtime (streaming media) files and large sparse virt
> > > disk images or something of that nature would need to know to not
> > > use this feature (i.e., this requires documentation)..?
> > 
> > It wouldn't be enabled by default. We can't break existing rt
> > device setups, so I don't see any issue here. And, well, someone
> > mixing realtime and sparse virt in the same filesystem and storage
> > isn't going to get reliable realtime response. i.e. nobody in their
> > right mind mixes realtime streaming workloads with anything else -
> > it's always dedicated hardware for RT....
> 
> Yes, that's just a dumb example. Let me rephrase...
> 
> Is there any legitimate realtime use case where a filesystem may not
> want to tag all files of a particular size? E.g., this is more
> relevant for subsequent read requirements than anything, right? (If
> not, then why do we have the flag at all?) If so, then it seems to me
> this needs to be clearly documented...

Hmmm. I'm not following you here, Brian. RT has a deterministic
allocator to prevent arbitrary IO delays on write, not read. The read
side on RT is no different to the data device (i.e. extent lookup,
read data) and as long as both allocators have given the file large
contiguous extents there's no difference in the size and shape of read
IOs being issued, either. So I'm not sure what you are saying needs
documenting?

Also, keep in mind the RT device is not suited to small files at all.
It's optimised for allocating large contiguous extents, it doesn't
handle freespace fragmentation at all well (so having small files come
and go regularly really screws it up), and its single-threaded
allocator means it can't handle the allocation demand that comes along
with small file workloads, either.....

> Note that this use case defines large as >256k. Realtime use cases
> may have a much different definition, yes?
Again, if the workload is "realtime"(*) then it is not going to be
using this functionality - everything needs to be tightly controlled
and leave nothing to unpredictable algorithmic heuristics.

Regardless, for different *data sets* the size threshold might be
different, but that is for the admin who understands the environment
and applications to separate workloads and set appropriate policy for
each. If you're only worried about it being a fs global setting, then
start thinking about how to do it per inode/directory. Personally,
though, I think we need to start moving all the allocation policy
stuff (extsize hints, flags, etc) into a generic alloc policy xattr
space, otherwise we're going to run out of space in the inode core
for all this alloc policy stuff...

> I take it that means things like amount of physical memory and
> write workload may also be a significant factor in the
> effectiveness of this heuristic. For example, how much pagecache
> can we dirty before writeback occurs and does an initial
> allocation? How many large files are typically written in parallel?

Delayed allocation on large files works just fine regardless of these
parameter variations - that's the whole point of all the heuristics
in the delalloc code to prevent fragmentation. IOWs, machine loading
and workload should not significantly impact what device large files
are written to, because it's rare that large files get allocated in
tiny chunks by XFS.

Where mistakes are made, xfs_fsr can relocate the files
appropriately. And the good part about having the metadata on SSD is
that the xfs_fsr scan to find such files (i.e. bulkstat) won't impact
the running workload significantly.

> Also, what about direct I/O or extent size hints?

If you are doing direct IO, then it's up to the admin and application
to make sure it's not doing something silly.
The usual raft of alloc policy controls like extent size hints,
preallocation and/or actually setting the rt inherit bits manually on
data set directories can deal with issues here...

> All I'm really saying is that I think this at least needs to
> consider the generic use case and have some documentation around any
> scenarios where this might not make sense for traditional users,
> what values might be sane, etc.

I think you're conflating "integrating new functionality in a generic
manner" with "this is new generic functionality everyone should use".
CRCs and reflink fall into the latter category, while allocation
policies for rtdevs fall into the former....

> As opposed to such users seeing an "automagic" knob,
> turning it on thinking it replaces the need to think about how to
> properly lay out the fs and then realizing later that this doesn't
> do what they expect. Thoughts?

ISTM that you are over-thinking the problem. :/ We should document
how something can/should be used, not iterate all the cases where it
should not be used, because they vastly outnumber the valid use
cases.

I can see how useful a simple setup like Richard has described is for
efficient long term storage in large scale storage environments. I
think we should aim to support that cleanly and efficiently first,
not try to make it into something that nobody is asking for....

Cheers,

Dave.

(*) <rant warning>
The "realtime" device isn't real time at all. It's a shit name and I
hate it because it makes people think it's something that it isn't.
It's just an alternative IO address space with a bound overhead (i.e.
deterministic) allocator that is optimised for large contiguous data
allocations. It's used for workloads that are latency sensitive, not
"real time". The filesystem is not real time capable and the IO
subsystem is most definitely not real time capable. It's a crap name.
<end rant>

-- 
Dave Chinner
david@xxxxxxxxxxxxx