On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:
>
> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>
>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>
>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>
>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>
>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>
>> ...
>>>>
>>>> The agsize/agcount mkfs-time heuristics change depending on the
>>>> type of storage. A single AG can be up to 1TB, and if the fs is
>>>> not considered "multidisk" (e.g., no stripe unit/width is
>>>> defined), 4 AGs is the default up to 4TB. If a stripe unit is
>>>> set, the agsize/agcount is adjusted depending on the size of the
>>>> overall volume (see
>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for
>>>> details).
>>>
>>> We'll experiment with this. Surely it depends on more than the
>>> amount of storage? If you have a high op rate, you'll be more
>>> likely to excite contention, no?
>>>
>> Sure. The absolute optimal configuration for your workload probably
>> depends on more than storage size, but mkfs doesn't have that
>> information. In general, it tries to use the most reasonable
>> configuration based on the storage and expected workload. If you
>> want to tweak it beyond that, indeed, the best bet is to experiment
>> with what works.
>
> We will do that.
>
>>>>> Are those locks held around I/O, or just CPU operations, or a
>>>>> mix?
>>>>
>>>> I believe it's a mix of modifications and I/O, though it looks
>>>> like some of the I/O cases don't necessarily wait on the lock.
>>>> E.g., the AIL pushing case will trylock and defer to the next
>>>> list iteration if the buffer is busy.
>>>>
>>> Ok. For us, sleeping in io_submit() is death because we have no
>>> other thread on that core to take its place.
>>>
>> The above is with regard to metadata I/O, whereas io_submit() is
>> obviously for user I/O.
>
> Won't io_submit() also trigger metadata I/O? Or is that all deferred
> to async tasks? I don't mind them blocking each other as long as
> they leave my io_submit() alone.
>
>> io_submit() can probably block in a variety of places afaict... it
>> might have to read in the inode extent map, allocate blocks, take
>> inode/AG locks, reserve log space for transactions, etc.
>
> Any chance of changing all that to be asynchronous? Doesn't sound
> too hard, if somebody else has to do it.
>
>> It sounds to me that, first and foremost, you want to make sure you
>> don't have however many parallel operations you typically have
>> running contending on the same inodes or AGs. Hint: creating files
>> under separate subdirectories is a quick and easy way to allocate
>> inodes under separate AGs (the agno is encoded into the upper bits
>> of the inode number).
>
> Unfortunately our directory layout cannot be changed. And doesn't
> this require having agcount == O(number of active files)? That is
> easily in the thousands. Actually, wouldn't agcount == O(nr_cpus) be
> good enough?
>
>> Reducing the frequency of block allocations/frees might also help
>> (e.g., preallocate and reuse files,
>
> Isn't that discouraged for SSDs?
>
> We can do that for a subset of our files.
>
> We do use XFS_IOC_FSSETXATTR, though.
>
>> 'mount -o ikeep',
>
> Interesting. Our files are large, so we could try this.
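If it's useful, here is roughly what Brian's subdirectory hint looks
like in practice. A quick sketch only: the device, mount point, and
AG count below are made up for illustration, and exact inode
placement is ultimately up to the allocator's heuristics (with the
inode64 allocator, new directories tend to rotate across AGs):

    mkfs.xfs -d agcount=8 /dev/vdb    # hypothetical device, forced AG count
    mount /dev/vdb /mnt/test          # hypothetical mount point
    xfs_info /mnt/test                # confirm agcount/agsize
    for i in $(seq 0 7); do
        mkdir /mnt/test/dir$i         # new dirs tend to land in different AGs
        touch /mnt/test/dir$i/f       # files follow the parent dir's AG
    done
    ls -i /mnt/test/dir*/f            # inode numbers fall into distinct
                                      # per-AG ranges (agno in the upper bits)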
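And for the preallocate-and-reuse idea, xfs_io can do the
preallocation and set an extent size hint in one go (it sets the hint
via the same XFS_IOC_FSSETXATTR ioctl mentioned above). Again just a
sketch with hypothetical paths and sizes; note the hint has to be set
before the file gains any extents:

    # set a 1MiB extent size hint, then preallocate 10GiB to reuse
    xfs_io -f -c "extsize 1m" -c "falloc 0 10g" /mnt/test/dir0/blob
    xfs_io -c "extsize" /mnt/test/dir0/blob   # prints the hint, e.g. [1048576]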
>> etc.). Beyond that, you probably want to make sure the log is
>> large enough to support all concurrent operations. See the
>> xfs_log_grant_* tracepoints for a window into if/how long
>> transaction reservations might be waiting on the log.
>
> I see that on a 400G fs, the log is 180MB. Seems plenty large for
> write operations that are mostly large and sequential, though I've
> no real feel for the numbers. Will keep an eye on this.
>
> Thanks for all the info.
>
>> Brian
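For watching the log grant tracepoints Brian points at, something
along these lines works (assuming tracefs is available under the
usual debugfs path):

    cd /sys/kernel/debug/tracing
    for e in events/xfs/xfs_log_grant_*/enable; do
        echo 1 > "$e"                 # enable all xfs_log_grant_* events
    done
    cat trace_pipe                    # watch for reservations stalling on
                                      # the log while the workload runs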