On Fri, Mar 06, 2015 at 07:20:55PM -0500, Johannes Weiner wrote:
> On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> > What we don't know is how many objects we might need to scan to find
> > the objects we will eventually modify. Here's an (admittedly
> > extreme) example to demonstrate a worst case scenario: allocate a
> > 64k data extent. Because it is an exact size allocation, we look it
> > up in the by-size free space btree. Free space is fragmented, so
> > there are about a million 64k free space extents in the tree.
> >
> > Once we find the first 64k extent, we search them to find the best
> > locality target match. The btree records are 16 bytes each, so we
> > fit roughly 500 to a 4k block. Say we search half the extents to
> > find the best match - i.e. we walk a thousand leaf blocks before
> > finding the match we want, and modify that leaf block.
> >
> > Now, the modification removed an entry from the leaf and that
> > triggers leaf merge thresholds, so a merge with the 1002nd block
> > occurs. That block now demand pages in and we then modify and join
> > it to the transaction. Now we walk back up the btree to update
> > indexes, merging blocks all the way back up to the root. We have a
> > worst case size btree (5 levels) and we merge at every level meaning
> > we demand page another 8 btree blocks and modify them.
> >
> > In this case, we've demand paged ~1010 btree blocks, but only
> > modified 10 of them. i.e. the memory we consumed permanently was
> > only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> > the allocation demand was 2 orders of magnitude more than the
> > unreclaimable memory consumption of the btree modification.
> >
> > I hope you start to see the scope of the problem now...
>
> Isn't this bounded one way or another?

For a single transaction? No.
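For anyone who wants to check the arithmetic in the quoted example, here
is a rough sketch. It only replays the figures given above (about a
million extents, roughly 500 records per 4k block, half the extents
searched, one merged sibling, 8 more blocks on the way back up). The
names are made up for illustration; none of this is XFS code.

```python
import math

# Figures come from the quoted example; all names are illustrative.
BLOCK_SIZE = 4096            # 4k btree block
RECORDS_PER_BLOCK = 500      # "roughly 500 to a 4k block"
FREE_EXTENTS = 1_000_000     # 64k free space extents in the by-size btree
MERGE_SIBLINGS = 1           # neighbouring leaf paged in for the merge
UPPER_LEVEL_BLOCKS = 8       # blocks paged in and modified walking back up


def worst_case_blocks(search_fraction=0.5):
    """Return (blocks demand paged, blocks modified) for one allocation."""
    records_searched = int(FREE_EXTENTS * search_fraction)
    leaves_walked = math.ceil(records_searched / RECORDS_PER_BLOCK)
    # Only the last leaf walked is modified; the others were only read.
    modified = 1 + MERGE_SIBLINGS + UPPER_LEVEL_BLOCKS
    paged = leaves_walked + MERGE_SIBLINGS + UPPER_LEVEL_BLOCKS
    return paged, modified


paged, modified = worst_case_blocks()
print(f"demand paged: {paged} blocks ({paged * BLOCK_SIZE // 1024} KiB)")
print(f"modified:     {modified} blocks ({modified * BLOCK_SIZE // 1024} KiB)")
print(f"ratio:        {paged / modified:.0f}x")
```

The gap between the two counts is the point: the memory pinned by the
transaction is tiny, but the memory that had to be allocatable while
finding it is about two orders of magnitude larger.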
> Sure, the inaccuracy itself is
> high, but when you put the absolute numbers in perspective it really
> doesn't seem to matter: with your extreme case of 3MB per transaction,
> you can still run 5k+ of them in parallel on a small 16G machine.

No you can't. The number of concurrent transactions is bounded by
the size of the log and the amount of unused space available for
reservation in the log. Under heavy modification loads, that's
usually somewhere between 15-25% of the log, so worst case is a few
hundred megabytes. The memory reservation demand is in the same
order of magnitude as the log space reservation demand.....

> Occupy a generous 75% of RAM with anonymous pages, and you can STILL
> run over a thousand transactions concurrently. That would seem like a
> decent pipeline to keep the storage device occupied.

Typical systems won't ever get to that - they don't do more than a
handful of concurrent transactions at a time - the "thousands of
transactions" occur on dedicated storage servers like petabyte
scale NFS servers that have hundreds of gigabytes of RAM and
hundreds-to-thousands of processing threads to keep the request
pipeline full. The memory in those machines is entirely dedicated
to the filesystem, so keeping a usable pool of a few gigabytes for
transaction reservations isn't a big deal.

The point here is that you're taking what I'm describing as the
requirements of a reservation pool and then applying the worst case
to situations where it is completely inappropriate. That's what I
mean when I told Michal to stop building silly strawman situations;
large amounts of concurrency are required for huge machines, not
your desktop workstation.

And, realistically, sizing that reservation pool appropriately is my
problem to solve - it will depend on many factors, one of which is
the actual geometry of the filesystem itself.
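To put the numbers from this exchange side by side, here is a rough
sketch. The RAM-only figures replay the quoted 3MB-per-transaction,
16G-machine case. The 2 GiB log size is purely an assumption for
illustration (no log size is given above), and the function names are
hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def memory_bound(ram_bytes, per_txn_bytes, other_use_fraction=0.0):
    """Transactions that would fit if only RAM limited concurrency."""
    usable = ram_bytes * (1.0 - other_use_fraction)
    return int(usable // per_txn_bytes)


def log_reservation_space(log_bytes, free_fraction):
    """Bytes of log space left for new transaction reservations."""
    return int(log_bytes * free_fraction)


# Numbers from the quoted reply: 3MB per transaction on a 16G machine.
ram_only = memory_bound(16 * GIB, 3 * MIB)
ram_75_anon = memory_bound(16 * GIB, 3 * MIB, other_use_fraction=0.75)

# ASSUMPTION: a 2 GiB log, chosen only to illustrate the 15-25% range.
ASSUMED_LOG_BYTES = 2 * GIB
log_low = log_reservation_space(ASSUMED_LOG_BYTES, 0.15)
log_high = log_reservation_space(ASSUMED_LOG_BYTES, 0.25)

print(f"RAM-only bound:          {ram_only} transactions")
print(f"RAM bound, 75% anon:     {ram_75_anon} transactions")
print(f"log space to reserve in: {log_low // MIB}-{log_high // MIB} MiB")
```

Under that assumed log size, the space available for reservations comes
out at a few hundred megabytes, which is the limit that applies well
before the RAM-only bound does.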
You need to stop thinking like you can control how applications use
the memory allocation and reclaim subsystem and start to trust that
we will manage our memory usage appropriately to maintain maximum
system throughput.

After all, we already do that for all the filesystem caches the mm
subsystem doesn't control - why do you think I have had such an
interest in shrinker scalability? For XFS, the only cache we
actually don't control reclaim from is user data in the page cache -
we control everything else directly from custom shrinkers.....

> The level of precision that you are asking for comes with complexity
> and fragility that I'm not convinced is necessary, or justified.

Look, if you don't think reservations will work, then how about you
suggest something that will. I don't really care what you implement,
as long as it meets the needs of demand paging, I have direct
control over memory usage and concurrency policy, and the allocation
mechanism guarantees forward progress without needing the OOM
killer.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx