On Sun, 2009-04-12 at 08:41 -0700, Linus Torvalds wrote:
>
> On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
> >
> > I did not hear about NTFS using >4kB sectors yet but technically
> > it should work.
> >
> > The atomic building units (sector size, block size, etc) of NTFS are
> > entirely parametric. The maximum values could be bigger than the
> > currently "configured" maximum limits.
>
> It's probably trivial to make ext3 support 16kB blocksizes (if it
> doesn't already).
>
> That's not the problem. The "filesystem layout" part is just a
> parameter.
>
> The problem is then trying to actually access such a filesystem, in
> particular trying to write to it, or trying to mmap() small chunks of
> it. The FS layout is the trivial part.
>
> > At present the limits are set in the BIOS Parameter Block in the NTFS
> > Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte
> > for "Sectors Per Block". So >4kB sector size should work since 1993.
> >
> > 64kB+ sector size could be possible by bootstrapping NTFS drivers
> > in a different way.
>
> Try it. And I don't mean "try to create that kind of filesystem". Try
> to _use_ it. Does Windows actually support using it, or is it just a
> matter of "the filesystem layout is _specified_ for up to 64kB block
> sizes"?
>
> And I really don't know. Maybe Windows does support it. I'm just very
> suspicious. I think there's a damn good reason why NTFS supports
> larger block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE
> DESPITE THAT!
>
> Because it really is a hard problem. It's really pretty nasty to have
> your cache blocking be smaller than the actual filesystem blocksize
> (the other way is much easier, although it's certainly not pleasant
> either - Linux supports it because we _have_ to, but sector-size of
> hardware had traditionally been 4kB; I'd certainly also argue against
> adding complexity just to make it smaller, the same way I argue
> against making it much larger).
>
> And don't get me wrong - we could (fairly) trivially make the
> PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a
> per-mapping thing, so that you could have some filesystems with that
> bigger sector size and some with smaller ones. I think Andrea had
> patches that did a fair chunk of it, and that _almost_ worked.
>
> But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would
> absolutely blow chunks. It would be disgustingly horrible. Putting
> the kernel source tree on such a filesystem would waste about 75% of
> all memory (the median size of a source file is just about 4kB), so
> your page cache would be effectively cut in a quarter for a lot of
> real loads.
>
> And to fix up _that_, you'd need to now do things like sub-page
> allocations, and now your page-cache size isn't even fixed per
> filesystem, it would be per-file, and the filesystem (and the
> drivers!) would have to handle the cases of getting those 4kB
> partial pages (and do r-m-w IO after all if your hardware sector
> size is >4kB).

We might not have to go that far for a device with these special
characteristics. It should be possible to build a block-size-remapping,
read-modify-write type device that presents a 4k block size to the OS
while operating on n*4k blocks at the device. We could implement the
read operations as readahead in the page cache, so if we're lucky we
mostly end up operating on full n*4k blocks anyway.
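Roughly this shape for the read side (purely illustrative user-space C,
not a real block-layer or device-mapper target; remap_read_4k(),
LOGICAL_BLOCK, NATIVE_BLOCK and N are invented names, and n=4 is just an
example): every 4k logical read fetches the whole containing n*4k native
block, so the device itself only ever sees native-sized I/O.

        /*
         * Hypothetical user-space model of the remapping layer: the OS
         * side sees 4k logical blocks, the device side only ever does
         * n*4k native I/O.  All names and sizes here are made up for
         * illustration.
         */
        #include <stdint.h>
        #include <string.h>
        #include <sys/types.h>
        #include <unistd.h>

        #define LOGICAL_BLOCK   4096                   /* block size shown to the OS */
        #define N               4                      /* example: native block is 4 x 4k */
        #define NATIVE_BLOCK    (N * LOGICAL_BLOCK)    /* 16k native device block */

        /*
         * Read one 4k logical block.  We always fetch the whole native
         * block that contains it - a crude stand-in for the page-cache
         * readahead mentioned above - so neighbouring 4k reads are
         * satisfied from data we already pulled in.
         */
        static int remap_read_4k(int dev_fd, uint64_t lblock, void *out4k)
        {
                uint8_t  native[NATIVE_BLOCK];          /* on-stack buffer, fine for a sketch */
                uint64_t nblock = lblock / N;           /* containing native block number */
                off_t    noff   = (off_t)nblock * NATIVE_BLOCK;

                /* One full-size native read; the device never sees a 4k request. */
                if (pread(dev_fd, native, NATIVE_BLOCK, noff) != NATIVE_BLOCK)
                        return -1;

                /* Hand back just the 4k slice the caller asked for. */
                memcpy(out4k, native + (lblock % N) * LOGICAL_BLOCK, LOGICAL_BLOCK);
                return 0;
        }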
For the cases where we've lost pieces of the n*4k native block and we
have to do a write, we'd just suck it up and do a read-modify-write on a
separate memory area, a bit like the new 4k-sector devices do when
emulating 512-byte sectors. The suck factor of this double I/O plus the
memory-copy overhead should be partially mitigated by the fact that the
underlying device is very fast.

James
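PS: and the write side of the same sketch, again purely illustrative
(remap_write_4k() and the constants reuse the invented names from the
read sketch above): a partial write becomes a read of the full native
block into a bounce buffer, a 4k patch, and one full native-block write
back, the same trick 4k-sector drives use for 512-byte emulation.

        /*
         * Hypothetical RMW write path for the same model.  Uses the
         * includes and LOGICAL_BLOCK/N/NATIVE_BLOCK definitions from
         * the read sketch above; everything here is illustrative only.
         */
        static int remap_write_4k(int dev_fd, uint64_t lblock, const void *in4k)
        {
                uint8_t  bounce[NATIVE_BLOCK];          /* the "separate memory area" */
                uint64_t nblock = lblock / N;
                off_t    noff   = (off_t)nblock * NATIVE_BLOCK;

                /* Read: fetch the current contents of the full native block. */
                if (pread(dev_fd, bounce, NATIVE_BLOCK, noff) != NATIVE_BLOCK)
                        return -1;

                /* Modify: overlay the 4k piece we actually want to change. */
                memcpy(bounce + (lblock % N) * LOGICAL_BLOCK, in4k, LOGICAL_BLOCK);

                /* Write: push the whole native block back out in one I/O. */
                if (pwrite(dev_fd, bounce, NATIVE_BLOCK, noff) != NATIVE_BLOCK)
                        return -1;
                return 0;
        }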