Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

David Lang <david@xxxxxxx> · Wed, 22 Jan 2014 18:46:11 -0800 (PST)

On Wed, 22 Jan 2014, Chris Mason wrote:

> On Wed, 2014-01-22 at 11:50 -0800, Andrew Morton wrote:
>> On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> But this, I think, is the fundamental point for debate.  If we can pull
>>> alignment and other tricks to solve 99% of the problem is there a need
>>> for radical VM surgery?  Is there anything coming down the pipe in the
>>> future that may move the devices ahead of the tricks?
>>
>> I expect it would be relatively simple to get large blocksizes working
>> on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
>> amounts of work, perhaps someone can do a proof-of-concept on powerpc
>> (or ia64) with 64k blocksize.
>
> Maybe 5 drives in raid5 on MD, with 4K coming from each drive.  Well
> aligned 16K IO will work, everything else will be about the same as a rmw
> from a single drive.

I think this is the key point to think about here. How will these new large
hard drive block sizes differ from RAID stripes and SSD eraseblocks?

In all of these cases there are very clear advantages to doing the writes in 
properly sized and aligned chunks that correspond with the underlying structure 
to avoid the RMW overhead.
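
As a rough illustration of what "properly sized and aligned" means for the
raid5 example quoted above (5 drives, 4K from each), here is a minimal sketch.
The function names are made up for this example and it assumes a single
parity drive; it is not how md itself makes the decision.

```python
def full_stripe_bytes(drives, chunk_bytes, parity_drives=1):
    """Data bytes in one full stripe (hypothetical helper).

    Assumes a raid5-style layout where parity_drives chunks per stripe
    hold parity and the remaining chunks hold data.
    """
    return (drives - parity_drives) * chunk_bytes


def is_full_stripe_write(offset, length, stripe_bytes):
    """True if the write starts on a stripe boundary and covers only whole stripes.

    Only such writes can, in principle, skip the read-modify-write step.
    """
    return length > 0 and offset % stripe_bytes == 0 and length % stripe_bytes == 0


stripe = full_stripe_bytes(drives=5, chunk_bytes=4096)  # 4 data chunks of 4K = 16K
aligned = is_full_stripe_write(0, 16384, stripe)        # whole stripe, aligned
shifted = is_full_stripe_write(4096, 16384, stripe)     # right size, wrong offset
small = is_full_stripe_write(0, 4096, stripe)           # aligned, but too small
```

The same check would apply to a large-sector drive or an SSD eraseblock by
substituting the sector or eraseblock size for the stripe size.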

It's extremely unlikely that drive manufacturers will produce drives that won't 
work with any existing OS, so they are going to support smaller writes in 
firmware. If they don't, they won't be able to sell their drives to anyone 
running existing software. Given the Enterprise software upgrade cycle compared 
to the expanding storage needs, whatever they ship will have to work on OS and 
firmware releases that happened several years ago.

I think what is needed is some way to be able to get a report on how many RMW 
cycles have to happen. Then people can work on ways to reduce this number and 
measure the results.
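
To make the idea concrete, here is a sketch of the kind of report I mean, run
over a list of (offset, length) writes. The trace format and the rmw_report
name are invented for illustration, and it assumes the simplest possible
model: every block that is only partly overwritten costs exactly one RMW
cycle, with no caching or merging of adjacent writes.

```python
from collections import Counter


def rmw_report(trace, block_bytes):
    """Count full-block writes and RMW cycles for a list of (offset, length) writes.

    block_bytes is the underlying unit: a RAID stripe, a large drive sector,
    or an SSD eraseblock. Simplified model: one RMW per partly covered block.
    """
    report = Counter(writes=0, full_blocks=0, rmw_cycles=0)
    for offset, length in trace:
        if length <= 0:
            continue
        end = offset + length
        # Blocks touched at all by this write.
        touched = (end - 1) // block_bytes - offset // block_bytes + 1
        # Blocks completely covered: from the first boundary at or after
        # offset up to the last boundary at or before end.
        first_boundary = -(-offset // block_bytes)  # ceiling division
        last_boundary = end // block_bytes
        full = max(0, last_boundary - first_boundary)
        report["writes"] += 1
        report["full_blocks"] += full
        report["rmw_cycles"] += touched - full
    return report


# Hypothetical trace against a 16K stripe.
trace = [(0, 16384), (4096, 16384), (0, 4096), (8192, 40960)]
report = rmw_report(trace, 16384)
```

With a number like rmw_cycles available per device, a change to the
filesystem or the allocator could be judged by whether it goes down.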

I don't know if md and dm are currently smart enough to realize that the entire 
stripe is being overwritten and avoid the RMW cycle. If they can't, I would 
expect that once we start measuring it, they will gain such support.

David Lang