Hi.

I'm gonna move this question to the top, for a short answer:

> > But if you are trying to create benchmarks for a specific application,
> > whether your benchmarks use DIO or not will depend on whether the
> > application uses DIO or not.
>
> This is my main question. I want to run an application without
> involving page caching effects even when the application does not
> support DIO.

You simply can't. Aligned IOs are a primitive of block devices (if I
can use these words). If you don't submit aligned IOs, you can't access
block devices directly. You can't modify the kernel to do that either,
because handling unaligned IO is exactly one of the goals of the buffer
cache, other than improving performance of course.

If you submit an unaligned write, the kernel will first read in the
whole sectors from the block device, modify them according to your
unaligned IO, and write the whole sectors back. For reads, the process
is the same: the kernel will read at least the whole sector, never just
a part of it.
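Just to make the alignment requirement concrete, here is a minimal,
untested sketch of what a properly aligned direct read looks like from
userspace. The 4096-byte alignment is an assumption; on a real block
device you would query the logical block size (e.g. with the BLKSSZGET
ioctl) instead of hardcoding it:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t align = 4096;  /* assumed logical block size */
    void *buf;
    ssize_t ret;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The buffer address must be aligned too; plain malloc() is
     * not enough. */
    if (posix_memalign(&buf, align, align)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* Offset (0) and length (align) are both multiples of the block
     * size. An unaligned offset or length would fail with EINVAL
     * instead of being silently fixed up by the page cache. */
    ret = pread(fd, buf, align, 0);
    if (ret < 0)
        perror("pread");
    else
        printf("read %zd bytes, bypassing the page cache\n", ret);

    free(buf);
    close(fd);
    return 0;
}

If the application does not already issue its IO like this, there is no
knob that will do it on the application's behalf.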
Now, let me try a longer reply :P

On Thu, Sep 26, 2019 at 06:42:43PM -0700, Jianshen Liu wrote:
> Hi Carlos,
>
> Thanks for your reply.
>
> On Thu, Sep 26, 2019 at 5:39 AM Carlos Maiolino <cmaiolino@xxxxxxxxxx> wrote:
> >
> > On Wed, Sep 25, 2019 at 03:51:27PM -0700, Jianshen Liu wrote:
> > > Hi,
> > >
> > > I am working on a project trying to evaluate the performance of a
> > > workload running on a storage device. I don't want the benchmark
> > > result to depend on a specific platform (e.g., a platform with X
> > > GiB of physical memory).
> >
> > Well, this does not sound realistic to me. Memory is only one of the
> > variables in how IO throughput will perform. You bypass memory; then
> > what about the IO controller, disks, storage cache, etc.? All of
> > these are 'platform specific'.
>
> I apologize for any confusion because of my oversimplified project
> description. My final goal is to compare the efficiency of different
> platforms utilizing a specific storage device to run a given workload.
> Since the platforms can be heterogeneous (e.g., x86 vs arm), the
> comparison should be based on a reference unit that is relevant to the
> capability of the storage device but is irrelevant to a specific
> platform.

Storage vendors usually already provide you with the hardware limits
you can use as the reference units you are looking for, such as the
maximum IOPS and throughput the storage solution can sustain. These are
platform- and application-independent reference units.

> With this reference unit, you can understand how much
> performance a platform can give over the capability of the specific
> storage device.

Again, you can use the numbers provided by the vendor. For example, XFS
is designed to be a high-throughput filesystem, and the goal is to be
as close as possible to the hardware limits, but of course, it all
depends on everything else.

> Once you have this knowledge, you can consider whether
> adding/removing some CPUs, memory, the same model of storage devices,
> etc. can improve the platform efficiency (e.g., cost/reference unit)
> with respect to the capability of the storage device under this
> workload.

Storage hardware limits and vendor-provided numbers also apply here.
And you can't simply discard the application's behavior here.
Everything you mentioned will be directly affected by the application
you're using, so modifying the application will give you nothing useful
to work on.

> Moreover, you can answer questions like can you get the full
> unit of performance when you add one more device onto the platform.

For the "full unit of performance", you can, again, use the
vendor-provided numbers :)

> My question here is how to evaluate the platform-independent reference
> unit for the combination of a given workload and a specific storage
> device.

Use the application you are trying to evaluate on different platforms,
and measure it.

> Specifically, the reference unit should be a performance value
> of the workload under the capability of the storage device. In other
> words, this value should not be either enhanced or throttled by the
> testing platform. Yes, memory is one of the variables affecting the
> I/O performance; CPU horsepower, network bandwidth, type of host
> interface, and version of the software would be the others. But these
> are variables I can easily control. For example, I can check whether
> the CPU and/or the network are the performance bottlenecks. The I/O
> controller, storage media, and the disk cache are encapsulated in the
> storage device, so these are not platform-specific variables as long
> as I keep using the same model of the storage device. The use of the
> page cache, however, may enhance the performance value, making the
> value platform-dependent.

Again, everything you measure will have no meaning if you don't use
realistic data. You can't simply bypass the buffer cache if the
application does not support it, and so it is pointless to measure how
an application would 'perform' in such a scenario.

> I don't want to emulate a workload. An emulated workload will most of
> the time be different from the source real-world workload. For
> example, replaying block I/O recordings generated by fio or blktrace
> will probably get different performance numbers from running the
> original workload.

And I think this is the crux of your issue. You don't want an emulated
workload, because it may not reproduce the real-world workload. Why,
then, are you trying to find a way to bypass the page/buffer cache for
an application that does not support direct IO and won't be able to run
like that? You don't want to collect data using emulated workloads, but
at the same time you want to use something that is completely removed
from reality? That does not make any sense to me.

fio can get different performance numbers? Sure, I agree, no
performance measurement tool can beat the real workload of a specific
application, but what you are trying to do doesn't beat it either, so
what's the difference?

> > Benchmarking systems is an 'art', and I am certainly not an expert
> > on it, but at first, it looks like you are trying to create a
> > 'generic benchmark' for some generic random system. And I will tell
> > you, this is not going to work well. We have tons of cases and
> > stories about people running benchmark X on system Z and it
> > performing 'well', but when running their real workload, everything
> > starts to perform poorly, exactly because they did not use the
> > correct benchmark at first.
>
> I'm not trying to create a generic benchmark. I just want to create a
> benchmark methodology focusing on evaluating the efficiency of a
> platform for running a given workload on a specific storage device.

Ok, so you want to evaluate how platform X will behave with your
application + storage. Why, then, do you want to modify that original
platform behavior, in this case by bypassing the Linux page/buffer
cache? By platform, do you mean hardware? Well, then, use the same
software stack.

> > You have several layers in a storage stack, which starts from how
> > the application handles its own IO requests. And each layer will
> > behave differently on each type of workload.
>
> My assumption is that we should run the same workload when comparing
> different platforms.

Yes, and if you don't want to use emulated workloads, you shouldn't try
to hack your software stack into behaving in weird ways. If you want to
compare platforms, make sure you use the same software stack, including
the same configuration. That's all.

> > If you are trying to measure an application's performance on
> > solution X, well, it is pointless to measure direct IO if the
> > application does not use it, or vice-versa. So, modifying an
> > application, again, is not what you will want to do for
> > benchmarking, for sure.
>
> The point is that I'm not trying to measure the performance of an
> application on solution X. I'm trying to generate a
> platform-independent reference unit for the combination of a storage
> device and the application.

You simply can't. Take any enterprise application out there and you
will see that application vendors usually certify specific combinations
of hardware + software stack. There is a reason for that. There are
many variables in the way, not only the page/buffer cache. You can't
simply bypass the page/buffer cache and expect to get some realistic
base reference unit you can work with, especially if you are not sure
how the application behaves.

If you want base reference numbers for a storage solution, use the
vendor's reference numbers. They are platform agnostic. Everything else
above that will be totally interdependent.

> I have researched different knobs provided by the kernel including
> drop_caches, cgroup, and the vm subsystem, but none of them can help
> me to measure what I want.

Because I honestly think what you are trying to measure is unrealistic :)
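Note that drop_caches in particular only empties the clean caches at a
single point in time; it does nothing to stop the page cache from
filling right back up while the workload runs, which is part of why it
can't give you what you want. For reference, this is all it does (a
minimal sketch of my own, equivalent to running
`sync; echo 3 > /proc/sys/vm/drop_caches` as root):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *f;

    /* Write back dirty data first; drop_caches only discards clean
     * pages. */
    sync();

    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* 1 = page cache, 2 = dentries and inodes, 3 = both */
    if (fputs("3", f) == EOF) {
        perror("fputs");
        fclose(f);
        return 1;
    }
    fclose(f);
    return 0;
}

It is useful for getting cold-cache numbers at the start of each run,
not for taking the cache out of the picture.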
> I would like to know whether there is a variable
> in the filesystem that defines the size of the page cache pool.

There is no such silver bullet :) The page cache is not a fixed-size
pool; it simply uses whatever memory is free and shrinks under memory
pressure.

> Also, would it be possible to convert some of the application IOs to
> DIO when they are properly aligned?

Not that I know of, but then, I'm not really an expert in the DIO code.
Maybe there's a way to fall back to buffered IO, although I don't think
so.

> Are there any places in the kernel I
> can easily change to bypass the page cache?

No.

--
Carlos