Re: Include BlueFS DB space in total/used stats or not?

On 10/22/2018 10:30 PM, Mark Nelson wrote:

On 10/22/18 2:12 PM, Sage Weil wrote:
On Mon, 22 Oct 2018, Mark Nelson wrote:
On 10/22/18 12:18 PM, Igor Fedotov wrote:

On 10/22/2018 7:49 PM, Sage Weil wrote:
On Mon, 22 Oct 2018, Igor Fedotov wrote:
Hi folks,

While doing the last cleanup for https://github.com/ceph/ceph/pull/19454,
I realized that we still include the space of a separate DB volume in the
total space reported by BlueStore (and hence treat it as used, too).

It seems we had such a discussion a while ago, but unfortunately I don't
recall the outcome.


See BlueStore::statfs(...)

  if (bluefs) {
    // include dedicated db, too, if that isn't the shared device.
    if (bluefs_shared_bdev != BlueFS::BDEV_DB) {
      buf->total += bluefs->get_total(BlueFS::BDEV_DB);
    }
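
For reference, a minimal sketch (not the actual Ceph code) of how 'total'
differs between the two options discussed below; the field and helper names
mirror the excerpt above, and include_db_in_total is a purely hypothetical
switch for illustration:

   uint64_t total = bdev->get_size();    // main (slow) block device
   if (bluefs && include_db_in_total &&              // option (1) behaviour
       bluefs_shared_bdev != BlueFS::BDEV_DB) {
     total += bluefs->get_total(BlueFS::BDEV_DB);    // add dedicated DB volume
   }
   // option (2): leave the dedicated DB volume out of 'total' and export
   // the BlueFS total/avail separately instead.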

I'm not sure if there is any rationale behind that, and I have a strong
desire to remove it from the 'total' calculation.

Just want to share two options for the new 'ceph df' output.

3x OSD config: 10 GiB block device + 1 GiB DB device + 1 GiB WAL device.
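
(For the outputs below that presumably means 3 x 10 GiB = 30 GiB of raw
block space, plus 3 x 1 GiB = 3 GiB of DB space and 3 x 1 GiB = 3 GiB of
WAL space; 30 GiB of block plus the 3 GiB of DB gives the 33 GiB SIZE in
option 1.)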

1) DB space included.

GLOBAL:
      SIZE       AVAIL      USED       RAW USED    %RAW USED
      33 GiB     27 GiB     2.9 GiB    5.9 GiB     18.01
POOLS:
      NAME                  ID     STORED     OBJECTS    USED       %USED    MAX AVAIL
      cephfs_data_a         1      0 B        0          0 B        0        8.9 GiB
      cephfs_metadata_a     2      2.2 KiB    22         384 KiB    0        8.9 GiB

2) DB space isn't included.

GLOBAL:
      SIZE       AVAIL      USED       RAW USED    %RAW USED
      30 GiB     27 GiB     1.9 MiB    3.0 GiB     10.01
POOLS:
      NAME                  ID     STORED     OBJECTS    USED       %USED    MAX AVAIL
      cephfs_data_a         1      0 B        0          0 B        0        8.9 GiB
      cephfs_metadata_a     2      2.2 KiB    22         384 KiB    0        8.9 GiB

So for the first case, GLOBAL SIZE includes the space of both the block and
DB devices. RAW USED includes 3 GiB for the separate BlueFS volumes plus
~3 GiB permanently reserved for BlueFS on the slow device per the
bluestore_bluefs_min_free config parameter, not to mention the space
actually allocated for user data.

And AVAIL is equal to SIZE - RAW USED.

For the second option (which I'm inclined to prefer), all the numbers
exclude that DB device space and hence IMO provide a more consistent
picture.

Please note that the 3 x 1 GiB allocated for the WAL isn't taken into
account in either case.

So the question is: which variant do we prefer? Are there any reasons to
account for DB space here?

Doesn't it make sense to export BlueFS stats (total/avail) separately,
which (along with the existing "internal_metadata" and "omap_allocated"
fields) would allow building a complete and consistent view of DB space
usage if needed?
I think this is the key.  The problem is that the db space is 'available' space, but only for omap (or metadata).  I think the way to complete the
picture may be to start with (1), but then also expose separate
data_available and omap_available values?
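
(For illustration only: a rough sketch, assuming a statfs-style struct, of
what exposing those values might look like; the data_available and
omap_available names come from the suggestion above, everything else is
hypothetical.)

   #include <cstdint>

   // Hypothetical sketch, not the actual Ceph API.
   struct statfs_sketch {
     uint64_t total = 0;           // raw size of the devices counted, as in (1)
     uint64_t available = 0;       // total - raw used
     uint64_t data_available = 0;  // space still usable for object data
     uint64_t omap_available = 0;  // space still usable for omap/metadata
                                   // (e.g. the dedicated DB volume)
   };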
Hence, if going with (1), we'd get something like this one day (given that
the DB uses 1 GiB):

GLOBAL:
      SIZE       AVAIL      DATA_AVAIL   USED       RAW USED    %RAW USED
      33 GiB     29 GiB     27 GiB       2.9 GiB    5.9 GiB     18.01

Not very transparent IMO.
I'd prefer to split data and metadata usage and have something like the
following:

DATA:
      SIZE       AVAIL      USED       RAW USED    %RAW USED
      30 GiB     27 GiB     1.9 MiB    3.0 GiB     10.01
META:
      SIZE       AVAIL      USED       OMAP_USED   %USED
      6 GiB      5 GiB      1 GiB      256 MiB     ZZZ
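
(Presumably the META SIZE of 6 GiB here is the 3 x 1 GiB dedicated DB
volumes plus the ~3 GiB reserved for BlueFS on the slow devices via
bluestore_bluefs_min_free, mentioned earlier; that breakdown is an
assumption, not stated explicitly above.)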

That would be fantastic!  I'm all for it.
I don't think it makes sense to separate it out completely like that.
The main part of the device can be consumed by either omap or data, so you
would end up including it in both the SIZE for data and meta.

I think it would be simpler (and more accurate) to show something
like:

SIZE      DATA_USED   DATA_AVAIL   OMAP_USED   OMAP_AVAIL   RAW USED   %RAW USED
33 GiB    1 GiB       27 GiB       6 GiB       2.9 GiB      8.1 GiB    38.01

or similar?

sage

On reflection, the first questions I often get asked are "What's the difference between the DB and the WAL?" and "How big should the DB and WAL partitions be?"  Usually when people hear that the WAL is relatively small they lose interest in it, but a semi-common followup question is "How do I tell how full the DB partition is?"  I'm not sure how many of our users really understand or care about OMAP/onodes/blobs/extents/etc.  I think it's more about trying to verify that they've provisioned a reasonable amount of flash storage per node, and they want us to give them some metric by which to define "reasonable".  Right now that metric seems to be "DB fits entirely on flash".

Yeah, and this makes me dream of a 'df' output which shows whether the "DB fits entirely on flash".


Mark