Re: disk usage reported incorrectly

Fix is on its way too...

See https://github.com/ceph/ceph/pull/28978

On 7/17/2019 8:55 PM, Paul Mezzanini wrote:
Oh my.  That's going to hurt with 788 OSDs.   Time for some creative shell scripts and stepping through the nodes.  I'll report back.
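
Probably something along these lines, run node by node -- the OSD ids come from the local data dir, and the exact ceph-objectstore-tool invocation is a guess based on Igor's note below, so treat it as a sketch and sanity-check it against the PR before pointing it at production:

# run on each OSD node, one OSD at a time so the node never has more
# than one OSD down; `ceph osd set noout` first so nothing rebalances
for osd in /var/lib/ceph/osd/ceph-*; do
    id=${osd##*-}                                  # ceph-12 -> 12
    systemctl stop ceph-osd@"$id"
    ceph-objectstore-tool --data-path "$osd" --op repair
    systemctl start ceph-osd@"$id"
done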

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec@xxxxxxx

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.
------------------------

________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: Wednesday, July 17, 2019 11:33 AM
To: Paul Mezzanini; ceph-users@xxxxxxxxxxxxxx
Subject: Re:  disk usage reported incorrectly

Forgot to provide a workaround...

If that's the case, then you need to repair each OSD with the corresponding
command in ceph-objectstore-tool...
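
Roughly, per OSD (this assumes the default /var/lib/ceph/osd/ceph-<id> data path; please check the PR above for the exact op to use on your release):

systemctl stop ceph-osd@<id>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op repair
systemctl start ceph-osd@<id>

(On BlueStore OSDs, `ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>` should amount to the same thing.)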

Thanks,

Igor.


On 7/17/2019 6:29 PM, Paul Mezzanini wrote:
Sometime after our upgrade to Nautilus, our disk usage statistics went off the rails.  I can't tell you exactly when it broke, but I know it worked for at least a bit after the initial upgrade.

Correct numbers should be something similar to the following (copy/pasted from the autoscale-status report):

POOL               SIZE
cephfs_metadata    327.1G
cold-ec            98.36T
ceph-bulk-3r       142.6T
cephfs_data        31890G
ceph-hot-2r        5276G
kgcoe-cinder       103.2T
rbd                3098
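
(For reference, these come out of the command below; I've trimmed the output down to just the POOL and SIZE columns.)

ceph osd pool autoscale-status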


Instead, we now show:

POOL               SIZE
cephfs_metadata    362.9G    (correct)
cold-ec            607.2G    (wrong)
ceph-bulk-3r       5186G     (wrong)
cephfs_data        1654G     (wrong)
ceph-hot-2r        5884G     (correct, I think)
kgcoe-cinder       5761G     (wrong)
rbd                128.0k


`ceph fs status` reports similar numbers.  cold-ec, ceph-hot-2r and cephfs_data are all cephfs data pools, and cephfs_metadata is, unsurprisingly, the cephfs metadata pool.  The remaining pools are all used for rbd.


Interestingly, the `ceph df` output for raw storage looks correct for each drive class, while the pool usage is wrong:

RAW STORAGE:
      CLASS         SIZE        AVAIL       USED        RAW USED     %RAW USED
      hdd           6.3 PiB     5.2 PiB     1.1 PiB      1.1 PiB         17.08
      nvme          175 TiB     161 TiB      14 TiB       14 TiB          7.82
      nvme-meta      14 TiB      11 TiB     2.2 TiB      2.5 TiB         18.45
      TOTAL         6.5 PiB     5.4 PiB     1.1 PiB      1.1 PiB         16.84

POOLS:
      POOL                ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
      kgcoe-cinder        24     1.9 TiB      29.49M     5.6 TiB      0.32       582 TiB
      ceph-bulk-3r        32     1.7 TiB      88.28M     5.1 TiB      0.29       582 TiB
      cephfs_data         35     518 GiB     135.68M     1.6 TiB      0.09       582 TiB
      cephfs_metadata     36     363 GiB       5.63M     363 GiB      3.35       3.4 TiB
      rbd                 37       931 B           5     128 KiB         0       582 TiB
      ceph-hot-2r         50     5.7 TiB      18.63M     5.7 TiB      3.72        74 TiB
      cold-ec             51     417 GiB     105.23M     607 GiB      0.02       2.1 PiB
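
For what it's worth, the cephfs pool numbers can be sanity-checked independently of the OSD-reported pool stats using the recursive stats xattrs on a mounted tree (the mount point here is just an example):

# recursive byte/file counts maintained by the MDS, not by the OSDs
getfattr -n ceph.dir.rbytes /mnt/cephfs
getfattr -n ceph.dir.rfiles /mnt/cephfs

ceph.dir.rbytes should roughly track the sum of the STORED column across the cephfs data pools.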


Everything is on "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)" and kernel 5.0.21 or 5.0.9; I'm patching now to bring the ceph cluster up to 5.0.21, the same as the clients.  I'm not really sure where to start digging into this one.  Everything is working fine except the disk usage reporting, which also completely blows up the autoscaler.

I feel like the question is obvious but I'll state it anyway.  How do I get this issue resolved?

Thanks
-paul

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec@xxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


