Disks are filling up

Hi list,

We created a cluster to provide CephFS for a Kubernetes cluster. For a few weeks now the cluster has been filling up at an alarming rate
(about 100 GB per day).
This is happening while the most relevant PG is being deep scrubbed; that scrub has been interrupted a few times.

We use about 150 G on the CephFS filesystem (according to du on the mounted filesystem) and try not to use snapshots (.snap directories "exist" but are empty). We do not understand why the PGs keep getting bigger while CephFS stays about the same size (overwrites of files certainly happen).
I suspect some snapshot mechanism. Any ideas on how to debug this and stop it?
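
In case it helps, this is roughly what we plan to run next to look for leftover snapshot data (only a sketch; the mount point is a placeholder and I am assuming rados df and ceph osd pool ls detail are the right places to look):

ls /mnt/cephfs/.snap                                  # should list nothing if no CephFS snapshots exist
ceph osd pool ls detail | grep rancherFsPoolMainData  # check for removed_snaps / removed_snaps_queue entries
rados df                                              # CLONES should stay 0 for the data pools if no snapshot clones are kept
ceph df detail                                        # watch STORED vs USED per pool over time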

Maybe we should try to speed up the deep scrubbing somehow?
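
If speeding up scrubbing is the way to go, this is the kind of change we had in mind (again only a sketch, assuming these are the relevant options and that the values are sane for a 3-OSD cluster; 6.0 is my guess for the single PG of rancherFsPoolMainData):

ceph config set osd osd_max_scrubs 2              # allow a second concurrent scrub per OSD
ceph config set osd osd_scrub_load_threshold 5    # do not postpone scrubs because of load
ceph pg deep-scrub 6.0                            # kick the deep scrub of the big PG by hand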

ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)

  cluster:
    id:     ece0290c-cd32-11ec-a0e2-005056a9dd02
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            3 nearfull osd(s)
            13 pool(s) nearfull

  services:
    mon: 3 daemons, quorum acdh-gluster-hdd3,acdh-gluster-hdd1,acdh-gluster-hdd2 (age 3d)
    mgr: acdh-gluster-hdd3.kzsplh(active, since 5d), standbys: acdh-gluster-hdd2.kiotbg, acdh-gluster-hdd1.ywgyfx
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 4d), 3 in (since 7w)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 292 pgs
    objects: 167.25M objects, 1.2 TiB
    usage:   3.3 TiB used, 1.2 TiB / 4.5 TiB avail
    pgs:     290 active+clean
             1   active+clean+scrubbing+deep
             1   active+clean+scrubbing

  io:
    client:   58 MiB/s rd, 3.6 MiB/s wr, 51 op/s rd, 148 op/s wr

rancher-ceph-fs - 227 clients
===============
RANK  STATE                 MDS                    ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  ceph-mds.acdh-gluster-hdd1.pqydya  Reqs:   68 /s   793k   792k   102k   210k
          POOL              TYPE     USED  AVAIL
 rancherFsPoolMetadata    metadata   160G   329G
rancherFsPoolDefaultData    data    2268k   329G
 rancherFsPoolMainData      data    2584G   658G
           STANDBY MDS
ceph-mds.acdh-gluster-hdd2.zfleqe
ceph-mds.acdh-gluster-hdd3.etaobl
MDS version: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)

(rancherFsPoolMainData is a 2+1 erasure encoded pool)
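
(The raw overhead itself adds up: 2+1 erasure coding writes 3 chunks for every 2 data chunks, a factor of 1.5, and 1.7 TiB stored × 1.5 ≈ 2.5 TiB used, which matches the pool listing below. What does not add up for us is the ~150 G visible via du versus the 1.7 TiB stored in that pool.)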

--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    4.5 TiB  1.2 TiB  3.3 TiB   3.3 TiB      73.46
TOTAL  4.5 TiB  1.2 TiB  3.3 TiB   3.3 TiB      73.46

--- POOLS ---
POOL                        ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics        1    1      0 B        0      0 B      0    331 GiB
rancher-rbd-erasure          2   32  8.4 GiB    2.16k   13 GiB   1.25    661 GiB
rancher-rbd-meta             3   32     55 B       11   36 KiB      0    331 GiB
rancherFsPoolMetadata        4   32   53 GiB    5.18M  160 GiB  13.88    331 GiB
rancherFsPoolDefaultData     5    1   29 KiB   80.00M  2.2 MiB      0    331 GiB
rancherFsPoolMainData        6    1  1.7 TiB   82.08M  2.5 TiB  72.23    661 GiB
.rgw.root                    7   32  1.3 KiB        4   48 KiB      0    331 GiB
default.rgw.log              8   32  3.6 KiB      209  408 KiB      0    331 GiB
default.rgw.control          9   32      0 B        8      0 B      0    331 GiB
default.rgw.meta            10   32  3.8 KiB       11  124 KiB      0    331 GiB
default.rgw.buckets.index   11   32  2.4 MiB       33  7.2 MiB      0    331 GiB
default.rgw.buckets.non-ec  12   32      0 B        0      0 B      0    331 GiB
default.rgw.buckets.data    14    1   55 GiB   16.57k   83 GiB   7.70    661 GiB

HEALTH_WARN 1 MDSs report slow metadata IOs; 3 nearfull osd(s); 13 pool(s) nearfull
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ceph-mds.acdh-gluster-hdd1.pqydya(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 306 secs
[WRN] OSD_NEARFULL: 3 nearfull osd(s)
    osd.0 is near full
    osd.2 is near full
    osd.3 is near full
[WRN] POOL_NEARFULL: 13 pool(s) nearfull
    pool 'device_health_metrics' is nearfull
    pool 'rancher-rbd-erasure' is nearfull
    pool 'rancher-rbd-meta' is nearfull
    pool 'rancherFsPoolMetadata' is nearfull
    pool 'rancherFsPoolDefaultData' is nearfull
    pool 'rancherFsPoolMainData' is nearfull
    pool '.rgw.root' is nearfull
    pool 'default.rgw.log' is nearfull
    pool 'default.rgw.control' is nearfull
    pool 'default.rgw.meta' is nearfull
    pool 'default.rgw.buckets.index' is nearfull
    pool 'default.rgw.buckets.non-ec' is nearfull
    pool 'default.rgw.buckets.data' is nearfull

(the nearfull ratio is set to 0.66)
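(For reference, as far as I know that is the value controlled by ceph osd set-nearfull-ratio; the default would be 0.85.)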

--
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons
Bäckerstraße 13, 1010 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.siam@xxxxxxxxxx | www.oeaw.ac.at/acdh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



