Ceph cache pool full

Dear all,

We just set up a Ceph cluster, running the latest stable release Ceph v12.2.0 (Luminous):
# ceph --version
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

The goal is to serve Ceph filesystem, for which we created 3 pools:
# ceph osd lspools
1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
where
* cephfs_data is the data pool (36 OSDs on HDDs), which is erasure-coded;
* cephfs_metadata is the metadata pool;
* cephfs_cache is the cache tier (3 OSDs on NVMes) in front of cephfs_data, with cache-mode writeback (set up roughly as sketched below).
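
For context, the tier was attached following the usual writeback cache-tiering recipe, roughly as below (this is a sketch from memory; the hit_set and sizing values shown are illustrative placeholders, not necessarily what we actually set):

# ceph osd tier add cephfs_data cephfs_cache
# ceph osd tier cache-mode cephfs_cache writeback
# ceph osd tier set-overlay cephfs_data cephfs_cache
# ceph osd pool set cephfs_cache hit_set_type bloom
# ceph osd pool set cephfs_cache target_max_bytes 1099511627776    (example: 1 TiB, placeholder only)
# ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4      (example value)
# ceph osd pool set cephfs_cache cache_target_full_ratio 0.8       (example value)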

Everything had worked fine until today, when we tried to copy a 1.3 TB file to CephFS and got a "No space left on device" error.

'ceph -s' reports a full OSD and the cluster-wide full flag:
# ceph -s
  cluster:
    id:     e18516bf-39cb-4670-9f13-88ccb7d19769
    health: HEALTH_ERR
            full flag(s) set
            1 full osd(s)
            1 pools have many more objects per pg than average

  services:
    mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
    mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
    mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
    osd: 39 osds: 39 up, 39 in
         flags full

  data:
    pools:   3 pools, 2176 pgs
    objects: 347k objects, 1381 GB
    usage:   2847 GB used, 262 TB / 265 TB avail
    pgs:     2176 active+clean

  io:
    client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
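
(If it helps, I can also post per-OSD utilisation; something like the following should show which OSD hit the full ratio. I assume it is one of the three NVMe cache OSDs, given the numbers below:)

# ceph health detail
# ceph osd df tree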

And indeed the cache pool is full:
# rados df
POOL_NAME       USED  OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS      RD  WR_OPS     WR
cephfs_cache    1381G  355385      0 710770                  0       0        0 10004954 1522G 1398063  1611G
cephfs_data         0       0      0      0                  0       0        0        0     0       0      0
cephfs_metadata 8515k      24      0     72                  0       0        0        3  3072    3953 10541k

total_objects    355409
total_used       2847G
total_avail      262T
total_space      265T

However, the data pool is completely empty! So it appears that data has only been written to the cache pool and never flushed back to the underlying data pool.
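
My first guess is that the tier's flush thresholds were never reached (or never set), so nothing triggers flushing to cephfs_data. These are the standard cache-tiering knobs I plan to check, plus a manual flush as a test; the setting names below are the stock ones from the docs, not values taken from our cluster:

# ceph osd pool get cephfs_cache target_max_bytes
# ceph osd pool get cephfs_cache target_max_objects
# ceph osd pool get cephfs_cache cache_target_dirty_ratio
# ceph osd pool get cephfs_cache cache_target_full_ratio
# rados -p cephfs_cache cache-flush-evict-all    (manually flush/evict everything from the cache tier)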

I am really at a loss as to whether this is due to a setup error on my part or a Luminous bug. Could anyone shed some light on this? Please let me know if you need any further info.

Best,
Shaw
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
