Re: RBD over cache tier over EC pool: rbd rm doesn't remove objects

Sage Weil <sage@xxxxxxxxxxxx> · Wed, 28 Jan 2015 06:31:52 -0800 (PST)

On Wed, 28 Jan 2015, Irek Fasikhov wrote:
> Sage,
> Is there an option to make object deletion bypass the cache tier pool?

There's currently no knob or hint to do that.  It would be pretty simple 
to add, but it's a heuristic that only works for certain workloads.

sage

> Thanks
> 
> Wed Jan 28 2015 at 5:13:36 PM, Irek Fasikhov <malmyzh@xxxxxxxxx>:
>       Hi, Sage.
> 
>       Yes, Firefly.
>       [root@ceph05 ~]# ceph --version
> ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
> 
> Yes, I have seen this behavior.
> 
> [root@ceph08 ceph]# rbd info vm-160-disk-1
> rbd image 'vm-160-disk-1':
>         size 32768 MB in 8192 objects
>         order 22 (4096 kB objects)
>         block_name_prefix: rbd_data.179faf52eb141f2
>         format: 2
>         features: layering
>         parent: rbd/base-145-disk-1@__base__
>         overlap: 32768 MB
> [root@ceph08 ceph]# rbd rm vm-160-disk-1
> Removing image: 100% complete...done.
> [root@ceph08 ceph]# rbd info vm-160-disk-1
> 2015-01-28 10:39:01.595785 7f1fbea9e760 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
> rbd: error opening image vm-160-disk-1: (2) No such file or directory
> 
> [root@ceph08 ceph]# rados -p rbdcache ls | grep 179faf52eb141f2 | wc
>    5944    5944  249633
> [root@ceph08 ceph]# rados -p rbdcache ls | grep 179faf52eb141f2 | wc
>    5857    5857  245979
> [root@ceph08 ceph]# rados -p rbd ls | grep 179faf52eb141f2 | wc
>    4377    4377  183819
> [root@ceph08 ceph]# rados -p rbdcache ls | grep 179faf52eb141f2 | wc
>    5017    5017  210699
> [root@ceph08 ceph]# rados -p rbdcache ls | grep 179faf52eb141f2 | wc
>    5015    5015  210615
> [root@ceph08 ceph]# rados -p rbd ls | grep 179faf52eb141f2 | wc
> [root@ceph08 ceph]# rados -p rbdcache ls | grep 179faf52eb141f2 | wc
>    1986    1986   83412
> [root@ceph08 ceph]# rados -p rbd ls | grep 179faf52eb141f2 | wc
>     981     981   41202
> [root@ceph08 ceph]# rados -p rbd ls | grep 179faf52eb141f2 | wc
>     802     802   33684
> [root@ceph08 ceph]# rados -p rbdcache ls | grep 179faf52eb141f2 | wc
>    1611    1611   67662
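The repeated `rados ls | grep | wc` checks above can be wrapped in a small helper so the remaining object count for one image prefix is easier to watch. This is only a sketch: `count_prefix` is a hypothetical name, not part of rados or rbd, and the pool names in the comments are the ones used in this thread.

```shell
# Count object names on stdin that contain the given image prefix.
# Hypothetical helper; it only filters whatever listing is piped in.
count_prefix() {
    # -F: treat the prefix as a fixed string; -c: print the number of matches.
    # grep exits 1 when nothing matches, so mask that for callers using set -e.
    grep -c -F -- "$1" || true
}

# Intended use against a live cluster (not executed here):
#   rados -p rbdcache ls | count_prefix 179faf52eb141f2
#   rados -p rbd ls | count_prefix 179faf52eb141f2
```

Running both commands a few minutes apart should show the counts falling in the cache pool and the base pool, as in the transcript above.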
> 
> Thanks, Sage!
> 
> 
> Tue Jan 27 2015 at 7:01:43 PM, Sage Weil <sage@xxxxxxxxxxxx>:
> 
>       On Tue, 27 Jan 2015, Irek Fasikhov wrote:
>       > Hi, all.
>       > Indeed, there is a problem. After removing 1 TB of data, the space on
>       > the cluster is not freed. Is this expected behavior or a bug? And how
>       > long will it take to be cleaned up?
> 
>       Your subject says cache tier but I don't see it in the 'ceph df' output
>       below.  The cache tiers will store 'whiteout' objects that cache object
>       non-existence that could be delaying some deletion.  You can wrangle the
>       cluster into flushing those with
> 
>        ceph osd pool set <cachepool> cache_target_dirty_ratio .05
> 
>       (though you'll probably want to change it back to the default .4 later).
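A minimal sketch of that lower-then-restore sequence, for illustration only. The function name `flush_whiteouts`, the `CEPH` override variable, and the fixed wait are assumptions introduced here; the thread does not say how long flushing takes, so the wait is a placeholder to tune by watching object counts.

```shell
# Assumption: CEPH may be overridden (e.g. CEPH=echo) to dry-run the commands.
CEPH="${CEPH:-ceph}"

# Hypothetical helper: lower the dirty ratio, wait, then put it back.
# $1 = cache pool, $2 = temporary ratio, $3 = ratio to restore, $4 = wait in seconds
flush_whiteouts() {
    pool="$1"
    low="${2:-.05}"
    restore="${3:-.4}"
    wait_secs="${4:-600}"   # placeholder wait, not a value from the thread
    $CEPH osd pool set "$pool" cache_target_dirty_ratio "$low" || return 1
    sleep "$wait_secs"
    $CEPH osd pool set "$pool" cache_target_dirty_ratio "$restore"
}

# Dry run:  CEPH=echo; flush_whiteouts rbdcache .05 .4 0
```

Restoring the ratio in the same function avoids leaving the pool at .05 by accident.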
> 
>       If there's no cache tier involved, there may be another problem.  What
>       version is this?  Firefly?
> 
>       sage
> 
>       >
>       > Sat Sep 20 2014 at 8:19:24 AM, Mikaël Cluseau <mcluseau@xxxxxx>:
>       >       Hi all,
>       >
>       >       I have weird behaviour on my firefly "test + convenience
>       >       storage" cluster. It consists of 2 nodes with a light imbalance
>       >       in available space:
>       >
>       >       # id    weight    type name    up/down    reweight
>       >       -1    14.58    root default
>       >       -2    8.19        host store-1
>       >       1    2.73            osd.1    up    1   
>       >       0    2.73            osd.0    up    1   
>       >       5    2.73            osd.5    up    1   
>       >       -3    6.39        host store-2
>       >       2    2.73            osd.2    up    1   
>       >       3    2.73            osd.3    up    1   
>       >       4    0.93            osd.4    up    1   
>       >
>       >       I used to store ~8TB of rbd volumes, coming to a near-full
>       >       state. There were some annoying "stuck misplaced" PGs so I began
>       >       to remove 4.5TB of data; the weird thing is: the space hasn't
>       >       been reclaimed on the OSDs, they stayed stuck around 84% usage.
>       >       I tried to move PGs around and it turns out that the space is
>       >       correctly "reclaimed" if I take an OSD out, let it empty its XFS
>       >       volume and then take it in again.
>       >
>       >       I'm currently applying this to each OSD in turn, but I thought it
>       >       could be worth telling about this. The current ceph df output
>       >       is:
>       >
>       >       GLOBAL:
>       >           SIZE       AVAIL     RAW USED     %RAW USED
>       >           12103G     5311G     6792G        56.12    
>       >       POOLS:
>       >           NAME            ID     USED       %USED     OBJECTS
>       >           data            0      0          0         0
>       >           metadata        1      0          0         0
>       >           rbd             2      444G       3.67      117333
>       >       [...]
>       >           archives-ec     14     3628G      29.98     928902
>       >           archives        15     37518M     0.30      273167
>       >
>       >       Before "just moving data", AVAIL was around 3TB.
>       >
>       >       I finished the process with the OSDs on store-1, which show the
>       >       following space usage now:
>       >
>       >       /dev/sdb1             2.8T  1.4T  1.4T  50% /var/lib/ceph/osd/ceph-0
>       >       /dev/sdc1             2.8T  1.3T  1.5T  46% /var/lib/ceph/osd/ceph-1
>       >       /dev/sdd1             2.8T  1.3T  1.5T  48% /var/lib/ceph/osd/ceph-5
>       >
>       >       I'm currently fixing OSD 2, 3 will be the last one to be fixed.
>       >       The df on store-2 shows the following:
>       >
>       >       /dev/sdb1               2.8T  1.9T  855G  70% /var/lib/ceph/osd/ceph-2
>       >       /dev/sdc1               2.8T  2.4T  417G  86% /var/lib/ceph/osd/ceph-3
>       >       /dev/sdd1               932G  481G  451G  52% /var/lib/ceph/osd/ceph-4
>       >
>       >       OSD 2 was at 84% 3h ago, and OSD 3 was ~75%.
>       >
>       >       During rbd rm (that took a bit more than 3 days), ceph log was
>       >       showing things like that:
>       >
>       >       2014-09-03 16:17:38.831640 mon.0 192.168.1.71:6789/0 417194 :
>       >       [INF] pgmap v14953987: 3196 pgs: 2882 active+clean, 314
>       >       active+remapped; 7647 GB data, 11067 GB used, 3828 GB / 14896 GB
>       >       avail; 0 B/s rd, 6778 kB/s wr, 18 op/s; -5/5757286 objects
>       >       degraded (-0.000%)
>       >       [...]
>       >       2014-09-05 03:09:59.895507 mon.0 192.168.1.71:6789/0 513976 :
>       >       [INF] pgmap v15050766: 3196 pgs: 2882 active+clean, 314
>       >       active+remapped; 6010 GB data, 11156 GB used, 3740 GB / 14896 GB
>       >       avail; 0 B/s rd, 0 B/s wr, 8 op/s; -388631/5247320 objects
>       >       degraded (-7.406%)
>       >       [...]
>       >       2014-09-06 03:56:50.008109 mon.0 192.168.1.71:6789/0 580816 :
>       >       [INF] pgmap v15117604: 3196 pgs: 2882 active+clean, 314
>       >       active+remapped; 4865 GB data, 11207 GB used, 3689 GB / 14896 GB
>       >       avail; 0 B/s rd, 6117 kB/s wr, 22 op/s; -706519/3699415 objects
>       >       degraded (-19.098%)
>       >       2014-09-06 03:56:44.476903 osd.0 192.168.1.71:6805/11793 729 :
>       >       [WRN] 1 slow requests, 1 included below; oldest blocked for >
>       >       30.058434 secs
>       >       2014-09-06 03:56:44.476909 osd.0 192.168.1.71:6805/11793 730 :
>       >       [WRN] slow request 30.058434 seconds old, received at 2014-09-06
>       >       03:56:14.418429: osd_op(client.19843278.0:46081
>       >       rb.0.c7fd7f.238e1f29.00000000b3fa [delete] 15.b8fb7551
>       >       ack+ondisk+write e38950) v4 currently waiting for blocked object
>       >       2014-09-06 03:56:49.477785 osd.0 192.168.1.71:6805/11793 731 :
>       >       [WRN] 2 slow requests, 1 included below; oldest blocked for >
>       >       35.059315 secs
>       >       [... stabilizes here:]
>       >       2014-09-06 22:13:48.771531 mon.0 192.168.1.71:6789/0 632527 :
>       >       [INF] pgmap v15169313: 3196 pgs: 2882 active+clean, 314
>       >       active+remapped; 4139 GB data, 11215 GB used, 3681 GB / 14896 GB
>       >       avail; 64 B/s rd, 64 B/s wr, 0 op/s; -883219/3420796 objects
>       >       degraded (-25.819%)
>       >       [...]
>       >       2014-09-07 03:09:48.491325 mon.0 192.168.1.71:6789/0 633880 :
>       >       [INF] pgmap v15170666: 3196 pgs: 2882 active+clean, 314
>       >       active+remapped; 4139 GB data, 11215 GB used, 3681 GB / 14896 GB
>       >       avail; 18727 B/s wr, 2 op/s; -883219/3420796 objects degraded
>       >       (-25.819%)
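The pgmap lines above show "data" shrinking (7647 GB to 4139 GB) while "used" barely moves (11067 GB to 11215 GB). A small filter can pull those two figures out of a log so the trend is easier to see. This is a sketch, assuming unwrapped one-entry-per-line pgmap messages (as in the cluster log or `ceph -w` output, not the wrapped form quoted in mail); `pgmap_usage` is a hypothetical name.

```shell
# Print "<data GB> <used GB>" for every pgmap line read from stdin.
# Hypothetical helper; lines that do not carry both figures in GB are skipped.
pgmap_usage() {
    sed -n 's/.*; \([0-9][0-9]*\) GB data, \([0-9][0-9]*\) GB used.*/\1 \2/p'
}

# Intended use (not executed here; the log path is an assumption):
#   grep pgmap /var/log/ceph/ceph.log | pgmap_usage
```

If the second column stays flat while the first keeps dropping, deletions are being acknowledged but space is not being released on the OSDs, which is the symptom described in this message.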
>       >
>       >       And now, during data movement I described before:
>       >
>       >       2014-09-20 15:16:13.394694 mon.0 [INF] pgmap v15344707: 3196
>       >       pgs: 2132 active+clean, 432 active+remapped+wait_backfill, 621
>       >       active+remapped, 11 active+remapped+backfilling; 4139 GB data,
>       >       6831 GB used, 5271 GB / 12103 GB avail; 379097/3792969 objects
>       >       degraded (9.995%)
>       >
>       >       If some ceph developer wants me to do something or to provide
>       >       some data, please say so quickly, I will probably process OSD 3
>       >       in ~16-20h.
>       >       (of course, I'd prefer not to lose the data btw :-))
>       > _______________________________________________
>       > ceph-users mailing list
>       > ceph-users@xxxxxxxxxxxxxx
>       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com