Re: Troubleshooting an erasure coded pool with a cache tier

Christian Balzer <chibi@xxxxxxx> · Tue, 18 Nov 2014 10:11:27 +0900

Hello,

On Mon, 17 Nov 2014 17:45:54 +0100 Laurent GUERBY wrote:

> Hi,
> 
> Just a follow-up on this issue, we're probably hitting:
> 
> http://tracker.ceph.com/issues/9285
> 
> We had the issue a few weeks ago with replicated SSD pool in front of
> rotational pool and turned off cache tiering. 
> 
> Yesterday we made a new test and activating cache tiering on a single
> erasure pool threw the whole ceph cluster performance to the floor
> (including non cached non erasure coded pools) with frequent "slow
> write" in the logs. Removing cache tiering was enough to go back to
> normal performance.
>
Ouch!

> I assume no one use cache tiering on 0.80.7 in production clusters?
>
Not me and now I'm even less inclined to do so. 
Since this particular item is not the first one that puts cache tiers in
doubt, but certainly the most compelling one.

I wonder how much pressure was on that cache tier, though. 
If I understand the bug report correctly, this should only happen if
some object gets evicted before it was fully replicated.
So I suppose if the cache pool is sized "correctly" for the working set in
question (which of course is a bugger given a 4MB granularity), things
should work. Until you hit the threshold and they don't anymore...

Given that this isn't fixed in Giant either, there goes my plan to speed
up a cluster with ample space but insufficient IOPS with cache tiering.

Christian

> Sincerely,
> 
> Laurent
> 
> Le Sunday 09 November 2014 à 00:24 +0100, Loic Dachary a écrit :
> > 
> > On 09/11/2014 00:03, Gregory Farnum wrote:
> > > It's all about the disk accesses. What's the slow part when you dump
> > > historic and in-progress ops?
> > 
> > This is what I see on g1 (6% iowait)
> > 
> > root@g1:~# ceph daemon osd.0 dump_ops_in_flight
> > { "num_ops": 0,
> >   "ops": []}
> > 
> > root@g1:~# ceph daemon osd.0 dump_ops_in_flight
> > { "num_ops": 1,
> >   "ops": [
> >         { "description": "osd_op(client.4407100.0:11030174
> > rb.0.410809.238e1f29.000000001038 [set-alloc-hint object_size 4194304
> > write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write
> > e15613)", "received_at": "2014-11-09 00:14:17.385256", "age":
> > "0.538802", "duration": "0.011955", "type_data": [
> >                 "waiting for sub ops",
> >                 { "client": "client.4407100",
> >                   "tid": 11030174},
> >                 [
> >                     { "time": "2014-11-09 00:14:17.385393",
> >                       "event": "waiting_for_osdmap"},
> >                     { "time": "2014-11-09 00:14:17.385563",
> >                       "event": "reached_pg"},
> >                     { "time": "2014-11-09 00:14:17.385793",
> >                       "event": "started"},
> >                     { "time": "2014-11-09 00:14:17.385807",
> >                       "event": "started"},
> >                     { "time": "2014-11-09 00:14:17.385875",
> >                       "event": "waiting for subops from 1,10"},
> >                     { "time": "2014-11-09 00:14:17.386201",
> >                       "event": "commit_queued_for_journal_write"},
> >                     { "time": "2014-11-09 00:14:17.386336",
> >                       "event": "write_thread_in_journal_buffer"},
> >                     { "time": "2014-11-09 00:14:17.396293",
> >                       "event": "journaled_completion_queued"},
> >                     { "time": "2014-11-09 00:14:17.396332",
> >                       "event": "op_commit"},
> >                     { "time": "2014-11-09 00:14:17.396678",
> >                       "event": "op_applied"},
> >                     { "time": "2014-11-09 00:14:17.397211",
> >                       "event": "sub_op_commit_rec"}]]}]}
> > 
> > and it looks ok. When I go to n7 which has 20% iowait, I see a much
> > larger output http://pastebin.com/DPxsaf6z which includes a number of
> > "event": "waiting_for_osdmap".
> > 
> > I'm not sure what to make of this and it would certainly be better if
> > n7 had a lower iowait. Also when I ceph -w I see a new pgmap is
> > created every second which is also not a good sign.
> > 
> > 2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460
> > active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail;
> > 3889 B/s rd, 2125 kB/s wr, 237 op/s 2014-11-09 00:22:48.143412 mon.0
> > [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB
> > used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s 2014-11-09
> > 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean;
> > 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88
> > op/s 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs:
> > 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB
> > avail; 412 kB/s wr, 130 op/s 2014-11-09 00:22:51.281294 mon.0 [INF]
> > pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used,
> > 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s 2014-11-09
> > 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean;
> > 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd,
> > 2762 kB/s wr, 206 op/s
> > 
> > Cheers
> > 
> > 
> > > On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary <loic@xxxxxxxxxxx
> > > <mailto:loic@xxxxxxxxxxx>> wrote:
> > > 
> > >     Hi Greg,
> > > 
> > >     On 08/11/2014 20:19, Gregory Farnum wrote:> When acting as a
> > > cache pool it needs to go do a lookup on the base pool for every
> > > object it hasn't encountered before. I assume that's why it's slower.
> > >     > (The penalty should not be nearly as high as you're seeing
> > >     > here, but based on the low numbers I imagine you're running
> > >     > everything on an overloaded laptop or something.)
> > > 
> > >     It's running on a small cluster that is busy but not to a point
> > > that I expect such a difference:
> > > 
> > >     # dsh --concurrent-shell --show-machine-names
> > > --remoteshellopt=-p2222 -m g1 -m g2 -m g3 -m n7 -m stri dstat -c 10
> > > 3 g1: ----total-cpu-usage---- g1: usr sys idl wai hiq siq
> > >     g1:   6   1  88   6   0   0
> > >     g2: ----total-cpu-usage----
> > >     g2: usr sys idl wai hiq siq
> > >     g2:   4   1  88   7   0   0
> > >     n7: ----total-cpu-usage----
> > >     n7: usr sys idl wai hiq siq
> > >     n7:  18   3  58  20   0   1
> > >     stri: ----total-cpu-usage----
> > >     stri: usr sys idl wai hiq siq
> > >     stri:   6   1  86   6   0   0
> > >     g3: ----total-cpu-usage----
> > >     g3: usr sys idl wai hiq siq
> > >     g3:  37   2  55   5   0   1
> > >     g1:   2   0  93   4   0   0
> > >     g2:   2   0  92   6   0   0
> > >     n7:  13   2  65  20   0   1
> > >     stri:   4   1  92   3   0   0
> > >     g3:  32   2  62   4   0   1
> > >     g1:   3   0  94   3   0   0
> > >     g2:   3   1  94   3   0   0
> > >     n7:  13   3  61  22   0   1
> > >     stri:   4   1  90   5   0   0
> > >     g3:  31   2  61   4   0   1
> > >     g1:   3   0  89   7   0   0
> > >     g2:   2   1  89   8   0   0
> > >     n7:  20   3  50  25   0   1
> > >     stri:   6   1  87   5   0   0
> > >     g3:  57   2  36   3   0   1
> > > 
> > >     # ceph tell osd.\* version
> > >     osd.0: { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.1: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.2: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.3: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.4: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.5: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.6: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.7: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.8: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.9: { "version":
> > > "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.10:
> > > { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.11:
> > > { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.12:
> > > { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.13:
> > > { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.14:
> > > { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"} osd.15:
> > > { "version": "ceph version 0.80.6
> > > (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
> > > 
> > >     Cheers
> > > 
> > >     > -Greg
> > >     > On Sat, Nov 8, 2014 at 11:14 AM Loic Dachary <loic@xxxxxxxxxxx
> > >     > <mailto:loic@xxxxxxxxxxx> <mailto:loic@xxxxxxxxxxx
> > >     > <mailto:loic@xxxxxxxxxxx>>> wrote:
> > >     >
> > >     >     Hi,
> > >     >
> > >     >     This is a first attempt, it is entirely possible that the
> > >     > solution is simple or RTFM ;-)
> > >     >
> > >     >     Here is the problem observed:
> > >     >
> > >     >     rados --pool ec4p1 bench 120 write # the erasure coded pool
> > >     >     Total time run:         147.207804
> > >     >     Total writes made:      458
> > >     >     Write size:             4194304
> > >     >     Bandwidth (MB/sec):     12.445
> > >     >
> > >     >     rados --pool disks bench 120 write # same crush ruleset at
> > >     > the cache tier Total time run:         126.312601
> > >     >     Total writes made:      1092
> > >     >     Write size:             4194304
> > >     >     Bandwidth (MB/sec):     34.581
> > >     >
> > >     >     There must be something wrong in how the cache tier is
> > >     > setup: one would expect the same write speed since the total
> > >     > size written (a few GB) is lower than the size of the cache
> > >     > pool. Instead the write speed is consistently at least twice
> > >     > slower (12.445 * 2 < 34.581).
> > >     >
> > >     >     root@g1:~# ceph osd dump | grep disks
> > >     >     pool 58 'disks' replicated size 3 min_size 2 crush_ruleset
> > >     > 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change
> > >     > 15110 lfor 12228 flags hashpspool stripe_width 0 root@g1:~#
> > >     > ceph osd dump | grep ec4 pool 74 'ec4p1' erasure size 5
> > >     > min_size 4 crush_ruleset 2 object_hash rjenkins pg_num 32
> > >     > pgp_num 32 last_change 15604 lfor 15604 flags hashpspool tiers
> > >     > 75 read_tier 75 write_tier 75 stripe_width 4096 pool 75
> > >     > 'ec4p1c' replicated size 3 min_size 2 crush_ruleset 0
> > >     > object_hash rjenkins pg_num 12 pgp_num 12 last_change 15613
> > >     > flags hashpspool,incomplete_clones tier_of 74 cache_mode
> > >     > writeback target_bytes 1000000000 target_objects 1000000000
> > >     > hit_set bloom{false_positive_____probability: 0.05,
> > >     > target_size: 0, seed: 0} 3600s x1 stripe_width 0
> > >     >
> > >     >     root@g1:~# ceph df
> > >     >     GLOBAL:
> > >     >         SIZE       AVAIL      RAW USED     %RAW USED
> > >     >         26955G     18850G        6735G         24.99
> > >     >     POOLS:
> > >     >         NAME            ID     USED       %USED     MAX
> > >     > AVAIL     OBJECTS ..
> > >     >         disks           58      1823G      6.76
> > >     > 5305G      471080 ..
> > >     >         ec4p1           74       589G      2.19
> > >     > 12732G      153623 ec4p1c          75     57501k
> > >     > 0         5305G         491
> > >     >
> > >     >
> > >     >     Cheers
> > >     >     --
> > >     >     Loïc Dachary, Artisan Logiciel Libre
> > >     >
> > >     >     ___________________________________________________
> > >     >     ceph-users mailing list
> > >     >     ceph-users@xxxxxxxxxxxxxx
> > >     > <mailto:ceph-users@xxxxxxxxxxxxxx>
> > >     > <mailto:ceph-users@xxxxxxxxxx.__com
> > >     > <mailto:ceph-users@xxxxxxxxxxxxxx>>
> > >     > http://lists.ceph.com/____listinfo.cgi/ceph-users-ceph.____com
> > >     > <http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com>
> > >     > <http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
> > >     > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>>
> > >     >
> > > 
> > >     --
> > >     Loïc Dachary, Artisan Logiciel Libre
> > > 
> > 
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com