Re: Troubleshooting an erasure coded pool with a cache tier

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 09/11/2014 00:03, Gregory Farnum wrote:
> It's all about the disk accesses. What's the slow part when you dump historic and in-progress ops?

This is what I see on g1 (6% iowait)

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 0,
  "ops": []}

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 1,
  "ops": [
        { "description": "osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.000000001038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613)",
          "received_at": "2014-11-09 00:14:17.385256",
          "age": "0.538802",
          "duration": "0.011955",
          "type_data": [
                "waiting for sub ops",
                { "client": "client.4407100",
                  "tid": 11030174},
                [
                    { "time": "2014-11-09 00:14:17.385393",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-11-09 00:14:17.385563",
                      "event": "reached_pg"},
                    { "time": "2014-11-09 00:14:17.385793",
                      "event": "started"},
                    { "time": "2014-11-09 00:14:17.385807",
                      "event": "started"},
                    { "time": "2014-11-09 00:14:17.385875",
                      "event": "waiting for subops from 1,10"},
                    { "time": "2014-11-09 00:14:17.386201",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-11-09 00:14:17.386336",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-11-09 00:14:17.396293",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-11-09 00:14:17.396332",
                      "event": "op_commit"},
                    { "time": "2014-11-09 00:14:17.396678",
                      "event": "op_applied"},
                    { "time": "2014-11-09 00:14:17.397211",
                      "event": "sub_op_commit_rec"}]]}]}

and it looks ok. When I go to n7 which has 20% iowait, I see a much larger output http://pastebin.com/DPxsaf6z which includes a number of "event": "waiting_for_osdmap".

I'm not sure what to make of this and it would certainly be better if n7 had a lower iowait. Also when I ceph -w I see a new pgmap is created every second which is also not a good sign.

2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s
2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s
2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s
2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s
2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s
2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd, 2762 kB/s wr, 206 op/s

Cheers


> On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary <loic@xxxxxxxxxxx <mailto:loic@xxxxxxxxxxx>> wrote:
> 
>     Hi Greg,
> 
>     On 08/11/2014 20:19, Gregory Farnum wrote:> When acting as a cache pool it needs to go do a lookup on the base pool for every object it hasn't encountered before. I assume that's why it's slower.
>     > (The penalty should not be nearly as high as you're seeing here, but based on the low numbers I imagine you're running everything on an overloaded laptop or something.)
> 
>     It's running on a small cluster that is busy but not to a point that I expect such a difference:
> 
>     # dsh --concurrent-shell --show-machine-names --remoteshellopt=-p2222 -m g1 -m g2 -m g3 -m n7 -m stri dstat -c 10 3
>     g1: ----total-cpu-usage----
>     g1: usr sys idl wai hiq siq
>     g1:   6   1  88   6   0   0
>     g2: ----total-cpu-usage----
>     g2: usr sys idl wai hiq siq
>     g2:   4   1  88   7   0   0
>     n7: ----total-cpu-usage----
>     n7: usr sys idl wai hiq siq
>     n7:  18   3  58  20   0   1
>     stri: ----total-cpu-usage----
>     stri: usr sys idl wai hiq siq
>     stri:   6   1  86   6   0   0
>     g3: ----total-cpu-usage----
>     g3: usr sys idl wai hiq siq
>     g3:  37   2  55   5   0   1
>     g1:   2   0  93   4   0   0
>     g2:   2   0  92   6   0   0
>     n7:  13   2  65  20   0   1
>     stri:   4   1  92   3   0   0
>     g3:  32   2  62   4   0   1
>     g1:   3   0  94   3   0   0
>     g2:   3   1  94   3   0   0
>     n7:  13   3  61  22   0   1
>     stri:   4   1  90   5   0   0
>     g3:  31   2  61   4   0   1
>     g1:   3   0  89   7   0   0
>     g2:   2   1  89   8   0   0
>     n7:  20   3  50  25   0   1
>     stri:   6   1  87   5   0   0
>     g3:  57   2  36   3   0   1
> 
>     # ceph tell osd.\* version
>     osd.0: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.1: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.2: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.3: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.4: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.5: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.6: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.7: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.8: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.9: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.10: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.11: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.12: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.13: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.14: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
>     osd.15: { "version": "ceph version 0.80.6 (__f93610a4421cb670b08e974c6550ee__715ac528ae)"}
> 
>     Cheers
> 
>     > -Greg
>     > On Sat, Nov 8, 2014 at 11:14 AM Loic Dachary <loic@xxxxxxxxxxx <mailto:loic@xxxxxxxxxxx> <mailto:loic@xxxxxxxxxxx <mailto:loic@xxxxxxxxxxx>>> wrote:
>     >
>     >     Hi,
>     >
>     >     This is a first attempt, it is entirely possible that the solution is simple or RTFM ;-)
>     >
>     >     Here is the problem observed:
>     >
>     >     rados --pool ec4p1 bench 120 write # the erasure coded pool
>     >     Total time run:         147.207804
>     >     Total writes made:      458
>     >     Write size:             4194304
>     >     Bandwidth (MB/sec):     12.445
>     >
>     >     rados --pool disks bench 120 write # same crush ruleset at the cache tier
>     >     Total time run:         126.312601
>     >     Total writes made:      1092
>     >     Write size:             4194304
>     >     Bandwidth (MB/sec):     34.581
>     >
>     >     There must be something wrong in how the cache tier is setup: one would expect the same write speed since the total size written (a few GB) is lower than the size of the cache pool. Instead the write speed is consistently at least twice slower (12.445 * 2 < 34.581).
>     >
>     >     root@g1:~# ceph osd dump | grep disks
>     >     pool 58 'disks' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 15110 lfor 12228 flags hashpspool stripe_width 0
>     >     root@g1:~# ceph osd dump | grep ec4
>     >     pool 74 'ec4p1' erasure size 5 min_size 4 crush_ruleset 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 15604 lfor 15604 flags hashpspool tiers 75 read_tier 75 write_tier 75 stripe_width 4096
>     >     pool 75 'ec4p1c' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 12 pgp_num 12 last_change 15613 flags hashpspool,incomplete_clones tier_of 74 cache_mode writeback target_bytes 1000000000 target_objects 1000000000 hit_set bloom{false_positive_____probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
>     >
>     >     root@g1:~# ceph df
>     >     GLOBAL:
>     >         SIZE       AVAIL      RAW USED     %RAW USED
>     >         26955G     18850G        6735G         24.99
>     >     POOLS:
>     >         NAME            ID     USED       %USED     MAX AVAIL     OBJECTS
>     >     ..
>     >         disks           58      1823G      6.76         5305G      471080
>     >     ..
>     >         ec4p1           74       589G      2.19        12732G      153623
>     >         ec4p1c          75     57501k         0         5305G         491
>     >
>     >
>     >     Cheers
>     >     --
>     >     Loïc Dachary, Artisan Logiciel Libre
>     >
>     >     ___________________________________________________
>     >     ceph-users mailing list
>     >     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx> <mailto:ceph-users@xxxxxxxxxx.__com <mailto:ceph-users@xxxxxxxxxxxxxx>>
>     >     http://lists.ceph.com/____listinfo.cgi/ceph-users-ceph.____com <http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com> <http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>>
>     >
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux