On Mon, Jun 11, 2012 at 12:53 PM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
> I have two questions. My newly created cluster runs xfs on all OSDs,
> Ubuntu Precise, kernel 3.2.0-23-generic, Ceph 0.47.2-1precise.
>
> pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num
> 64 pgp_num 64 last_change 1228 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins
> pg_num 64 pgp_num 64 last_change 1226 owner 0
> pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 64
> pgp_num 64 last_change 1232 owner 0
> pool 3 '.rgw' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8
> pgp_num 8 last_change 3878 owner 18446744073709551615
>
> 1. After I stop all daemons on one machine in my 3-node cluster with 3
> replicas, rbd image operations in the VM stall. dd on this device in the
> VM freezes, and after Ceph starts on that machine everything goes back
> online. Is there a problem with my config? In this situation Ceph should
> serve reads from the other copies, and send writes to the other OSDs in
> the replica chain, yes?

It should switch to a new "primary" OSD as soon as the surviving OSDs detect that one machine is missing, which by default takes ~25 seconds. How long did you wait to see if it would continue? If you'd like to reduce this time, you can turn down some combination of:

osd_heartbeat_grace — default 20 seconds; controls how long an OSD will wait before it decides a peer is down.
osd_min_down_reporters — default 1; controls how many OSDs need to report an OSD as down before the monitor accepts it. This is already as low as it should go.
osd_min_down_reports — default 3; controls how many failure reports the monitor needs to receive before accepting an OSD as down.

Since you only have 3 machines, and one of them is down, leaving this at 3 means you're going to wait for osd_heartbeat_grace plus osd_mon_report_interval_min (default 5; don't change this) before an OSD is marked down.
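For reference, these options live in ceph.conf. A minimal sketch — the values are purely illustrative (not recommendations), and placing everything under [global] is a simplification; the min_down options are read by the monitor and the heartbeat grace by the OSDs:

```ini
[global]
    ; Seconds without heartbeat replies before an OSD decides a peer
    ; is down (default 20). Lowering it speeds up failover at the
    ; cost of more false alarms on a busy network.
    osd heartbeat grace = 10

    ; Distinct OSDs that must report a peer as down before the
    ; monitor accepts it (default 1 -- already as low as it should go).
    osd min down reporters = 1

    ; Failure reports the monitor needs before marking an OSD down
    ; (default 3; with a single reporter this is effectively a
    ; repeat count).
    osd min down reports = 2

    ; Left at its default of 5 on purpose -- see above.
    ; osd mon report interval min = 5
```

These take effect on daemon restart; the overall time-to-marked-down is roughly the heartbeat grace plus the report interval, as described above.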
Given the logging you included, I'm a little concerned that you have 1 PG "stale", indicating that the monitor hasn't gotten a report on that PG in a very long time. That means either that one PG is somehow broken, or else that the OSD you turned off isn't getting marked down and that PG is the only one noticing it. Could you re-run this test with monitor debugging turned up, see how long it takes for the OSD to get marked down (using "ceph -w"), and report back?
-Greg

> Another test: iozone on the device. It stalls after the daemons stop on
> one machine, and once the OSDs are back up iozone moves forward again.
> How can I tune this to work without the freeze?
>
> 2012-06-11 21:38:49.583133 pg v88173: 200 pgs: 60 active+clean, 1 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
> 2012-06-11 21:38:50.582257 pg v88174: 200 pgs: 60 active+clean, 1 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
> .....
> 2012-06-11 21:39:49.991893 pg v88197: 200 pgs: 60 active+clean, 1 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
> 2012-06-11 21:39:50.992755 pg v88198: 200 pgs: 60 active+clean, 1 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
> 2012-06-11 21:39:51.993533 pg v88199: 200 pgs: 60 active+clean, 1 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
> 2012-06-11 21:39:52.994397 pg v88200: 200 pgs: 60 active+clean, 1 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
>
> After booting all OSDs on the stopped machine:
>
> 2012-06-11 21:40:37.826619 osd e4162: 72 osds: 53 up, 72 in
> 2012-06-11 21:40:37.825706 mon.0 10.177.66.4:6790/0 348 : [INF] osd.24 10.177.66.6:6800/21597 boot
> 2012-06-11 21:40:38.825297 pg v88202: 200 pgs: 54 active+clean, 7 stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
> 2012-06-11 21:40:38.826517 osd e4163: 72 osds: 54 up, 72 in
> 2012-06-11 21:40:38.825250 mon.0 10.177.66.4:6790/0 349 : [INF] osd.25 10.177.66.6:6803/21712 boot
> 2012-06-11 21:40:38.825655 mon.0 10.177.66.4:6790/0 350 : [INF] osd.28 10.177.66.6:6812/26210 boot
> 2012-06-11 21:40:38.825907 mon.0 10.177.66.4:6790/0 351 : [INF] osd.29 10.177.66.6:6815/26327 boot
> 2012-06-11 21:40:39.826738 pg v88203: 200 pgs: 56 active+clean, 4 stale+active+clean, 3 peering, 137 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 76921/254952 degraded (30.171%)
> 2012-06-11 21:40:39.830098 osd e4164: 72 osds: 59 up, 72 in
> 2012-06-11 21:40:39.826570 mon.0 10.177.66.4:6790/0 352 : [INF] osd.26 10.177.66.6:6806/21835 boot
> 2012-06-11 21:40:39.826961 mon.0 10.177.66.4:6790/0 353 : [INF] osd.27 10.177.66.6:6809/21953 boot
> 2012-06-11 21:40:39.828147 mon.0 10.177.66.4:6790/0 354 : [INF] osd.30 10.177.66.6:6818/26511 boot
> 2012-06-11 21:40:39.828418 mon.0 10.177.66.4:6790/0 355 : [INF] osd.31 10.177.66.6:6821/26583 boot
> 2012-06-11 21:40:39.828935 mon.0 10.177.66.4:6790/0 356 : [INF] osd.33 10.177.66.6:6827/26859 boot
> 2012-06-11 21:40:39.829274 mon.0 10.177.66.4:6790/0 357 : [INF] osd.34 10.177.66.6:6830/26979 boot
> 2012-06-11 21:40:40.827935 pg v88204: 200 pgs: 56 active+clean, 4 stale+active+clean, 3 peering, 137 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 76921/254952 degraded (30.171%)
> 2012-06-11 21:40:40.830059 osd e4165: 72 osds: 62 up, 72 in
> 2012-06-11 21:40:40.827798 mon.0 10.177.66.4:6790/0 358 : [INF] osd.32 10.177.66.6:6824/26701 boot
> 2012-06-11 21:40:40.829043 mon.0 10.177.66.4:6790/0 359 : [INF] osd.35 10.177.66.6:6833/27165 boot
> 2012-06-11 21:40:40.829316 mon.0 10.177.66.4:6790/0 360 : [INF] osd.36 10.177.66.6:6836/27280 boot
> 2012-06-11 21:40:40.829602 mon.0 10.177.66.4:6790/0 361 : [INF] osd.37 10.177.66.6:6839/27397 boot
> 2012-06-11 21:40:41.828776 pg v88205: 200 pgs: 56 active+clean, 4 stale+active+clean, 3 peering, 137 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 76921/254952 degraded (30.171%)
> 2012-06-11 21:40:41.831823 osd e4166: 72 osds: 68 up, 72 in
> 2012-06-11 21:40:41.828713 mon.0 10.177.66.4:6790/0 362 : [INF] osd.38 10.177.66.6:6842/27513 boot
> 2012-06-11 21:40:41.829440 mon.0 10.177.66.4:6790/0 363 : [INF] osd.39 10.177.66.6:6845/27628 boot
> 2012-06-11 21:40:41.830226 mon.0 10.177.66.4:6790/0 364 : [INF] osd.40 10.177.66.6:6848/27835 boot
> 2012-06-11 21:40:41.830531 mon.0 10.177.66.4:6790/0 365 : [INF] osd.41 10.177.66.6:6851/27950 boot
> 2012-06-11 21:40:41.830778 mon.0 10.177.66.4:6790/0 366 : [INF] osd.42 10.177.66.6:6854/28065 boot
> 2012-06-11 21:40:41.831249 mon.0 10.177.66.4:6790/0 367 : [INF] osd.43 10.177.66.6:6857/28181 boot
> 2012-06-11 21:40:42.830440 pg v88206: 200 pgs: 57 active+clean, 4 stale+active+clean, 7 peering, 132 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 75543/254952 degraded (29.630%)
> 2012-06-11 21:40:42.833294 osd e4167: 72 osds: 72 up, 72 in
> 2012-06-11 21:40:42.831046 mon.0 10.177.66.4:6790/0 368 : [INF] osd.44 10.177.66.6:6860/28373 boot
> 2012-06-11 21:40:42.832004 mon.0 10.177.66.4:6790/0 369 : [INF] osd.45 10.177.66.6:6863/28489 boot
> 2012-06-11 21:40:42.832314 mon.0 10.177.66.4:6790/0 370 : [INF] osd.46 10.177.66.6:6866/28607 boot
> 2012-06-11 21:40:42.832545 mon.0 10.177.66.4:6790/0 371 : [INF] osd.47 10.177.66.6:6869/28731 boot
> 2012-06-11 21:40:43.830481 pg v88207: 200 pgs: 64 active+clean, 4 stale+active+clean, 7 peering, 125 active+degraded; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 72874/254952 degraded (28.583%)
> 2012-06-11 21:40:43.831113 osd e4168: 72 osds: 72 up, 72 in
> 2012-06-11 21:40:44.832521 pg v88208: 200 pgs: 79 active+clean, 1 stale+active+clean, 4 peering, 113 active+degraded, 3 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 66185/254952 degraded (25.960%)
> 2012-06-11 21:40:45.834077 pg v88209: 200 pgs: 104 active+clean, 1 stale+active+clean, 4 peering, 85 active+degraded, 6 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 50399/254952 degraded (19.768%)
> 2012-06-11 21:40:46.835367 pg v88210: 200 pgs: 125 active+clean, 1 stale+active+clean, 4 peering, 59 active+degraded, 11 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 38563/254952 degraded (15.126%)
> 2012-06-11 21:40:47.836516 pg v88211: 200 pgs: 158 active+clean, 1 stale+active+clean, 26 active+degraded, 15 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 18542/254952 degraded (7.273%)
> 2012-06-11 21:40:48.853560 pg v88212: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 1/254952 degraded (0.000%)
> 2012-06-11 21:40:49.868514 pg v88213: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 1/254952 degraded (0.000%)
> 2012-06-11 21:40:50.858244 pg v88214: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail; 1/254952 degraded (0.000%)
> 2012-06-11 21:40:51.845622 pg v88215: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:52.857823 pg v88216: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:53.858281 pg v88217: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:54.855602 pg v88218: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:55.857241 pg v88219: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:56.857631 pg v88220: 200 pgs: 184 active+clean, 16 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:57.858987 pg v88221: 200 pgs: 185 active+clean, 15 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:58.880252 pg v88222: 200 pgs: 185 active+clean, 15 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:40:59.861910 pg v88223: 200 pgs: 188 active+clean, 12 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:41:00.902582 pg v88224: 200 pgs: 191 active+clean, 9 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:41:01.907767 pg v88225: 200 pgs: 196 active+clean, 4
> active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:41:02.876377 pg v88226: 200 pgs: 199 active+clean, 1 active+recovering; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
> 2012-06-11 21:41:03.876929 pg v88227: 200 pgs: 200 active+clean; 783 GB data, 1928 GB used, 18111 GB / 20040 GB avail
>
> <disk type="network" device="disk">
>   <driver name="qemu" type="raw"/>
>   <source protocol="rbd" name="rbd/foo4">
>   </source>
>   <target dev="vdf" bus="virtio"/>
> </disk>
>
> 2. When I use rbd_cache=1 (or true) in my libvirt XML, I get:
>
> <disk type="network" device="disk">
>   <driver name="qemu" type="raw"/>
>   <source protocol="rbd" name="rbd/foo5:rbd_cache=1">
>   </source>
>   <target dev="vdf" bus="virtio"/>
> </disk>
>
> libvirtd.log:
> 2012-06-11 18:50:36.992+0000: 1751: error : qemuMonitorTextAddDrive:2820 : operation failed: open disk image file failed
>
> Libvirt version 0.9.8-2ubuntu17, with some additional patches applied
> before Ceph 0.46 appeared. Qemu-kvm 1.0+noroms-0ubuntu13.
>
> Do I need any other patch for libvirt? Without rbd_cache, attaching works fine.
>
> --
> -----
> Regards,
>
> Sławek "sZiBis" Skowron