Inconsistent PG got "repaired" automatically?


 



Hello,

we operate a Ceph cluster running v12.2.4 on top of Debian Stretch, deployed with ceph-volume lvm, using a default CRUSH map and a fairly vanilla ceph.conf. OSDs live on single disks in JBOD mode, each with a separate block.db LV on a shared SSD. We have a single pool (min_size=2, size=3) on this cluster, used for RBD volumes.
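
For reference, the pool's replication settings can be verified with standard commands like these (the pool name is taken from the rados command we use later in this message):

```
# replication settings of the RBD pool
ceph osd pool get rbd_vima size
ceph osd pool get rbd_vima min_size

# full pool details (crush rule, flags, application, etc.)
ceph osd pool ls detail
```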

During deep-scrubbing, we ran across an inconsistent PG. Specifically:

```
root@rd2-0427:~# rados list-inconsistent-obj 10.193c -f json-pretty
{
    "epoch": 10218,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.a0a7c12ae8944a.0000000000008a89",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 8068
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
"selected_object_info": "10:3c9c88a6:::rbd_data.a0a7c12ae8944a.0000000000008a89:head(11523'8068 client.10490700.0:168994 dirty|omap_digest s 4194304 uv 8068 od ffffffff alloc_hint [4194304 4194304 0])",
            "shards": [
                {
                    "osd": 64,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xfd46c017"
                },
                {
                    "osd": 84,
                    "primary": true,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304
                },
                {
                    "osd": 122,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xfd46c017"
                }
            ]
        }
    ]
}
```
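
For completeness, the affected PG can be found in the first place via the health output or rados itself; only the pool name below is specific to our setup:

```
# points to the PG(s) with scrub errors
ceph health detail

# list all inconsistent PGs of the pool
rados list-inconsistent-pg rbd_vima
```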

Read errors like this are quite common for us: I/O errors due to faulty disks happen
often, and the kernel logs confirm a failing disk in this case:

```
Apr 25 10:45:56 rd2-0530 kernel: [486531.949072] sd 1:0:6:0: [sdm] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 25 10:45:56 rd2-0530 kernel: [486531.964262] sd 1:0:6:0: [sdm] tag#1 Sense Key : Medium Error [current]
Apr 25 10:45:56 rd2-0530 kernel: [486531.977582] sd 1:0:6:0: [sdm] tag#1 Add. Sense: Read retries exhausted
Apr 25 10:45:56 rd2-0530 kernel: [486531.990798] sd 1:0:6:0: [sdm] tag#1 CDB: Read(16) 88 00 00 00 00 00 0c b1 be 80 00 00 02 00 00 00
Apr 25 10:45:56 rd2-0530 kernel: [486532.006759] blk_update_request: critical medium error, dev sdm, sector 212975501
Apr 25 10:45:56 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:56.857351 7f396a798700 -1 bdev(0x561b7fd14d80 /var/lib/ceph/osd/ceph-84/block) _aio_thread got (5) Input/output error
Apr 25 10:45:58 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:58.730731 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c shard 84: soid 10:3c9c88a6:::rbd_data.a0a7c12ae8944a.0000000000008a89:head candidate had a read error
Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709004 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 0 missing, 1 inconsistent objects
Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709013 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 1 errors
```
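
To double-check the disk itself, looking at SMART data on the OSD host (rd2-0530 here) is a reasonable next step; a minimal sketch, assuming smartmontools is installed:

```
# on rd2-0530: confirm which device backs osd.84 (block and block.db LVs)
ceph-volume lvm list

# SMART health and error counters of the suspect disk
smartctl -a /dev/sdm
```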

Also, osd.84, the one with the faulty shard, is the up_primary and
acting_primary OSD of this PG (10.193c).
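
For reference, the mapping can be checked with:

```
# up/acting sets of the PG; the first OSD listed in "acting" is the acting primary
ceph pg map 10.193c
```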

Before issuing `ceph pg repair` or marking the problematic OSD (osd.84) out, we wanted to observe how Ceph behaves when trying to read an object with a
problematic primary shard. So, we issued the following from a cluster node:

```
# rados -p rbd_vima get rbd_data.a0a7c12ae8944a.0000000000008a89 foobartest
```

The command completed successfully after 30-35 seconds and wrote the object correctly to the destination file. During this time we observed the
following:

```
2018-05-02 16:31:37.934812 mon.rd2-0427 mon.0 [2001:648:2ffa:198::68]:6789/0 247500 : cluster [WRN] Health check update: 188 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-02 16:31:39.772436 mon.rd2-0427 mon.0 [2001:648:2ffa:198::68]:6789/0 247501 : cluster [WRN] Health check failed: Degraded data redundancy: 1/3227247 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)
2018-05-02 16:31:42.934839 mon.rd2-0427 mon.0 [2001:648:2ffa:198::68]:6789/0 247504 : cluster [WRN] Health check update: 198 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-02 16:31:42.934893 mon.rd2-0427 mon.0 [2001:648:2ffa:198::68]:6789/0 247505 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1/3227247 objects degraded (0.000%), 1 pg degraded)
2018-05-02 16:31:37.384611 osd.84 osd.84 [2001:648:2ffa:198::77]:6834/1031919 462 : cluster [ERR] 10.193c missing primary copy of 10:3c9c88a6:::rbd_data.a0a7c12ae8944a.0000000000008a89:head, will try copies on 64,122
2018-05-02 16:31:47.935923 mon.rd2-0427 mon.0 [2001:648:2ffa:198::68]:6789/0 247510 : cluster [WRN] Health check update: 142 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-02 16:31:48.980015 mon.rd2-0427 mon.0 [2001:648:2ffa:198::68]:6789/0 247511 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 142 slow requests are blocked > 32 sec)
```
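
The "missing primary copy" message should also be present in osd.84's own log on rd2-0530, so it can be located with a simple grep (assuming the default log path, the same file we tail below):

```
# on rd2-0530: find the fallback-to-replicas message
grep 'missing primary copy' /var/log/ceph/ceph-osd.84.log
```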

After that, we tried to read the problematic object again, which succeeded instantly, without any delay. Unfortunately, we did not investigate these slow requests further, but our cluster currently has almost no load and no failures happened at the same time, so we suspect that our rados command somehow caused them. We also ran a deep-scrub on pg 10.193c, which did
not report any errors at all:

```
# ceph pg deep-scrub 10.193c
instructing pg 10.193c on osd.84 to deep-scrub

rd2-0530 # tail -f /var/log/ceph/ceph-osd.84.log
2018-05-02 16:38:23.070692 7fa3213cf700 0 log_channel(cluster) log [DBG] : 10.193c deep-scrub starts
2018-05-02 16:38:29.327847 7fa3213cf700 0 log_channel(cluster) log [DBG] : 10.193c deep-scrub ok
```
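
After a clean deep-scrub, re-running the earlier rados command should also confirm that no inconsistency is recorded for this PG anymore:

```
# should now report no inconsistent objects for 10.193c
rados list-inconsistent-obj 10.193c -f json-pretty
```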

After deep-scrubbing, our cluster returned to HEALTH_OK again. We also checked whether something else had happened in the meantime that could have
remapped the problematic PG (e.g. an OSD failure), but we did not observe anything like that.
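
A quick way to check for such a remap is to look at the PG's current state and up/acting sets, e.g. (jq is used only for readability, the raw query output works just as well):

```
ceph pg 10.193c query | jq '{state, up, acting}'
```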

So, at this point, we are stuck with a lot of questions:

1) After how much time does RADOS try to read from a secondary replica? Is
   this timeout configurable? (See also the config sketch after this list.)
2) If a primary shard is missing, does Ceph try to recreate it automatically
   somehow?
3) If Ceph recreates the primary shard (either automatically or with
   `ceph pg repair`), why did we not observe I/O errors again? Does BlueStore
   know which disk blocks are bad and somehow avoid them, or can the same
   object end up on different blocks when it is recreated? Unfortunately,
   I'm not familiar with its internals.
4) Is there any reason why the slow requests appeared? Can we correlate them
   somehow with our problem?
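
Regarding question 1, the only thing we can think of is to grep the running OSD's configuration for candidate options; the patterns below are only guesses on our part, not knobs we know to be the right ones:

```
# on rd2-0530 (where osd.84 runs): dump the live configuration and look for candidates
ceph daemon osd.84 config show | grep -Ei 'scrub_auto_repair|recovery'
```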

This behavior looks very confusing at first sight, and we'd really like
to know what is happening and what Ceph is doing internally. I'd really
appreciate any insights or pointers.

Best regards,
--
Nikos Kormpakis - nkorb@xxxxxxxxxxxx
Network Operations Center, Greek Research & Technology Network
Tel: +30 210 7475712 - http://www.grnet.gr
7, Kifisias Av., 115 23 Athens, Greece


