Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Igor Fedotov <igor.fedotov@xxxxxxxx> · Fri, 25 Feb 2022 18:18:54 +0300

Hi Sebastian,

I submitted a ticket https://tracker.ceph.com/issues/54409 which shows 
my analysis based on your previous log (from 21-02-2022). Which wasn't 
verbose enough at debug-bluestore level to make the final conclusion.

Unfortunately the last logs (from 24-02-2022) you shared don't include 
the point where actual corruption happened as the previois log did. 
These new logs miss the last successful OSD start which apparently 
corrupts DB data. Do you have any output prior to their content?

If not could you please reproduce that once again? Generally  I'd like 
to see OSD log for a broken startup along with a couple of restarts back 
- the event sequence for the failure seems to be as follows:

1) OSD is shutdown for the first time. It (for uncear reasons) keeps a 
set of deferred writes to be applied once again.

2) OSD is started up which triggers deferred writes submissions. They 
overlap (again for unclear reasons so far) with DB data content written 
shortly before. The OSD starts properly but DB data corruption has 
happened at this point

3) OSD is restarted again which reveals the data corruption and since 
that point OSD is unable to start.

So these last logs new logs include 3) only for now. While I need 1) & 
2) as well...

Thanks,

Igor

On 2/24/2022 3:04 AM, Sebastian Mazza wrote:
Hi Igor,

I let ceph rebuild the OSD.7. Then I added
```
[osd]
	debug bluefs = 20
         debug bdev = 20
         debug bluestore = 20
```
to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing anything to the pools on the HDDs (the Debian VM was not even running).
Immediately at the first boot OSD.5 and 6 crashed with the same “Bad table magic number” error. The OSDs 5 and 6 are on the same node, but not on the node of OSD 7, wich crashed the last two times.

Logs and corrupted rocks DB Files: https://we.tl/t-ZBXYp8r4Hq
I have saved the entire /var/log directory of every node and the result of
```
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 --out-dir /tmp/osd.5-data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 --out-dir /tmp/osd.6-data
```
Let me know if you need something else.

I hop you can now track it down. I'm really looking forward to your interpretation of the logs.

Best Regards,
Sebastian

On 22.02.2022, at 11:44, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

Hi Sebastian,

On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
Hey Igor!

thanks a lot for the new logs - looks like they provides some insight.
I'm glad the logs are helpful.

At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use conflicting physical extent or not hence the occasional nature of the issue...
Do I understand that correct: The corruption of the rocksDB (Table overwritten by zeros) happens at the first start of the OSD after  “*** Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system launches the OSD Service the RocksDB is still fine?
Looks like that. From logs I can see an unexpected write to specific extent (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.

So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
Best to my knowledge the OSD was hardly doing anything and I do not see any pattern that would fit to you explanation.
However, you certainly understand a lot more about it than I do, so I try to explain everything that could be relevant.

The Cluster has 3 Nodes. Each has a 240GB NVMe m.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs, one is on an U.2 NVMe SSD with 2TB and the other two are on 12TB HDDs.

I have configured two crush rules ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ is a replicated rule that uses only OSDs with class ’nvme’. The second rule is a tricky erasure rule. It selects exactly 2 OSDs on exactly 4 Hosts with class ‘hdd’. So it only works for a size of exactly 8. That means that a pool that uses this rule has always only “undersized” placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)

The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool uses the crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool, block-hdd.ec-pool uses the crush rule ‘ec4x2hdd’ with k=5,m=3 and a min size of 6.

The pool fs.data-nvme.c-pool is not used and the pool test-pool was used for rados bench a few month ago.

The pool fs.metadata-root-pool is used as metadata pool for cephFS and fs.data-root-pool as the root data pool for the cephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the cephFS and is specified as ceph.dir.layout for some folders of the cephFS. The whole cephFS is mounted by each of the 3 nodes.

The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs. The first VM runes with Ubuntu Desktop and the second with Debian as OS. The pool block-hdd.ec-pool hosts one RBD image (the data part, metadata on block-nvme.c-pool) that is attached to the Debian VM as second drive formatted with BTRFS. Furthermore the Debian VM mounts a sub directory of the cephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs was doing nothing, except from being booted, in the last couple of days.

I try to illustrate the pool usage as a tree:
* c3nvme (replicated, size=3, min_size=2)
     + device_health_metrics
     + test-pool
         - rados bench
     + fs.metadata-root-pool
         - CephFS (metadata)
     + fs.data-root-pool
         - CephFS (data root)
     + fs.data-nvme.c-pool
     + block-nvme.c-pool
         - RBD (Ubuntu VM, boot disk with ext4)
	- RBD (Debian VM, boot disk with ext4)
* ec4x2hdd (ec,  k=5, m=3, min_size=6)
     + fs.data-hdd.ec-pool
         - CephFS (layout data)
     + block-hdd.ec-pool
         - RBD (Debian VM, disk 2 with BTRFS)

Last week I was writing about 30TB to the CephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contains hundred thousands of hard links. There was already nearly the same data on the storage and I used rsync to update it. I think something between 1 and 4 TB has changed and was updated by rsync.

I think the cluster was totally unused on Friday, but up and running and idling around. Then on Saturday I did a graceful shutdown of of all cluster nodes. Arround 5 minutes later when booted the servers again, the OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed the OSD.7 again. When, I checkt for the first time today at around 12:00, ceph was already finished with backfilling to OSD.7 and the cluster was idle again.

I then spend 70 minutes with writing 3 small files (one with about 500Byte and two with about 2KB) to the cephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again I did overwrite the 3 small files (some Bytes have changed) and shouted down all 3 nodes. I repeated that procedure 4 times within 70 minutes. Then the OSD 7 crashed again.

There may be was some more IO caused by mounting the RBD images with the BTRFS volume inside the Debian VM. But I do not know. I can only say that no user did use it. Only I was writing 4 times the 3 small files to cephFS on the fs.data-hdd.ec-pool.

Does this make any sense to you?
Well, that's actually not a write pattern I would expect to cause the issue. But thanks a lot anyway. So could you please repeat that procedure with overwriting 3 small files and OSD restart and collect the logs with debug-bluestore set to 20.
with debug-bluestore set to 20
Do you mean
```
[osd]
	debug bluestore = 20
```
in ceph.conf?
yes.  Or you might use 'ceph config set osd.7 debug-bluestore 20' instead

Thanks,
Sebastian

On 21.02.2022, at 21:19, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

Hey Sebastian,

thanks a lot for the new logs - looks like they provides some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use conflicting physical extent or not hence the occasional nature of the issue...

And now I'd like to determine what's wrong with this deferred write replay.

So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?

Thanks,

Igor

On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
Hi Igor,

today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I have logs with “debug bluefs = 20” and "debug bdev = 20” for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has failed now the third time! Coincidence? Probably not…
The important thing seams to be that a shutdown and not only a restart of the entire cluster is performed. Since, this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.

I redeployed the OSD.7 after the crash from 2 days ago. And I started this new shutdown and boot series shortly after ceph had finished writing everything back to OSD.7, earlier today.

The corrupted RocksDB file (crash) is again only 2KB in size.
You can download the RocksDB file with the bad  table magic number and the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
What else do you want?

 From the log of the OSD.7:
—————
2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty --pid-file
2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
—————

For me this looks like that the OSD did nothing for nearly 2 minutes before it receives the termination request. Shouldn't this be enough time for flushing every imaginable write cache?

I hope this helps you.

Best wishes,
Sebastian

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx