Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Sebastian Mazza <sebastian@xxxxxxxxxxx> · Thu, 10 Mar 2022 20:37:49 +0100

Hi Igor!

I hope I've cracked the checkpot now. I have logs with osd debug level 20 for bluefs, bdev, and bluestore. The log files ceph-osd.4.log shows 2 consecutive startups of the osd.4 where the second startup results in:
```
rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002193.sst
```

The log ceph-osd.4.log shows the last two starts of the OSD. However, I send you also the previous log that contains the “*** Immediate shutdown (osd_fast_shutdown=true) ***” that happened before the last successful OSD startup, as the last line. It is the last line, because the server was down for about 39 hours. Based on what you explained to me, the corruption of the rocks DB must have occurred during the OSD startup, which starts at the first line of ceph-osd.4.log.

Logs and the corrupted rocks DB file: https://we.tl/t-yVsF3akxqX

Does the logs contain what you need?
Pleas tell me If you need more data from the OSD. If not I would rebuild it.

Best wishes,
Sebastian

> On 26.02.2022, at 00:12, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> 
> Sebastian,
> 
> On 2/25/2022 7:17 PM, Sebastian Mazza wrote:
>> Hi Igor,
>> 
>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the point where actual corruption happened as the previois log did. These new logs miss the last successful OSD start which apparently corrupts DB data. Do you have any output prior to their content?
>> I have logs from the last successful startup of all OSDs, since this was the cluster boot where the OSD.7 failed (21-02-2022 at 13:49:28.452+0100). That also means that the logs does not include the bluestore debug info, only bluefs and bdev. I will include all logs that I have for the crashed OSD 5 and 6. The last successful startup is inside the files that ends with “4”. While the startups are collected with  bluefs and bdev debug infos, the logs between the startup (21-02-2022 at 13:49:28.452+0100) and the crash (2022-02-24T00:04:57.874+0100) was collected without any debug logging, since during that time I have rebuild OSD 7.
>> 
>> Logs: https://we.tl/t-x0la8gZl0S
> Unfortunately they're of little help without verbose output :(
>> 
>> Since I have to manually rebuild the the cluster, pools, cephFS, and RBD images now, It will take several days before I can try to reproduce the problem.
>> 
>> Should, or could I enable more verbose logging then
>> ```
>> [osd]
>>         debug bluefs = 20
>>         debug bdev = 20
>>         debug bluestore = 20
>> ```
>> in ceph.conf? Or somehow else?
> 
> This can be done either through ceph.conf or using "ceph config set" CLI command.
> 
> To make resulting log less you can use a bit less verbose logging levels:
> 
> debug bluefs = 10
> 
> debug bdev = 5
> 
> debug bluestore = 20
> 
> 
>> 
>> Thanks,
>> Sebastian
>> 
>> 
>> 
>>> On 25.02.2022, at 16:18, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>> 
>>> Hi Sebastian,
>>> 
>>> I submitted a ticket https://tracker.ceph.com/issues/54409 which shows my analysis based on your previous log (from 21-02-2022). Which wasn't verbose enough at debug-bluestore level to make the final conclusion.
>>> 
>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the point where actual corruption happened as the previois log did. These new logs miss the last successful OSD start which apparently corrupts DB data. Do you have any output prior to their content?
>>> 
>>> If not could you please reproduce that once again? Generally  I'd like to see OSD log for a broken startup along with a couple of restarts back - the event sequence for the failure seems to be as follows:
>>> 
>>> 1) OSD is shutdown for the first time. It (for uncear reasons) keeps a set of deferred writes to be applied once again.
>>> 
>>> 2) OSD is started up which triggers deferred writes submissions. They overlap (again for unclear reasons so far) with DB data content written shortly before. The OSD starts properly but DB data corruption has happened at this point
>>> 
>>> 3) OSD is restarted again which reveals the data corruption and since that point OSD is unable to start.
>>> 
>>> So these last logs new logs include 3) only for now. While I need 1) & 2) as well...
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 2/24/2022 3:04 AM, Sebastian Mazza wrote:
>>>> Hi Igor,
>>>> 
>>>> I let ceph rebuild the OSD.7. Then I added
>>>> ```
>>>> [osd]
>>>> 	debug bluefs = 20
>>>>         debug bdev = 20
>>>>         debug bluestore = 20
>>>> ```
>>>> to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing anything to the pools on the HDDs (the Debian VM was not even running).
>>>> Immediately at the first boot OSD.5 and 6 crashed with the same “Bad table magic number” error. The OSDs 5 and 6 are on the same node, but not on the node of OSD 7, wich crashed the last two times.
>>>> 
>>>> Logs and corrupted rocks DB Files: https://we.tl/t-ZBXYp8r4Hq
>>>> I have saved the entire /var/log directory of every node and the result of
>>>> ```
>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 --out-dir /tmp/osd.5-data
>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 --out-dir /tmp/osd.6-data
>>>> ```
>>>> Let me know if you need something else.
>>>> 
>>>> 
>>>> I hop you can now track it down. I'm really looking forward to your interpretation of the logs.
>>>> 
>>>> 
>>>> Best Regards,
>>>> Sebastian
>>>> 
>>>> 
>>>>> On 22.02.2022, at 11:44, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>> 
>>>>> Hi Sebastian,
>>>>> 
>>>>> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>>>>>> Hey Igor!
>>>>>> 
>>>>>> 
>>>>>>> thanks a lot for the new logs - looks like they provides some insight.
>>>>>> I'm glad the logs are helpful.
>>>>>> 
>>>>>> 
>>>>>>> At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use conflicting physical extent or not hence the occasional nature of the issue...
>>>>>> Do I understand that correct: The corruption of the rocksDB (Table overwritten by zeros) happens at the first start of the OSD after  “*** Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system launches the OSD Service the RocksDB is still fine?
>>>>> Looks like that. From logs I can see an unexpected write to specific extent (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.
>>>>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
>>>>>> Best to my knowledge the OSD was hardly doing anything and I do not see any pattern that would fit to you explanation.
>>>>>> However, you certainly understand a lot more about it than I do, so I try to explain everything that could be relevant.
>>>>>> 
>>>>>> The Cluster has 3 Nodes. Each has a 240GB NVMe m.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs, one is on an U.2 NVMe SSD with 2TB and the other two are on 12TB HDDs.
>>>>>> 
>>>>>> I have configured two crush rules ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ is a replicated rule that uses only OSDs with class ’nvme’. The second rule is a tricky erasure rule. It selects exactly 2 OSDs on exactly 4 Hosts with class ‘hdd’. So it only works for a size of exactly 8. That means that a pool that uses this rule has always only “undersized” placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)
>>>>>> 
>>>>>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool uses the crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool, block-hdd.ec-pool uses the crush rule ‘ec4x2hdd’ with k=5,m=3 and a min size of 6.
>>>>>> 
>>>>>> The pool fs.data-nvme.c-pool is not used and the pool test-pool was used for rados bench a few month ago.
>>>>>> 
>>>>>> The pool fs.metadata-root-pool is used as metadata pool for cephFS and fs.data-root-pool as the root data pool for the cephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the cephFS and is specified as ceph.dir.layout for some folders of the cephFS. The whole cephFS is mounted by each of the 3 nodes.
>>>>>> 
>>>>>> The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs. The first VM runes with Ubuntu Desktop and the second with Debian as OS. The pool block-hdd.ec-pool hosts one RBD image (the data part, metadata on block-nvme.c-pool) that is attached to the Debian VM as second drive formatted with BTRFS. Furthermore the Debian VM mounts a sub directory of the cephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs was doing nothing, except from being booted, in the last couple of days.
>>>>>> 
>>>>>> I try to illustrate the pool usage as a tree:
>>>>>> * c3nvme (replicated, size=3, min_size=2)
>>>>>>     + device_health_metrics
>>>>>>     + test-pool
>>>>>>         - rados bench
>>>>>>     + fs.metadata-root-pool
>>>>>>         - CephFS (metadata)
>>>>>>     + fs.data-root-pool
>>>>>>         - CephFS (data root)
>>>>>>     + fs.data-nvme.c-pool
>>>>>>     + block-nvme.c-pool
>>>>>>         - RBD (Ubuntu VM, boot disk with ext4)
>>>>>> 	- RBD (Debian VM, boot disk with ext4)
>>>>>> * ec4x2hdd (ec,  k=5, m=3, min_size=6)
>>>>>>     + fs.data-hdd.ec-pool
>>>>>>         - CephFS (layout data)
>>>>>>     + block-hdd.ec-pool
>>>>>>         - RBD (Debian VM, disk 2 with BTRFS)
>>>>>> 
>>>>>> 
>>>>>> Last week I was writing about 30TB to the CephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contains hundred thousands of hard links. There was already nearly the same data on the storage and I used rsync to update it. I think something between 1 and 4 TB has changed and was updated by rsync.
>>>>>> 
>>>>>> I think the cluster was totally unused on Friday, but up and running and idling around. Then on Saturday I did a graceful shutdown of of all cluster nodes. Arround 5 minutes later when booted the servers again, the OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed the OSD.7 again. When, I checkt for the first time today at around 12:00, ceph was already finished with backfilling to OSD.7 and the cluster was idle again.
>>>>>> 
>>>>>> I then spend 70 minutes with writing 3 small files (one with about 500Byte and two with about 2KB) to the cephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again I did overwrite the 3 small files (some Bytes have changed) and shouted down all 3 nodes. I repeated that procedure 4 times within 70 minutes. Then the OSD 7 crashed again.
>>>>>> 
>>>>>> There may be was some more IO caused by mounting the RBD images with the BTRFS volume inside the Debian VM. But I do not know. I can only say that no user did use it. Only I was writing 4 times the 3 small files to cephFS on the fs.data-hdd.ec-pool.
>>>>>> 
>>>>>> Does this make any sense to you?
>>>>> Well, that's actually not a write pattern I would expect to cause the issue. But thanks a lot anyway. So could you please repeat that procedure with overwriting 3 small files and OSD restart and collect the logs with debug-bluestore set to 20.
>>>>>>> with debug-bluestore set to 20
>>>>>> Do you mean
>>>>>> ```
>>>>>> [osd]
>>>>>> 	debug bluestore = 20
>>>>>> ```
>>>>>> in ceph.conf?
>>>>> yes.  Or you might use 'ceph config set osd.7 debug-bluestore 20' instead
>>>>> 
>>>>>> Thanks,
>>>>>> Sebastian
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Hey Sebastian,
>>>>>>> 
>>>>>>> thanks a lot for the new logs - looks like they provides some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use conflicting physical extent or not hence the occasional nature of the issue...
>>>>>>> 
>>>>>>> And now I'd like to determine what's wrong with this deferred write replay.
>>>>>>> 
>>>>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Igor
>>>>>>> 
>>>>>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>>>>>> Hi Igor,
>>>>>>>> 
>>>>>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I have logs with “debug bluefs = 20” and "debug bdev = 20” for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has failed now the third time! Coincidence? Probably not…
>>>>>>>> The important thing seams to be that a shutdown and not only a restart of the entire cluster is performed. Since, this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>>>>>> 
>>>>>>>> I redeployed the OSD.7 after the crash from 2 days ago. And I started this new shutdown and boot series shortly after ceph had finished writing everything back to OSD.7, earlier today.
>>>>>>>> 
>>>>>>>> The corrupted RocksDB file (crash) is again only 2KB in size.
>>>>>>>> You can download the RocksDB file with the bad  table magic number and the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>>>>>> What else do you want?
>>>>>>>> 
>>>>>>>> From the log of the OSD.7:
>>>>>>>> —————
>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 (ceph:ceph)
>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty --pid-file
>>>>>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>>>>>> —————
>>>>>>>> 
>>>>>>>> For me this looks like that the OSD did nothing for nearly 2 minutes before it receives the termination request. Shouldn't this be enough time for flushing every imaginable write cache?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this helps you.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best wishes,
>>>>>>>> Sebastian
>>>>>>>> 
>>>>>>> -- 
>>>>>>> Igor Fedotov
>>>>>>> Ceph Lead Developer
>>>>>>> 
>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>> 
>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>> -- 
>>>>> Igor Fedotov
>>>>> Ceph Lead Developer
>>>>> 
>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>> 
>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>> -- 
>>> Igor Fedotov
>>> Ceph Lead Developer
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx