V58 db migrate fail - piece_expiration.db - 1.108.3

[truncated JSON log entry repeating the same error, "Error migrating tables for database on storagenode: migrate: v58: database disk image is malformed", with the same stack trace in "errorVerbose"]
Error: Error migrating tables for database on storagenode: migrate: v58: database disk image is malformed
        storj.io/storj/private/migrate.(*Migration).Run:212
        storj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:425
        main.cmdRun:100
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:393
        storj.io/common/process.cleanup.func1:411
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:271
2024-07-23 06:00:31,636 INFO stopped: storagenode (exit status 1)

Seems I have an error on a DB this morning - has this been a common issue?

If the database is malformed, you need to fix it:

or re-create it (all historic and current progress will be lost, including a payout estimation):
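For context, the "fix" guide boils down to an SQLite integrity check followed by a dump-and-reload. A rough sketch of the gist, with placeholder paths (the linked guide has the authoritative, step-by-step version):

```bash
# Check each database file; a healthy one prints "ok".
sqlite3 /path/to/storage/piece_expiration.db "PRAGMA integrity_check;"

# If it is not "ok": back it up, dump whatever is still readable, strip the
# unfinished transaction statements, and rebuild the file from that dump.
cp piece_expiration.db piece_expiration.db.bak
sqlite3 piece_expiration.db ".dump" > dump_all.sql
grep -v -e TRANSACTION -e ROLLBACK -e COMMIT dump_all.sql > dump_all_notrans.sql
rm piece_expiration.db
sqlite3 piece_expiration.db ".read dump_all_notrans.sql"
```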

Why did you upgrade to a beta version and then downgrade? That's not a normal process.


@Alexey I have not touched the node.

It was online for over 400hrs before this happened.

Very worrying that the node decided to take a BETA update ??? How… :scream_cat:

Please can you feed that back to whoever needs to know…

In the last few hours I have created a script which checks / fixes .db files, based on that help guide. (I found the guide after I posted.) :+1:

Uploaded here - happy for contributions to help solve DB issues quicker.

storj_useful_scripts/fix_malformed_database at main · storjdashboard/storj_useful_scripts · GitHub

Is version 1.108 beta??? @Alexey

I have multiple nodes using v1.108.3

No, 1.108.3 is in rollout

What made you come to that conclusion?

@Alexey mentioned, from reviewing the logs, that I had a beta version and then downgraded?? - I explained that I have not touched the node for quite a few weeks…

Thanks,
Hass


It must have been some other thread he had in mind - but did you try to fix the issue?

I must say - I have only had a few instances where DBs have failed… the process can be quite long-winded…

So I made a script to do the same repair job as in the help guide by Storj…

I am back up and running :slight_smile:

Cheers everyone… and I hope anyone passing by feels free to use the script, or contribute to make it better if required.

+1 I have trouble with piece_expiration.db in my 1.108.3 storage node


I wonder if this is a feature/bug of the new version :slight_smile:

My little script might help you.

Thanks for adding to this thread - hopefully this will raise more awareness of the issue occurring.

I have a few nodes, at least 4 of which are on 1.108.3 - this was the first one to have an issue… So I don't think it happens on EVERY node… but it might well hit a node that has a slightly slower disk… maybe…

However, all my DBs are on SSD, not HDD.

What was the outcome of your fix? I would assume the .db was corrupted prior to the upgrade task. Are you sure you didn't cut it off, thinking it was unresponsive? Did it run out of memory and get killed once before that logging?
People are best informed that migration can take a while on a slow filesystem; patience is needed.
Anyway, it sounds like prior corruption was then triggered by the upgrade; however, was your prior version 1.104.5 or 1.105.4? Because I think each recent version did a DB update, which could cause corruption going from 1.104.5 --> 1.108.3, or maybe even the short-lived 1.107.x.

Kewl that you scripted that & shared it with the community, that's great!

2 cents

I'm sorry, I thought that you had upgraded the node manually, for example to 1.109.x, and that it was then downgraded automatically to the rolling version 1.108.x.
This very often happens if the version contains a database migration and the node is then downgraded…

I’m glad that you fixed the issue!

I copied your post as a separate topic in Troubleshooting with the useful-tools tag:


Same problem today, after restarting my server following some Windows updates.

All nodes threw this error:

2024-07-25T16:25:51Z ERROR failure during run {"Process": "storagenode", "error": "Error checking version for storagenode database: validate db version mismatch: expected 26, but current version is 60\n\tstorj.io/storj/private/migrate.(*Migration).ValidateVersions:141\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).CheckVersion:671\n\tmain.cmdRun:103\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267", "errorVerbose": "Error checking version for storagenode database: validate db version mismatch: expected 26, but current version is 60\n\tstorj.io/storj/private/migrate.(*Migration).ValidateVersions:141\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).CheckVersion:671\n\tmain.cmdRun:103\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267\n\tmain.cmdRun:105\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267"}
2024-07-25T16:25:51Z FATAL Unrecoverable error {"Process": "storagenode", "error": "Error checking version for storagenode database: validate db version mismatch: expected 26, but current version is 60\n\tstorj.io/storj/private/migrate.(*Migration).ValidateVersions:141\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).CheckVersion:671\n\tmain.cmdRun:103\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267", "errorVerbose": "Error checking version for storagenode database: validate db version mismatch: expected 26, but current version is 60\n\tstorj.io/storj/private/migrate.(*Migration).ValidateVersions:141\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).CheckVersion:671\n\tmain.cmdRun:103\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267\n\tmain.cmdRun:105\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267"}

I did not play with a beta version or anything at all before that; the nodes had 200h to 400h of uptime before the reboot.

Fixed it by making a backup of the databases, deleting all databases, launching the node to create new ones, stopping the node, restoring the backup, and then replacing ONLY piece_expiration.db with the new empty version created earlier.
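Roughly, as commands - a sketch only, assuming a docker node named "storagenode" and database files in $DB_DIR; adjust the names and paths to your own setup and keep the backup until everything works:

```bash
DB_DIR=/mnt/storj/storagenode/storage   # placeholder - wherever your *.db files live
BACKUP_DIR=/mnt/storj/db-backup

docker stop -t 300 storagenode
mkdir -p "$BACKUP_DIR"
cp "$DB_DIR"/*.db "$BACKUP_DIR"/                  # 1. back up every database
rm "$DB_DIR"/*.db                                 # 2. delete them (handle *.db-wal / *.db-shm the same way, if present)
docker start storagenode && sleep 60              # 3. the node recreates a fresh, empty set
docker stop -t 300 storagenode
cp "$DB_DIR"/piece_expiration.db /tmp/piece_expiration.empty.db
cp "$BACKUP_DIR"/*.db "$DB_DIR"/                  # 4. restore the originals...
cp /tmp/piece_expiration.empty.db "$DB_DIR"/piece_expiration.db   # 5. ...except piece_expiration.db
docker start storagenode
```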

I only messed up one node where I lost my backup; now the dashboard is telling me the drive is empty. What can I do?

Now my databases are not consistent, I suppose? (Some piece_expiration.db files were over 800 MB.)

What can I do to fix and regenerate them? How?

Which files must be kept if I go the route of regenerating everything from scratch?

Just restart the node with the scan on startup enabled (it's enabled by default); eventually it will fill the databases with the actual usage, but this may take days. Make sure that you do not have errors related to the filewalkers or the databases.
Unfortunately the piece_expiration.db will not be regenerated, so the TTL data will not be removed when it expires, but it should be collected by the Garbage Collector after a week and, after another week, removed permanently.
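(The startup scan is controlled by a config option; the option name below is my assumption, so verify it against your own config.yaml before relying on it:)

```bash
# Assumed option name - check your own config.yaml.
grep -i "piece-scan-on-startup" /path/to/config.yaml
# Desired setting (it is also the default when the line is absent):
#   storage2.piece-scan-on-startup: true
```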

Yes, I'm wondering if this has affected nodes which don't have SUPER QUICK drives… Although my thought was that it affects the DBs, and if they are on SSD… that's more than fast enough… even if it's a USB stick.

My script to back up / repair the database probably would have fixed it here - the script I created fixed my piece_expiration.db… but I assume it would have cut off the last few rows which had an issue / an incomplete dataset.

@Alexey - Is this worth flagging to the devs, so they can look into this 'potential' bigger issue as the version gets rolled out further?

They are aware of some issues. This one is reported as well. If it can be reproduced without any intervention, like an upgrade to the beta that is then automatically rolled back, then it is a potential bug - either in the rollout process or in the updater.

If you mean the database migration incompatibility, then I believe we need to keep this check.


So I too woke up today to the FATAL v58 DB migrate fail on one node.
It's a Windows GUI node.
I began to repair the DB manually according to the official guide (How to fix a "database disk image is malformed").

The only DB that was not "ok" in the PRAGMA check was piece_expiration.db.

The file is ~500 MB,
dump_all.sql is ~130 MB,
dump_all_notrans.sql is exactly the same size, ~130 MB.

After finishing step 13,
piece_expiration.db is about 154 MB,
and after putting it back into the storagenode, the node gives a new error, this time at v26 - something about a duplicate "trash" column during migration.

Whatever - because, first of all, how am I supposed to repair a .db
if the node constantly tries to restart itself every 2-3 minutes, despite me ordering it to be stopped? I also stopped the updater, but the node lives its own life! (producing the v26 error every few minutes)

So what now?

Here:

2024-07-26T12:46:22+01:00	INFO	Anonymized tracing enabled
2024-07-26T12:46:22+01:00	INFO	Operator email	{"Address": "******@****"}
2024-07-26T12:46:22+01:00	INFO	Operator wallet	{"Address": "0x3*******************"}
2024-07-26T12:46:23+01:00	INFO	Telemetry enabled	{"instance ID": "**************"}
2024-07-26T12:46:23+01:00	INFO	Event collection enabled	{"instance ID": "***********************"}
2024-07-26T12:46:23+01:00	INFO	db.migration.26	Add Trash column to pieceExpirationDB
2024-07-26T12:46:23+01:00	ERROR	failure during run	{"error": "Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:425\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "errorVerbose": "Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:425\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n\tmain.cmdRun:102\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-07-26T12:46:23+01:00	FATAL	Unrecoverable error	{"error": "Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:425\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "errorVerbose": "Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:425\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n\tmain.cmdRun:102\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-07-26T12:46:26+01:00	INFO	Configuration loaded (…and it just goes in a loop every 2 min or so)

(P.S. I have 1 or 2 other Windows nodes working on 1.108.3, about 29h online I see, and there seems to be no problem there. But please don't roll out this 1.108.3 - I'm not going to run around fixing my nodes like a dog now. It's labeled pre-release, but it has still been rolled out to my node (almost all my nodes are on 1.108.3, I checked now, and only this one has a problem so far), while the latest on GitHub still shows as 1.105.4.)

I got stuck at the v59 migration. I edited the DB by looking at the code for the v59 DB changes. I then started the node and the migration completed successfully. Now I am stuck with collector warnings about files not being found.

2024-07-26T12:26:03Z WARN collector unable to delete piece      { "Satellite ID": 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE "Piece ID": "DXVIUNDADUW3FNLT6YPUCGUU25CZXQRNDWNAGUOCBKVXAWHFK45Q", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:122\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:290\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:270\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:356\n\tstorj.io/storj/storagenode/collector.(*Service).Collect.func1:88\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:71\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:576\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:87\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:65\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:61\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-07-26T12:26:03Z WARN collector unable to delete piece      { "Satellite ID": 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE "Piece ID": "ZN7SDB2D2DFTQNFEMEPWWGBNV2JG7RSM2WC2LZYRXXOIGRZHDUWA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:122\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:290\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:270\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:356\n\tstorj.io/storj/storagenode/collector.(*Service).Collect.func1:88\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:71\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:576\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:87\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:65\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:61\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

I checked the log and I can see piece IDs repeating in the file-not-found entries. The log is at info level and growing at a rate of 150 MB per hour.

As a last resort, I am thinking of replacing piece_expiration.db with a blank one and bearing with the leftover TTL data for a while.
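For what it's worth, the mechanics of that are simple - a sketch assuming a docker node named "storagenode" and a placeholder database path; the node recreates a missing database file on startup:

```bash
DB_DIR=/mnt/storj/storagenode/storage   # placeholder - your database directory

docker stop -t 300 storagenode
# Keep the old file (and its -wal / -shm sidecars, if present) in case it is needed later.
mv "$DB_DIR"/piece_expiration.db "$DB_DIR"/piece_expiration.db.corrupt
docker start storagenode   # an empty piece_expiration.db is created on startup
```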