Just a new DB error

Today my node stopped working after a restart. The log shows this:

2024-05-08T06:56:21Z    INFO    db.migration.26 Add Trash column to pieceExpirationDB            {"Process": "storagenode"}
2024-05-08T06:56:21Z    ERROR   failure during run      {"Process": "storagenode", "error": "Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387\n\tmain.cmdRun:98\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267", "errorVerbose": "Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387\n\tmain.cmdRun:98\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267"}
Error: Error migrating tables for database on storagenode: migrate: v26: duplicate column name: trash
        storj.io/storj/private/migrate.(*Migration).Run:212
        storj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387
        main.cmdRun:98
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:393
        storj.io/common/process.cleanup.func1:411
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:267
2024-05-08 06:56:21,181 INFO exited: storagenode (exit status 1; not expected)

As far as I can see, this error has become more prevalent lately and seems to happen after a repeated upgrade (I use Docker; before this happened, the Docker container was removed and restarted on an older version). The log also shows an attempt to add the column.

Interestingly, the database file itself doesn’t seem to be corrupted:

root@STORJ4:/storj/DBs# sqlite3 "piece_expiration.db" 'PRAGMA integrity_check;'
ok
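
The node keeps several other SQLite databases next to this one, so the same check can be run across all of them. A minimal sketch, assuming they all live in /storj/DBs as in the output above:

for db in /storj/DBs/*.db; do
    echo -n "$db: "
    sqlite3 "$db" 'PRAGMA integrity_check;'
done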

So, this seems to be a bug: it tries to migrate an already migrated DB.
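
If migration 26 really ran before, the trash column should already be present in the schema, which can be checked with, for example (dumping the whole schema to avoid assumptions about table names):

sqlite3 "piece_expiration.db" '.schema'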

Version:

2024-05-08T06:56:12Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.102.3"}
2024-05-08T06:56:12Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-05-08T06:56:12Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.102.3"}
2024-05-08T06:56:12Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}

Besides, couldn’t an option be implemented for this, something like storage.recreateDBOnFailure?

Furthermore, it would be helpful if the reported error mentioned the database it pertains to. I had to comb through the log (the lines above are extracted from a bigger log): the error is mentioned three times, but the actual filename had to be recovered from an INFO line.
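
One way to pull the relevant lines together is to filter the log for both the migration INFO lines and the error itself, for example (assuming a Docker container named storagenode):

docker logs storagenode 2>&1 | grep -E 'db\.migration|Error migrating'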

I’ve asked the team to take a look.

What were the versions before and after the downgrade?

Why did you downgrade?

It was the latest Docker package, dating from about 3 months ago; I believe version 1.99.3. I don’t know why it isn’t keeping up with the stable version.

I didn’t downgrade on purpose.
But my node failed twice for unknown reasons, without anything in the logs. Therefore I decided to remove the Docker container, which resets it to the latest image, which I believe still ships version 1.99.3.

With due respect: you downgraded Docker nodes. Because the minimum version wasn’t changed for the new rollout, anyone restarting their Docker node was rolled back if it wasn’t their turn for the new version yet.

I warned about this in another topic. But in this case it’s not the node operator’s fault.

Edit: Found the post. See this one and the one 2 below it.


Exactly what happened here, so in my opinion the Docker images should contain the most recent stable version.

Interesting topic BTW, it describes exactly this behaviour. Also the fact that you’re downgraded automatically as soon as you remove your Docker instance and restart with a clean one: it loses all intervening upgrades.

The images don’t contain any version. Just a script that fetches the latest updater version and then the updater downloads the latest node version. This is necessary for Storj to be able to manage gradual update rollouts. It’s not a problem in itself, they just need to be careful with updating the minimum version before rolling out a new update. Or leave the cursor and seed untouched when switching versions partway through a rollout.
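
For the curious, the rollout state is published by the public version server, so something along these lines should show the suggested version together with the rollout seed and cursor the updater checks against (the exact JSON field names are from memory and may differ):

curl -s https://version.storj.io | jq '.processes.storagenode'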

As mentioned in the other topic. I think this was a mistake.

That said, version 26 of the database was added 5 years ago. So something else is going on on your node that makes it want to redo really old database migration steps.

Could you run:

sqlite3 "piece_expiration.db" 'SELECT * FROM versions;'

to see what version it lists?

Can’t anymore; I removed the databases and restarted the nodes, so it works again.
But those databases remain a pain in the ass now and then.

So I actually would plead for an option to auto-recreate certain databases as soon as a problem is encountered.

Something like:
storage.recreate-corrupt-db: ['notifications', 'piece_spaced_used', 'heldamount', 'orders', ...]

And implement auto-repair of databases.

Yeah, no thanks. If the issue is fixable, which it almost certainly would have been in your case, I’d rather fix it than have the stats trashed.

Oh yeah, I saved the bandwidth, used-space and heldamount databases. The remainder I trashed. So I still have my stats.

They could be renamed or moved instead of trashing them.

Then you have two databases with data that needs to be merged again. That leads to all sorts of new problems. There are already fairly simple processes to fix something like this for anyone who doesn’t mind losing stats. If the node silently fixes problems that shouldn’t be happening in the first place, the underlying issue will never be fixed.


But the question is, is it a good tradeoff? A node not starting, for the sake of keeping statistics?
I am not sure.
Others have also voiced that they don’t care about the node’s databases.

I had a node not starting because there was an issue with revocations.db.
I deleted it, the node started and recreated it. Is it really necessary to stop a node for something like that?

Auto-fixing would make sure that the node gets back online and is not halted for database problems which you might not even be able to solve right away anyway. The SNO could at least choose whether he wants to go that way or not. If he wants auto-fixing, there is no technical need to delete the existing databases. If the SNO wants to try to merge them, he is free to do so, but then it is up to him.


It’s about a node not starting because something damaged the databases. That means an underlying issue. It’s never a good idea to hide underlying issues. There may be severe file system issues or fatal errors in the code. Stuff that would never be detected or solved if the databases just get replaced.

So no, it’s not a good tradeoff.

Good for them, that’s not enough reason to damage them irrevocably for those who do care.

This can be really complicated, as merging might lead to conflicts, so it may not even be an option anymore. Instead, if you leave it alone, you can fix corrupted databases, fix a specific migration that didn’t finish correctly, or whatever else is going on, without losing data.

But more importantly, I’d want to know if my node has an underlying issue.


But your argument was

So halting a node for the sake of stats, that’s where I am saying this sounds like a bad idea.

I did not say that. I said: let the SNO choose. If you don’t want to have them replaced, you can keep them and have your node halted; if you don’t care about them, you can choose to have them moved out of the way and re-created. I mean, the databases are not that essential, as the node creates them anyway if they are not present.

This sounds more reasonable indeed. But the OP has just done that, and the underlying issue remains unknown. I cannot see how the current node behavior changed anything or helped here.

From my experience with corrupted node databases, I can add that there was not a single case in which I was able to repair a database, despite trying. I always tried, because I like keeping my stats. In the end I always had to remove it and have it recreated, and that was always the solution that made the node work again. So for me it never helped to fix anything or to find an underlying issue. The only thing it resulted in was downtime, in some cases even undetected for a while, as the node kept trying to restart over and over and therefore appeared to be running.

If the SNO is fine with potentially losing history data and prefers the node running over stopping it, why not let him have it that way?

Personally, I would very much like to have the history (old stats) moved out of the current month into some kind of archive database that is not accessed as frequently as the current databases and therefore has less chance of corruption. This archive would hold all the history data. It would make the decision to delete databases even easier, since you would lose only a couple of days of data or so.


It’s better to keep all databases and re-create only the failed one.
However, this still requires you to move all databases except the failed one to a backup folder, start the node, wait until all databases are created and migrated, then stop the node and restore the databases from the backup folder (see the sketch below).

Or at least also keep piece_expiration.db, to allow the node to remove expired (TTL) pieces right away instead of relying on the garbage collector for them.
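
A rough sketch of that procedure, assuming the databases live in /storj/DBs as in the output above, piece_expiration.db is the failed one, and the container is called storagenode (adjust paths and names to your setup):

docker stop -t 300 storagenode
mkdir -p /storj/DBs/backup
# set the failed database aside and back up the healthy ones
mv /storj/DBs/piece_expiration.db /storj/DBs/backup/piece_expiration.db.failed
mv /storj/DBs/*.db /storj/DBs/backup/
docker start storagenode
# wait until the node has re-created and migrated a fresh set of databases, then:
docker stop -t 300 storagenode
mv /storj/DBs/backup/*.db /storj/DBs/
docker start storagenode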


And there is a way (and I do not want to have it automated either, because by following this instruction you explicitly confirm that you are willing to lose the stats, and you are aware that the node has an issue and that it is visible one way or another; at least you are aware that the node crashes because of that issue):
