Error migrating tables for database on storagenode: migrate: v57

Hi, I use storagenode v3 in Docker. Since yesterday my node does not start. This is what I see in the logs:

2024-05-14 11:47:08,898 INFO success: processes-exit-eventlistener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-14 11:47:08,902 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-14 11:47:08,902 INFO exited: storagenode (exit status 1; not expected)
2024-05-14 11:47:09,905 INFO spawned: 'storagenode' with pid 43
2024-05-14 11:47:09,906 INFO success: storagenode-updater entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-14 11:47:09,906 WARN received SIGQUIT indicating exit request
2024-05-14 11:47:09,906 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2024-05-14T11:47:09+02:00       INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2024-05-14 11:47:09,910 INFO stopped: storagenode-updater (exit status 0)
2024-05-14T11:47:09+02:00       INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-05-14T11:47:09+02:00       INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2024-05-14T11:47:09+02:00       INFO    Operator email  {"Process": "storagenode", "Address": "*"}
2024-05-14T11:47:09+02:00       INFO    Operator wallet {"Process": "storagenode", "Address": "*"}
2024-05-14T11:47:09+02:00       INFO    server  kernel support for server-side tcp fast open remains disabled.  {"Process": "storagenode"}
2024-05-14T11:47:09+02:00       INFO    server  enable with: sysctl -w net.ipv4.tcp_fastopen=3  {"Process": "storagenode"}
2024-05-14T11:47:10+02:00       INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "*"}
2024-05-14T11:47:10+02:00       INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "*"}
2024-05-14T11:47:10+02:00       INFO    db.migration.57 Create new bandwidth_usage table, backfilling data from bandwidth_usage_rollups and bandwidth_usage tables, and dropping the old tables.  {"Process": "storagenode"}
2024-05-14T11:47:10+02:00       ERROR   failure during run      {"Process": "storagenode", "error": "Error migrating tables for database on storagenode: migrate: v57: NOT NULL constraint failed: bandwidth_usage_new.interval_start\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387\n\tmain.cmdRun:98\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267", "errorVerbose": "Error migrating tables for database on storagenode: migrate: v57: NOT NULL constraint failed: bandwidth_usage_new.interval_start\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387\n\tmain.cmdRun:98\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267"}
Error: Error migrating tables for database on storagenode: migrate: v57: NOT NULL constraint failed: bandwidth_usage_new.interval_start
        storj.io/storj/private/migrate.(*Migration).Run:212
        storj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387
        main.cmdRun:98
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:393
        storj.io/common/process.cleanup.func1:411
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:267
2024-05-14 11:47:18,397 INFO exited: storagenode (exit status 1; not expected)

Hey @Ironcurtain, we are looking into this; thank you for reporting it!
We will post an update as soon as the team figures out what the issue might be.


Hey @Ironcurtain, could you send a copy of your bandwidth.db file to help figure out what might be the cause of the failing migration?

Also, you could make a copy of the bandwidth.db:

cp /mnt/node/storage/bandwidth.db /mnt/node/storage/bandwidth_test.db 

and then run the query below:

sqlite3 /mnt/node/storage/bandwidth_test.db 'CREATE TABLE bandwidth_usage_new (
    interval_start TIMESTAMP NOT NULL,
    satellite_id BLOB NOT NULL,
    put_total BIGINT DEFAULT 0,
    get_total BIGINT DEFAULT 0,
    get_audit_total BIGINT DEFAULT 0,
    get_repair_total BIGINT DEFAULT 0,
    put_repair_total BIGINT DEFAULT 0,
    delete_total BIGINT DEFAULT 0,
    PRIMARY KEY (interval_start, satellite_id)
);
INSERT INTO bandwidth_usage_new (
    interval_start,
    satellite_id,
    put_total,
    get_total,
    get_audit_total,
    get_repair_total,
    put_repair_total,
    delete_total
)
SELECT
    datetime(date(interval_start)) AS interval_start,
    satellite_id,
    SUM(CASE WHEN action = 1 THEN amount ELSE 0 END) AS put_total,
    SUM(CASE WHEN action = 2 THEN amount ELSE 0 END) AS get_total,
    SUM(CASE WHEN action = 3 THEN amount ELSE 0 END) AS get_audit_total,
    SUM(CASE WHEN action = 4 THEN amount ELSE 0 END) AS get_repair_total,
    SUM(CASE WHEN action = 5 THEN amount ELSE 0 END) AS put_repair_total,
    SUM(CASE WHEN action = 6 THEN amount ELSE 0 END) AS delete_total
FROM
    bandwidth_usage_rollups
GROUP BY
    datetime(date(interval_start)), satellite_id, action
ON CONFLICT(interval_start, satellite_id) DO UPDATE SET
    put_total = put_total + excluded.put_total,
    get_total = get_total + excluded.get_total,
    get_audit_total = get_audit_total + excluded.get_audit_total,
    get_repair_total = get_repair_total + excluded.get_repair_total,
    put_repair_total = put_repair_total + excluded.put_repair_total,
    delete_total = delete_total + excluded.delete_total;'
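
If this first query fails with the same NOT NULL constraint error, you can first locate the offending rows in the test copy. This is only a sketch for diagnosis; it reads nothing but the test database and the columns already referenced above:

sqlite3 /mnt/node/storage/bandwidth_test.db 'SELECT satellite_id, interval_start
FROM bandwidth_usage_rollups
-- rows whose interval_start cannot be parsed as a date would violate the NOT NULL constraint
WHERE date(interval_start) IS NULL OR satellite_id IS NULL;'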

If it completes successfully, run this second query:

sqlite3 /mnt/node/storage/bandwidth_test.db 'INSERT INTO bandwidth_usage_new (
    interval_start,
    satellite_id,
    put_total,
    get_total,
    get_audit_total,
    get_repair_total,
    put_repair_total,
    delete_total
)
SELECT
    datetime(date(created_at)) AS interval_start,
    satellite_id,
    SUM(CASE WHEN action = 1 THEN amount ELSE 0 END) AS put_total,
    SUM(CASE WHEN action = 2 THEN amount ELSE 0 END) AS get_total,
    SUM(CASE WHEN action = 3 THEN amount ELSE 0 END) AS get_audit_total,
    SUM(CASE WHEN action = 4 THEN amount ELSE 0 END) AS get_repair_total,
    SUM(CASE WHEN action = 5 THEN amount ELSE 0 END) AS put_repair_total,
    SUM(CASE WHEN action = 6 THEN amount ELSE 0 END) AS delete_total
FROM
    bandwidth_usage
GROUP BY
    datetime(date(created_at)), satellite_id, action
ON CONFLICT(interval_start, satellite_id) DO UPDATE SET
    put_total = put_total + excluded.put_total,
    get_total = get_total + excluded.get_total,
    get_audit_total = get_audit_total + excluded.get_audit_total,
    get_repair_total = get_repair_total + excluded.get_repair_total,
    put_repair_total = put_repair_total + excluded.put_repair_total,
    delete_total = delete_total + excluded.delete_total;'
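
If both inserts complete, you can optionally sanity-check the backfilled table in the test copy before cleaning up. A minimal sketch, again only reading the test database created above:

sqlite3 /mnt/node/storage/bandwidth_test.db 'SELECT COUNT(*) AS row_count,
       MIN(interval_start) AS first_interval,
       MAX(interval_start) AS last_interval
FROM bandwidth_usage_new;'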

You can then remove this test database:

rm -rf /mnt/node/storage/bandwidth_test.db

Kindly post any issues that come up.


Thank you @Ironcurtain.

We found some data in your bandwidth.db that suggests there might be data corruption in the database.

While we’re fixing the migration to handle such cases, you can:

  1. run these queries on the actual bandwidth.db database to remove the invalid rows, then restart your node (see the example below):

    DELETE FROM bandwidth_usage_rollups WHERE date(interval_start) IS NULL OR satellite_id IS NULL;
    DELETE FROM bandwidth_usage WHERE date(created_at) IS NULL OR satellite_id IS NULL;
    
  2. if you don’t care much about losing the historical data shown on the dashboard graph, you can delete the database and restart the node

The database is also really large; after removing the rows you can run VACUUM; on it to reduce its size.
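
For example, with the node stopped first, the whole cleanup could look like this. This is only a sketch: the backup path is an example, the rest uses the paths and queries from this thread:

# stop the node container first, then keep a backup copy of the file before touching it
cp /mnt/node/storage/bandwidth.db /mnt/node/storage/bandwidth.db.bak

sqlite3 /mnt/node/storage/bandwidth.db '
DELETE FROM bandwidth_usage_rollups WHERE date(interval_start) IS NULL OR satellite_id IS NULL;
DELETE FROM bandwidth_usage WHERE date(created_at) IS NULL OR satellite_id IS NULL;
VACUUM;'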


This wouldn’t be enough; the storagenode is unable to correctly re-create a single missing database on its own.


Hello @Ironcurtain,
Welcome to the forum!

If you decide to re-create this database, you will need to follow this guide:
