Error migrating tables for database on storagenode: migrate: v57

Hi, I use storagenode v3 in Docker. Since yesterday my node does not start. This is what I see in the logs:

2024-05-14 11:47:08,898 INFO success: processes-exit-eventlistener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-14 11:47:08,902 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-14 11:47:08,902 INFO exited: storagenode (exit status 1; not expected)
2024-05-14 11:47:09,905 INFO spawned: 'storagenode' with pid 43
2024-05-14 11:47:09,906 INFO success: storagenode-updater entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-14 11:47:09,906 WARN received SIGQUIT indicating exit request
2024-05-14 11:47:09,906 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2024-05-14T11:47:09+02:00       INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2024-05-14 11:47:09,910 INFO stopped: storagenode-updater (exit status 0)
2024-05-14T11:47:09+02:00       INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-05-14T11:47:09+02:00       INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2024-05-14T11:47:09+02:00       INFO    Operator email  {"Process": "storagenode", "Address": "*"}
2024-05-14T11:47:09+02:00       INFO    Operator wallet {"Process": "storagenode", "Address": "*"}
2024-05-14T11:47:09+02:00       INFO    server  kernel support for server-side tcp fast open remains disabled.  {"Process": "storagenode"}
2024-05-14T11:47:09+02:00       INFO    server  enable with: sysctl -w net.ipv4.tcp_fastopen=3  {"Process": "storagenode"}
2024-05-14T11:47:10+02:00       INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "*"}
2024-05-14T11:47:10+02:00       INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "*"}
2024-05-14T11:47:10+02:00       INFO    db.migration.57 Create new bandwidth_usage table, backfilling data from bandwidth_usage_rollups and bandwidth_usage tables, and dropping the old tables.  {"Process": "storagenode"}
2024-05-14T11:47:10+02:00       ERROR   failure during run      {"Process": "storagenode", "error": "Error migrating tables for database on storagenode: migrate: v57: NOT NULL constraint failed: bandwidth_usage_new.interval_start\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387\n\tmain.cmdRun:98\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267", "errorVerbose": "Error migrating tables for database on storagenode: migrate: v57: NOT NULL constraint failed: bandwidth_usage_new.interval_start\n\tstorj.io/storj/private/migrate.(*Migration).Run:212\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387\n\tmain.cmdRun:98\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267\n\tmain.cmdRun:100\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267"}
Error: Error migrating tables for database on storagenode: migrate: v57: NOT NULL constraint failed: bandwidth_usage_new.interval_start
        storj.io/storj/private/migrate.(*Migration).Run:212
        storj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:387
        main.cmdRun:98
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:393
        storj.io/common/process.cleanup.func1:411
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:267
2024-05-14 11:47:18,397 INFO exited: storagenode (exit status 1; not expected)

Hey @Ironcurtain, we are looking into this; thank you for reporting it!
We will post an update as soon as the team figures out what the issue might be.


Hey @Ironcurtain, could you send a copy of your bandwidth.db file to help figure out what might be the cause of the failing migration?

Also, you could make a copy of the bandwidth.db:

cp /mnt/node/storage/bandwidth.db /mnt/node/storage/bandwidth_test.db 

and then run the query below:

sqlite3 /mnt/node/storage/bandwidth_test.db 'CREATE TABLE bandwidth_usage_new (
    interval_start TIMESTAMP NOT NULL,
    satellite_id BLOB NOT NULL,
    put_total BIGINT DEFAULT 0,
    get_total BIGINT DEFAULT 0,
    get_audit_total BIGINT DEFAULT 0,
    get_repair_total BIGINT DEFAULT 0,
    put_repair_total BIGINT DEFAULT 0,
    delete_total BIGINT DEFAULT 0,
    PRIMARY KEY (interval_start, satellite_id)
);
INSERT INTO bandwidth_usage_new (
    interval_start,
    satellite_id,
    put_total,
    get_total,
    get_audit_total,
    get_repair_total,
    put_repair_total,
    delete_total
)
SELECT
    datetime(date(interval_start)) AS interval_start,
    satellite_id,
    SUM(CASE WHEN action = 1 THEN amount ELSE 0 END) AS put_total,
    SUM(CASE WHEN action = 2 THEN amount ELSE 0 END) AS get_total,
    SUM(CASE WHEN action = 3 THEN amount ELSE 0 END) AS get_audit_total,
    SUM(CASE WHEN action = 4 THEN amount ELSE 0 END) AS get_repair_total,
    SUM(CASE WHEN action = 5 THEN amount ELSE 0 END) AS put_repair_total,
    SUM(CASE WHEN action = 6 THEN amount ELSE 0 END) AS delete_total
FROM
    bandwidth_usage_rollups
GROUP BY
    datetime(date(interval_start)), satellite_id, action
ON CONFLICT(interval_start, satellite_id) DO UPDATE SET
    put_total = put_total + excluded.put_total,
    get_total = get_total + excluded.get_total,
    get_audit_total = get_audit_total + excluded.get_audit_total,
    get_repair_total = get_repair_total + excluded.get_repair_total,
    put_repair_total = put_repair_total + excluded.put_repair_total,
    delete_total = delete_total + excluded.delete_total;'
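
If this first query fails with the same NOT NULL constraint error, you can first locate the offending rows in the test copy. This is only a sketch for diagnosis; it reads nothing but the test database and the columns already referenced above:

sqlite3 /mnt/node/storage/bandwidth_test.db 'SELECT satellite_id, interval_start
FROM bandwidth_usage_rollups
-- rows whose interval_start cannot be parsed as a date would violate the NOT NULL constraint
WHERE date(interval_start) IS NULL OR satellite_id IS NULL;'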

If it completes successfully, run this second query:

sqlite3 /mnt/node/storage/bandwidth_test.db 'INSERT INTO bandwidth_usage_new (
    interval_start,
    satellite_id,
    put_total,
    get_total,
    get_audit_total,
    get_repair_total,
    put_repair_total,
    delete_total
)
SELECT
    datetime(date(created_at)) AS interval_start,
    satellite_id,
    SUM(CASE WHEN action = 1 THEN amount ELSE 0 END) AS put_total,
    SUM(CASE WHEN action = 2 THEN amount ELSE 0 END) AS get_total,
    SUM(CASE WHEN action = 3 THEN amount ELSE 0 END) AS get_audit_total,
    SUM(CASE WHEN action = 4 THEN amount ELSE 0 END) AS get_repair_total,
    SUM(CASE WHEN action = 5 THEN amount ELSE 0 END) AS put_repair_total,
    SUM(CASE WHEN action = 6 THEN amount ELSE 0 END) AS delete_total
FROM
    bandwidth_usage
GROUP BY
    datetime(date(created_at)), satellite_id, action
ON CONFLICT(interval_start, satellite_id) DO UPDATE SET
    put_total = put_total + excluded.put_total,
    get_total = get_total + excluded.get_total,
    get_audit_total = get_audit_total + excluded.get_audit_total,
    get_repair_total = get_repair_total + excluded.get_repair_total,
    put_repair_total = put_repair_total + excluded.put_repair_total,
    delete_total = delete_total + excluded.delete_total;'
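
If both inserts complete, you can optionally sanity-check the backfilled table in the test copy before cleaning up. A minimal sketch, again only reading the test database created above:

sqlite3 /mnt/node/storage/bandwidth_test.db 'SELECT COUNT(*) AS row_count,
       MIN(interval_start) AS first_interval,
       MAX(interval_start) AS last_interval
FROM bandwidth_usage_new;'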

You can then remove this test database:

rm -rf /mnt/node/storage/bandwidth_test.db

Kindly post any issues that come up.


Thank you @Ironcurtain.

We found some data in your bandwidth.db that suggests there might be data corruption in the database.

While we’re fixing the migration to handle such cases, you can:

  1. run these queries on the actual bandwidth.db database to remove the invalid rows, then restart your node (see the example below):

    DELETE FROM bandwidth_usage_rollups WHERE date(interval_start) IS NULL OR satellite_id IS NULL;
    DELETE FROM bandwidth_usage WHERE date(created_at) IS NULL OR satellite_id IS NULL;
    
  2. if you don’t care much about losing the historical data shown on the dashboard graph, you can delete the database and restart the node

The database is also really large; after removing the rows you can run VACUUM; on it to reduce its size.
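
For example, with the node stopped first, the whole cleanup could look like this. This is only a sketch: the backup path is an example, the rest uses the paths and queries from this thread:

# stop the node container first, then keep a backup copy of the file before touching it
cp /mnt/node/storage/bandwidth.db /mnt/node/storage/bandwidth.db.bak

sqlite3 /mnt/node/storage/bandwidth.db '
DELETE FROM bandwidth_usage_rollups WHERE date(interval_start) IS NULL OR satellite_id IS NULL;
DELETE FROM bandwidth_usage WHERE date(created_at) IS NULL OR satellite_id IS NULL;
VACUUM;'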


This wouldn’t be enough; the storagenode is unable to correctly re-create a single missing database on its own.


Hello @Ironcurtain,
Welcome to the forum!

If you decide to re-create this database, you will need to follow this guide:
