After an unexpected shutdown, my Storj database ended up “malformed” and orders were no longer being sent because of it. Unfortunately the official fix didn’t work (I was “unlucky”) and I had to go deeper. This is how.
- First things first, follow the official procedure (you might be lucky):
- Step 5.4 will tell you which database(s) are damaged. In my case only orders.db was affected (the biggest one!)
-rw-r--r-- 1 root root 421M août 4 15:08 orders.db
The ‘PRAGMA integrity_check;’ returned:
*** in database main ***
On tree page 25419 cell 7: Rowid 58403 out of order
On tree page 35972 cell 7: Rowid 61723 out of order
On tree page 7313 cell 293: 2nd reference to page 35972
On tree page 7240 cell 270: 2nd reference to page 25419
On tree page 7240 cell 100: 2nd reference to page 26242
On tree page 7240 cell 96: 2nd reference to page 2119
On tree page 31252 cell 7: Rowid 1956512 out of order
On tree page 2726 cell 33: 2nd reference to page 29442
On tree page 2726 cell 32: 2nd reference to page 2086
On tree page 2726 cell 31: 2nd reference to page 36689
On tree page 2726 cell 30: 2nd reference to page 31195
On tree page 2726 cell 29: 2nd reference to page 26258
On tree page 2726 cell 28: 2nd reference to page 1000
On tree page 2726 cell 27: 2nd reference to page 31478
On tree page 2726 cell 26: 2nd reference to page 25234
On tree page 2726 cell 24: 2nd reference to page 1019
On tree page 2726 cell 23: 2nd reference to page 38581
On tree page 2726 cell 22: 2nd reference to page 5810
On tree page 2726 cell 21: 2nd reference to page 29484
On tree page 2726 cell 20: 2nd reference to page 31832
On tree page 2726 cell 19: 2nd reference to page 1300
On tree page 2726 cell 18: 2nd reference to page 36434
On tree page 2726 cell 17: 2nd reference to page 32349
On tree page 2726 cell 16: 2nd reference to page 8773
On tree page 2726 cell 15: 2nd reference to page 2435
On tree page 2726 cell 14: 2nd reference to page 24235
On tree page 2726 cell 13: 2nd reference to page 24139
On tree page 2726 cell 12: 2nd reference to page 12768
On tree page 2726 cell 11: 2nd reference to page 30481
On tree page 2726 cell 10: 2nd reference to page 23167
On tree page 2726 cell 9: 2nd reference to page 26278
On tree page 2726 cell 8: 2nd reference to page 26467
On tree page 2726 cell 7: 2nd reference to page 31627
On tree page 2726 cell 6: 2nd reference to page 7861
On tree page 2726 cell 5: 2nd reference to page 2582
On tree page 2726 cell 4: 2nd reference to page 31717
On tree page 2726 cell 3: 2nd reference to page 38257
On tree page 2726 cell 2: 2nd reference to page 22877
On tree page 2726 cell 1: 2nd reference to page 23182
On tree page 449 cell 53: 2nd reference to page 26004
Page 40 is never used
Page 51 is never used
Page 29025 is never used
Page 34673 is never used
NULL value in order_archive_.archived_at
row 33 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 34 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 35 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 36 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 37 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 38 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 39 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 40 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 49 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 50 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 51 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 52 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 53 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 54 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 55 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 56 missing from index idx_order_archived_at
row 1 missing from index idx_orders
row 2 missing from index idx_orders
row 3 missing from index idx_orders
row 4 missing from index idx_orders
row 5 missing from index idx_orders
row 6 missing from index idx_orders
row 7 missing from index idx_orders
row 8 missing from index idx_orders
row 20 missing from index idx_orders
row 21 missing from index idx_orders
row 22 missing from index idx_orders
row 23 missing from index idx_orders
row 24 missing from index idx_orders
row 25 missing from index idx_orders
row 26 missing from index idx_orders
row 27 missing from index idx_orders
row 28 missing from index idx_orders
row 29 missing from index idx_orders
row 30 missing from index idx_orders
row 31 missing from index idx_orders
row 32 missing from index idx_orders
row 33 missing from index idx_orders
row 34 missing from index idx_orders
row 35 missing from index idx_orders
Not looking good!
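For the curious, `PRAGMA integrity_check;` walks every b-tree page and index and reports anything inconsistent. A healthy database answers with a single `ok` row instead of a wall of errors like the one above; a minimal sketch using Python’s built-in sqlite3 (in-memory database for illustration):

```python
import sqlite3

# On a healthy (here: brand new, empty) database, integrity_check
# returns a single row containing "ok". A corrupted file returns
# one row per problem, like the page/rowid errors shown above.
conn = sqlite3.connect(":memory:")
result = conn.execute("PRAGMA integrity_check;").fetchall()
print(result)  # [('ok',)]
conn.close()
```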
- Following the documentation to dump the database to a SQL file (steps 9 and 10) and then remove all transactions (step 11), I got a clean dump to reimport (step 13).
-rw-r--r-- 1 root root 666M août 4 15:19 dump_all_notrans.sql
-rw-r--r-- 1 root root 666M août 4 15:11 dump_all.sql
In my case reimporting the dump took ~15 hours (during which the node was offline). Brace yourself for your node reputation…
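For reference, steps 9 through 11 boil down to dumping the database to SQL text, stripping the transaction wrappers, and replaying the dump into a fresh file. A rough equivalent using Python’s built-in sqlite3 (table and values are illustrative, not the real Storj schema):

```python
import sqlite3

# Build a tiny source database standing in for the damaged one.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE unsent_order (satellite_id BLOB, serial_number BLOB)")
src.execute("INSERT INTO unsent_order VALUES (x'01', x'aa')")

# iterdump() emits the same SQL as the sqlite3 CLI's .dump command;
# dropping BEGIN TRANSACTION/COMMIT mirrors step 11, so that rows
# which fail to insert don't roll back the whole import.
dump = [line for line in src.iterdump()
        if line not in ("BEGIN TRANSACTION;", "COMMIT;")]

# Replay the cleaned dump into a fresh database (the .read step).
dst = sqlite3.connect(":memory:")
dst.executescript("\n".join(dump))
count = dst.execute("SELECT count(*) FROM unsent_order").fetchone()[0]
print(count)  # 1
```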
Unfortunately reimporting the dump did not go well:
/data # sqlite3 orders.db ".read dump_all_notrans.sql"
Error: near line 762471: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762472: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762473: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762474: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762475: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762476: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762477: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762478: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762487: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762488: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762489: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762490: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762491: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762492: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762493: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762494: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762716: UNIQUE constraint failed: unsent_order.satellite_id, unsent_order.serial_number
And as step 16 states:
If you are lucky and all corrupted sqlite3 databases are fixed, then you can start the storagenode again. Warning: If you were not successful with the fix of the database, then your stat is lost.
No more steps. The database was missing an important constraint, and Storj refused to start because the schema of unsent_order did not match:
août 05 05:40:21 storjserver docker[13463]: 2020-08-05T03:40:21.885Z INFO db.migration Database Version {"version": 43}
août 05 05:40:22 storjserver docker[13463]: Error: Error during preflight check for storagenode databases: storage node preflight database error: orders: expected schema does not match actual: &dbschema.Schema{
août 05 05:40:22 storjserver docker[13463]: Tables: []*dbschema.Table{&{Name: "order_archive_", Columns: []*dbschema.Column{&{Name: "archived_at", Type: "TIMESTAMP"}, &{Name: "order_limit_serialized", Type: "BLOB"}, &{Name: "order_serialized", Type: "BLOB"}, &{Name: "satellite_id", Type: "BLOB"}, &{Name: "serial_number", Type: "BLOB"}, &{Name: "status", Type: "INTEGER"}, &{Name: "uplink_cert_id", Type: "INTEGER", Reference: &dbschema.Reference{Table: "certificate", Column: "cert_id"}}}}, &{Name: "unsent_order", Columns: []*dbschema.Column{&{Name: "order_limit_expiration", Type: "TIMESTAMP"}, &{Name: "order_limit_serialized", Type: "BLOB"}, &{Name: "order_serialized", Type: "BLOB"}, &{Name: "satellite_id", Type: "BLOB"}, &{Name: "serial_number", Type: "BLOB"}, &{Name: "uplink_cert_id", Type: "INTEGER", Reference: &dbschema.Reference{Table: "certificate", Column: "cert_id"}}}}},
août 05 05:40:22 storjserver docker[13463]: Indexes: []*dbschema.Index{
août 05 05:40:22 storjserver docker[13463]: &{Name: "idx_order_archived_at", Table: "order_archive_", Columns: []string{"archived_at"}},
août 05 05:40:22 storjserver docker[13463]: - s`Index<Table: unsent_order, Name: idx_orders, Columns: satellite_id serial_number, Unique: true, Partial: "">`,
août 05 05:40:22 storjserver docker[13463]: },
août 05 05:40:22 storjserver docker[13463]: }
août 05 05:40:22 storjserver docker[13463]: storj.io/storj/storagenode/storagenodedb.(*DB).Preflight:346
août 05 05:40:22 storjserver docker[13463]: main.cmdRun:193
août 05 05:40:22 storjserver docker[13463]: storj.io/private/process.cleanup.func1.4:353
août 05 05:40:22 storjserver docker[13463]: storj.io/private/process.cleanup.func1:371
août 05 05:40:22 storjserver docker[13463]: github.com/spf13/cobra.(*Command).execute:840
août 05 05:40:22 storjserver docker[13463]: github.com/spf13/cobra.(*Command).ExecuteC:945
août 05 05:40:22 storjserver docker[13463]: github.com/spf13/cobra.(*Command).Execute:885
août 05 05:40:22 storjserver docker[13463]: storj.io/private/process.ExecWithCustomConfig:88
août 05 05:40:22 storjserver docker[13463]: storj.io/private/process.ExecCustomDebug:70
août 05 05:40:22 storjserver docker[13463]: main.main:323
août 05 05:40:22 storjserver docker[13463]: runtime.main:203
- So, time to do some SQL in an attempt to save this DB. Let’s check the import errors; there are two kinds:
NOT NULL constraint failed
This one is “fine”: a row was not inserted because it was malformed, setting a column to NULL where the database schema says this is illegal. Here the constraint actually protected the database from a damaged row by skipping it. Not much to do: everything is already “fixed” (these rows are lost).
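This behaviour is easy to reproduce in isolation. A toy sketch with Python’s sqlite3, using a stripped-down stand-in for the order_archive_ table (only two columns, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_archive_ "
    "(serial_number BLOB, archived_at TIMESTAMP NOT NULL)"
)

# A row carrying a NULL archived_at is rejected at insert time --
# the same 'NOT NULL constraint failed' seen during the reimport.
# The damaged row is skipped instead of poisoning the table.
try:
    conn.execute("INSERT INTO order_archive_ VALUES (x'01', NULL)")
    rejected = False
except sqlite3.IntegrityError as e:
    rejected = True
    print(e)  # NOT NULL constraint failed: order_archive_.archived_at

rows = conn.execute("SELECT count(*) FROM order_archive_").fetchone()[0]
print(rows)  # 0 -- the table stays clean
```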
UNIQUE constraint failed: unsent_order.satellite_id, unsent_order.serial_number
This one is more problematic. This constraint actually declares a composite primary key over 2 columns (satellite_id and serial_number) on the unsent_order table. The error states that the constraint cannot be created.
- Why? Because there are duplicates in this table: rows sharing the same tuple of satellite_id and serial_number.
- How is this possible? If I had to guess, taking inspiration from the previous error, I would say that these rows have either satellite_id or serial_number set to NULL, therefore creating PK duplicates (which is not allowed).
- Why didn’t the schema protect us like in the other error? This is interesting indeed. The constraint used as PK is created AFTER all the rows have been inserted. Creating the constraint first would have prevented duplicates (while still allowing NULL values, as long as the composite PK stayed unique).
- Another guess: since the constraint is actually an index here (the exact line has the following SQL: CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number); ), creating the index first and adding the rows afterwards would have a big performance impact, as the index would be updated after every insert. By adding the index after inserting all the rows, the index is only computed once. This is not an issue in normal operation (as the dump is not supposed to contain rogue/malformed rows).
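The timing really is the whole story: the same duplicated rows load into a bare table without complaint, and only the later CREATE UNIQUE INDEX blows up. A toy reproduction with the table reduced to the two key columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unsent_order (satellite_id BLOB, serial_number BLOB)")

# With no index in place yet, duplicated (satellite_id, serial_number)
# tuples insert just fine -- exactly what happened during the dump replay.
conn.execute("INSERT INTO unsent_order VALUES (x'01', x'aa')")
conn.execute("INSERT INTO unsent_order VALUES (x'01', x'aa')")

# Building the unique index afterwards is where the failure surfaces,
# with the same message seen at the end of the reimport log.
try:
    conn.execute(
        "CREATE UNIQUE INDEX idx_orders "
        "ON unsent_order(satellite_id, serial_number)"
    )
    index_failed = False
except sqlite3.IntegrityError as e:
    index_failed = True
    print(e)  # UNIQUE constraint failed: unsent_order.satellite_id, unsent_order.serial_number
```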
How to fix this?
- First, restart the sqlite3 container and open the db.
- Then let’s try to validate the hypothesis of non-unique tuples of satellite_id and serial_number over multiple rows:
SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1;
I’ll spare you the output, as it is a string representation of binary blobs: but some results do show up!
- Let’s transform the query a bit for a more human-friendly output, and to prepare for the next step:
SELECT count(*) FROM unsent_order WHERE (satellite_id, serial_number) IN (SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1);
Output: 248
- This means 248 rows are damaged and prevent the creation of the index: let’s remove them.
DELETE FROM unsent_order WHERE (satellite_id, serial_number) IN (SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1);
- Validate they are gone:
SELECT count(*) FROM unsent_order WHERE (satellite_id, serial_number) IN (SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1);
Output: 0
- All damaged rows have been deleted; let’s try to recreate the index.
CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number);
Success!
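The whole repair sequence can be replayed on a toy table to check the logic; a sketch, assuming a simplified unsent_order with one duplicated tuple. Note that the DELETE drops every copy of a duplicated tuple, not just the extras — acceptable here, since those rows were damaged anyway:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unsent_order (satellite_id BLOB, serial_number BLOB)")
conn.executemany("INSERT INTO unsent_order VALUES (?, ?)",
                 [(b"\x01", b"\xaa"), (b"\x01", b"\xaa"), (b"\x02", b"\xbb")])

# Same GROUP BY ... HAVING filter used above to spot duplicated tuples
# (row-value IN requires SQLite >= 3.15, which any recent build has).
dup_filter = """(satellite_id, serial_number) IN (
    SELECT satellite_id, serial_number FROM unsent_order
    GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1)"""

# Count, delete, re-count -- then the unique index finally builds.
before = conn.execute(f"SELECT count(*) FROM unsent_order WHERE {dup_filter}").fetchone()[0]
conn.execute(f"DELETE FROM unsent_order WHERE {dup_filter}")
after = conn.execute(f"SELECT count(*) FROM unsent_order WHERE {dup_filter}").fetchone()[0]
conn.execute("CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number)")
print(before, after)  # 2 0
```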
After that, exiting sqlite3 (.exit) and the container, then restarting the storage node did the trick:
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs sending {"count": 199595}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z INFO orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE sending {"count": 37057}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z INFO orders.118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW sending {"count": 11616}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z INFO orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB sending {"count": 206826}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S sending {"count": 155547}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 sending {"count": 151530}
I hope this little adventure of mine will help other “unlucky” SNOs.