"database disk image is malformed" - force your luck when official fix is not enough

After an unexpected shutdown my Storj database ended up “malformed” and orders were no longer being sent because of it. Unfortunately the official fix didn’t work (I was “unlucky”) and I had to go deeper. Here is how.

  • First things first, follow the official procedure (you might be lucky).
  • Step 5.4 will tell you which database(s) are damaged. In my case only orders.db was affected (the biggest one!):

    -rw-r--r-- 1 root  root  421M août   4 15:08 orders.db
    

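For reference, the check can be run non-interactively from the sqlite3 CLI (a sketch, assuming the shell is in the directory containing the database):

    sqlite3 orders.db "PRAGMA integrity_check;"
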
In my case ‘PRAGMA integrity_check;’ returned:

*** in database main ***
On tree page 25419 cell 7: Rowid 58403 out of order
On tree page 35972 cell 7: Rowid 61723 out of order
On tree page 7313 cell 293: 2nd reference to page 35972
On tree page 7240 cell 270: 2nd reference to page 25419
On tree page 7240 cell 100: 2nd reference to page 26242
On tree page 7240 cell 96: 2nd reference to page 2119
On tree page 31252 cell 7: Rowid 1956512 out of order
On tree page 2726 cell 33: 2nd reference to page 29442
On tree page 2726 cell 32: 2nd reference to page 2086
On tree page 2726 cell 31: 2nd reference to page 36689
On tree page 2726 cell 30: 2nd reference to page 31195
On tree page 2726 cell 29: 2nd reference to page 26258
On tree page 2726 cell 28: 2nd reference to page 1000
On tree page 2726 cell 27: 2nd reference to page 31478
On tree page 2726 cell 26: 2nd reference to page 25234
On tree page 2726 cell 24: 2nd reference to page 1019
On tree page 2726 cell 23: 2nd reference to page 38581
On tree page 2726 cell 22: 2nd reference to page 5810
On tree page 2726 cell 21: 2nd reference to page 29484
On tree page 2726 cell 20: 2nd reference to page 31832
On tree page 2726 cell 19: 2nd reference to page 1300
On tree page 2726 cell 18: 2nd reference to page 36434
On tree page 2726 cell 17: 2nd reference to page 32349
On tree page 2726 cell 16: 2nd reference to page 8773
On tree page 2726 cell 15: 2nd reference to page 2435
On tree page 2726 cell 14: 2nd reference to page 24235
On tree page 2726 cell 13: 2nd reference to page 24139
On tree page 2726 cell 12: 2nd reference to page 12768
On tree page 2726 cell 11: 2nd reference to page 30481
On tree page 2726 cell 10: 2nd reference to page 23167
On tree page 2726 cell 9: 2nd reference to page 26278
On tree page 2726 cell 8: 2nd reference to page 26467
On tree page 2726 cell 7: 2nd reference to page 31627
On tree page 2726 cell 6: 2nd reference to page 7861
On tree page 2726 cell 5: 2nd reference to page 2582
On tree page 2726 cell 4: 2nd reference to page 31717
On tree page 2726 cell 3: 2nd reference to page 38257
On tree page 2726 cell 2: 2nd reference to page 22877
On tree page 2726 cell 1: 2nd reference to page 23182
On tree page 449 cell 53: 2nd reference to page 26004
Page 40 is never used
Page 51 is never used
Page 29025 is never used
Page 34673 is never used
NULL value in order_archive_.archived_at
row 33 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 34 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 35 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 36 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 37 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 38 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 39 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 40 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 49 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 50 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 51 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 52 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 53 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 54 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 55 missing from index idx_order_archived_at
NULL value in order_archive_.archived_at
row 56 missing from index idx_order_archived_at
row 1 missing from index idx_orders
row 2 missing from index idx_orders
row 3 missing from index idx_orders
row 4 missing from index idx_orders
row 5 missing from index idx_orders
row 6 missing from index idx_orders
row 7 missing from index idx_orders
row 8 missing from index idx_orders
row 20 missing from index idx_orders
row 21 missing from index idx_orders
row 22 missing from index idx_orders
row 23 missing from index idx_orders
row 24 missing from index idx_orders
row 25 missing from index idx_orders
row 26 missing from index idx_orders
row 27 missing from index idx_orders
row 28 missing from index idx_orders
row 29 missing from index idx_orders
row 30 missing from index idx_orders
row 31 missing from index idx_orders
row 32 missing from index idx_orders
row 33 missing from index idx_orders
row 34 missing from index idx_orders
row 35 missing from index idx_orders

Not looking good!

  • Following the documentation, I dumped the database to a SQL file (steps 9 and 10), removed all transaction statements (step 11) and got a clean dump to reimport (step 13).

    -rw-r--r-- 1 root  root  666M août   4 15:19 dump_all_notrans.sql
    -rw-r--r-- 1 root  root  666M août   4 15:11 dump_all.sql
    

In my case reimporting the dump took ~15 hours (during which the node stayed offline). Brace yourself for the impact on your node reputation…
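
For reference, steps 9 to 13 boil down to something like this (a rough sketch; follow the official guide for the exact commands and adapt the paths to your setup):

    sqlite3 orders.db ".dump" > dump_all.sql
    # strip the transaction statements so one bad row does not roll back the whole import
    grep -v -e '^BEGIN TRANSACTION' -e '^COMMIT' -e '^ROLLBACK' dump_all.sql > dump_all_notrans.sql
    mv orders.db orders.db.corrupted   # keep the damaged file around, just in case
    sqlite3 orders.db ".read dump_all_notrans.sql"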

Unfortunately reimporting the dump did not go well:

/data # sqlite3 orders.db ".read dump_all_notrans.sql"
Error: near line 762471: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762472: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762473: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762474: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762475: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762476: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762477: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762478: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762487: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762488: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762489: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762490: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762491: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762492: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762493: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762494: NOT NULL constraint failed: order_archive_.archived_at
Error: near line 762716: UNIQUE constraint failed: unsent_order.satellite_id, unsent_order.serial_number

And as step 16 states:

If you are lucky and all corrupted sqlite3 databases are fixed, then you can start the storagenode again.

Warning. If you were not successful with the fix of the database, then your stat is lost.

No more steps. The database was now missing an important constraint and the storagenode refused to start because the schema of unsent_order did not match:

août 05 05:40:21 storjserver docker[13463]: 2020-08-05T03:40:21.885Z        INFO        db.migration       Database Version        {"version": 43}
août 05 05:40:22 storjserver docker[13463]: Error: Error during preflight check for storagenode databases: storage node preflight database error: orders: expected schema does not match actual:   &dbschema.Schema{
août 05 05:40:22 storjserver docker[13463]:           Tables: []*dbschema.Table{&{Name: "order_archive_", Columns: []*dbschema.Column{&{Name: "archived_at", Type: "TIMESTAMP"}, &{Name: "order_limit_serialized", Type: "BLOB"}, &{Name: "order_serialized", Type: "BLOB"}, &{Name: "satellite_id", Type: "BLOB"}, &{Name: "serial_number", Type: "BLOB"}, &{Name: "status", Type: "INTEGER"}, &{Name: "uplink_cert_id", Type: "INTEGER", Reference: &dbschema.Reference{Table: "certificate", Column: "cert_id"}}}}, &{Name: "unsent_order", Columns: []*dbschema.Column{&{Name: "order_limit_expiration", Type: "TIMESTAMP"}, &{Name: "order_limit_serialized", Type: "BLOB"}, &{Name: "order_serialized", Type: "BLOB"}, &{Name: "satellite_id", Type: "BLOB"}, &{Name: "serial_number", Type: "BLOB"}, &{Name: "uplink_cert_id", Type: "INTEGER", Reference: &dbschema.Reference{Table: "certificate", Column: "cert_id"}}}}},
août 05 05:40:22 storjserver docker[13463]:           Indexes: []*dbschema.Index{
août 05 05:40:22 storjserver docker[13463]:                   &{Name: "idx_order_archived_at", Table: "order_archive_", Columns: []string{"archived_at"}},
août 05 05:40:22 storjserver docker[13463]: -                 s`Index<Table: unsent_order, Name: idx_orders, Columns: satellite_id serial_number, Unique: true, Partial: "">`,
août 05 05:40:22 storjserver docker[13463]:           },
août 05 05:40:22 storjserver docker[13463]:   }
août 05 05:40:22 storjserver docker[13463]:         storj.io/storj/storagenode/storagenodedb.(*DB).Preflight:346
août 05 05:40:22 storjserver docker[13463]:         main.cmdRun:193
août 05 05:40:22 storjserver docker[13463]:         storj.io/private/process.cleanup.func1.4:353
août 05 05:40:22 storjserver docker[13463]:         storj.io/private/process.cleanup.func1:371
août 05 05:40:22 storjserver docker[13463]:         github.com/spf13/cobra.(*Command).execute:840
août 05 05:40:22 storjserver docker[13463]:         github.com/spf13/cobra.(*Command).ExecuteC:945
août 05 05:40:22 storjserver docker[13463]:         github.com/spf13/cobra.(*Command).Execute:885
août 05 05:40:22 storjserver docker[13463]:         storj.io/private/process.ExecWithCustomConfig:88
août 05 05:40:22 storjserver docker[13463]:         storj.io/private/process.ExecCustomDebug:70
août 05 05:40:22 storjserver docker[13463]:         main.main:323
août 05 05:40:22 storjserver docker[13463]:         runtime.main:203
  • So, time to do some SQL in an attempt to save this DB. Let’s look at the import errors; there are two kinds:

NOT NULL constraint failed

This one is “fine”: a row was not inserted because it was malformed and would have set a column to NULL where the database schema declares that illegal. Here the constraint actually protected the database from a damaged row by skipping it. Not much to do: everything is already “fixed” (these rows are lost).
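
If you want to double check that no such row slipped through after the reimport, a query along these lines should return 0 (assuming the same order_archive_ schema):

    SELECT count(*) FROM order_archive_ WHERE archived_at IS NULL;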

UNIQUE constraint failed: unsent_order.satellite_id, unsent_order.serial_number

This one is more problematic. The constraint is effectively a composite primary key over two columns (satellite_id and serial_number) of the unsent_order table, and the error means it could not be created.

  • Why? Because there are duplicates in this table: several rows share the same (satellite_id, serial_number) pair.
  • How is this possible? If I had to guess, taking inspiration from the previous error, I would say that these rows have either satellite_id or serial_number set to NULL, thereby creating PK duplicates (which is not allowed).
  • Why didn’t the schema protect us like in the other error? This is indeed interesting. The constraint used as PK is created AFTER all the rows have been inserted. Creating it before the inserts would have prevented the duplicates (while still allowing NULL values, as long as the composite key stayed unique).
  • Another guess: since the constraint is actually a unique index here (the exact line in the dump is CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number); ), creating the index first and inserting the rows afterwards would have a big performance impact, because the index would be updated after every insert. By creating the index after all the rows are inserted, it is only built once. This is not an issue in normal operation, as the dump is not supposed to contain rogue/malformed rows (see the sketch below).
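
To illustrate the difference in ordering on a simplified table (not the real schema, just a sketch): with the dump’s order the failure only shows up when the index is built at the end, while with the index created first it is the duplicate INSERT itself that fails and gets skipped.

    -- dump order: data first, index last
    CREATE TABLE t (satellite_id BLOB NOT NULL, serial_number BLOB NOT NULL);
    INSERT INTO t VALUES (X'AA', X'01');
    INSERT INTO t VALUES (X'AA', X'01');                          -- duplicate sneaks in
    CREATE UNIQUE INDEX idx_t ON t(satellite_id, serial_number);  -- fails here, after the fact

    -- index-first order: the second INSERT fails instead, the data stays clean
    CREATE TABLE t2 (satellite_id BLOB NOT NULL, serial_number BLOB NOT NULL);
    CREATE UNIQUE INDEX idx_t2 ON t2(satellite_id, serial_number);
    INSERT INTO t2 VALUES (X'AA', X'01');
    INSERT INTO t2 VALUES (X'AA', X'01');                         -- UNIQUE constraint failed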

How to fix this?

  • First, restart the sqlite3 container and open the database again.
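
Something like this (a sketch; the mount path and image are examples from my setup, use whatever you used for the official procedure):

    docker run --rm -it -v /mnt/storj/storage:/data sstc/sqlite3 sh
    # then, inside the container:
    /data # sqlite3 orders.db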

  • Then let’s try to validate the hypothesis of non-unique (satellite_id, serial_number) tuples spread over multiple rows:

    SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1;
    

I’ll spare you the output since it is a string representation of binary blobs, but some results do show up!
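
(If you do want to inspect them, wrapping the blobs in hex() makes the output readable; same query, only the presentation changes:)

    SELECT hex(satellite_id), hex(serial_number), COUNT(*)
    FROM unsent_order
    GROUP BY satellite_id, serial_number
    HAVING COUNT(*) > 1;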

  • Let’s transform the query a bit to get a more human-friendly output and to prepare for the next step:

    SELECT count(*) FROM unsent_order WHERE (satellite_id, serial_number) IN (SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1);
    

Output: 248

  • This means 248 rows are damaged and prevent the creation of the index: let’s remove them.

    DELETE FROM unsent_order WHERE (satellite_id, serial_number) IN (SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1);
    
  • Validate they are gone

    SELECT count(*) FROM unsent_order WHERE (satellite_id, serial_number) IN (SELECT satellite_id, serial_number FROM unsent_order GROUP BY satellite_id, serial_number HAVING COUNT(*) > 1);
    

Output: 0

  • All damaged rows have been deleted; let’s try to recreate the index:

    CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number);
    

Success!

After that, exiting sqlite3 (.exit) and the container itself, then restarting the storage node did the trick:

août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z        INFO        orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs        sending        {"count": 199595}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z        INFO        orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE        sending        {"count": 37057}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z        INFO        orders.118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW        sending        {"count": 11616}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z        INFO        orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB        sending        {"count": 206826}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z        INFO        orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S        sending        {"count": 155547}
août 05 06:27:30 storjserver docker[15114]: 2020-08-05T04:27:30.068Z        INFO        orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6        sending        {"count": 151530}

I hope this little adventure of mine will help other “unlucky” SNOs :slight_smile:


Great explanation!

The downside with your eventual approach is that you removed both occurrences of the duplicate records. Probably not a big deal, but you’re missing some orders as a result.

I would still prefer creating the index first. If you do this import on tmpfs (in ram) it’s still pretty fast, even if it has to check the unique index. That way the first insert of a duplicate row will work and the second will fail.


The downside with your eventual approach is that you removed both occurrences of the duplicate records. Probably not a big deal, but you’re missing some orders as a result.

Indeed. I was in the mood of “I might lose all my orders, so I don’t care if I have to remove a bit more than necessary (I only had 248 suspects) as long as I can fix my db”. Handling binary blobs is not really pleasant, but you are right: some of these orders might have been legit and could have been saved by going even deeper.

I would still prefer creating the index first. If you do this import on tmpfs (in ram) it’s still pretty fast, even if it has to check the unique index. That way the first insert of a duplicate row will work and the second will fail.

tmpfs… obviously… the downside of fixing something at 5 am is not having all of your brain powered on :sweat_smile:
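
For the record, doing the import in RAM would look roughly like this (a sketch; sizes and paths are examples):

    mount -t tmpfs -o size=2G tmpfs /mnt/ramdisk
    cp dump_all_notrans.sql /mnt/ramdisk/
    cd /mnt/ramdisk
    sqlite3 orders.db ".read dump_all_notrans.sql"
    cp orders.db /mnt/storj/storage/orders.db   # move the rebuilt database back to persistent storage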

But even with this nice speedup there is still a risk: the index will skip duplicates, but like the previous error it is still possible to have one of the two binary blobs at NULL (or both, which is not something you want). The index will not protect you from the first tuple of the form (something, NULL), (NULL, something) or (NULL, NULL). If you go that way, then like in the first point I would validate that no NULL is present after the insertion to be 100% safe, with the index taking care of all the duplicates (see the query below).
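
Something along these lines, expecting 0:

    SELECT count(*) FROM unsent_order
    WHERE satellite_id IS NULL OR serial_number IS NULL;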

TL;DR: I went the quick & safe way, accepting a small risk of deleting some valid rows, but you can definitely go deeper and make sure not to remove any valid row by being extra careful :slight_smile:

Thanks for your feedback!


Hmmm, can’t we do a looping stored procedure to search for duplicates and delete the ones with the highest rowid, so that eventually only one unique row remains without any duplicate entries?

something like

  • retrieve all duplicates + count, grouped by satellite_id, serial_number having count > 1
  • for each one, for as many as count - 1, delete the record that matches the unique identifier and the max rowid… or you could just delete all records that match the unique identifier and are not the min/max rowid, no need for loops there (roughly as sketched below)
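
In SQLite that second variant could look roughly like this (a sketch: keep the row with the lowest rowid of each duplicate group, no loop or stored procedure needed):

    DELETE FROM unsent_order
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM unsent_order
        GROUP BY satellite_id, serial_number
    );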

By the end of the above run it should have fixed all the duplicate issues as part of the database repair?

I guess this could work, but I am not skilled enough in sqlite to answer about stored procedures :smiley:

As for the rows with a possible NULL satellite_id/serial_number, I have to correct myself on this hypothesis: the schema already has a NOT NULL constraint on both columns, so such an insert would fail. This means all the duplicates were “real” duplicates:

sqlite> .schema unsent_order
CREATE TABLE unsent_order (
  satellite_id  BLOB NOT NULL,
  serial_number BLOB NOT NULL,

  order_limit_serialized BLOB      NOT NULL, -- serialized pb.OrderLimit
  order_serialized       BLOB      NOT NULL, -- serialized pb.Order
  order_limit_expiration TIMESTAMP NOT NULL, -- when is the deadline for sending it

  uplink_cert_id INTEGER NOT NULL,

  FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number);
sqlite>

(this means I did delete some valid rows that could have been kept once their extra duplicates were removed, as @BrightSilence pointed out)