Returned to find my node offline and restarted to no avail

Rapturoso · August 24, 2020, 9:29pm

I tried to reinstall but the MSI installer just asked if I wanted to update Storj. Even after updating and rebooting and trying several times to start the Storj Windows Service it would still not run.

What’s happening here and what can I do to fix this? I do not want to be disqualified!

Here’s the log file.

I have no clue what it’s trying to tell me.

I can’t even post a text format log file here as it keeps telling me new users can only post two links? Really? No links are being posted and this forum seems to think the log file is full of them.

Please help!

mastrip2 · August 24, 2020, 9:46pm

It looks like it’s probably a corrupted database. Try going through the steps to confirm that and fix it.
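For anyone who wants to script that check, here is a minimal sketch in Python. It assumes the node is stopped and that the databases are ordinary SQLite files with a .db extension. The function names and the directory argument are made up for illustration and are not part of any Storj tooling. It runs SQLite’s PRAGMA integrity_check on each file and reports whatever SQLite says.

```python
import sqlite3
from pathlib import Path


def check_database(path):
    """Return "ok" for a healthy SQLite file, otherwise the reported problem."""
    conn = sqlite3.connect(str(path))
    try:
        rows = conn.execute("PRAGMA integrity_check;").fetchall()
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when SQLite cannot read the file as a database.
        return str(exc)
    finally:
        conn.close()
    # integrity_check returns a single row "ok" or one row per problem found.
    return "; ".join(row[0] for row in rows)


def check_all(directory):
    """Check every *.db file in a directory (hypothetical layout)."""
    return {p.name: check_database(p) for p in sorted(Path(directory).glob("*.db"))}
```

Calling check_all on the folder that holds the databases returns a dictionary of file name to result, so any file whose result is not "ok" is the one to look at.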

Rapturoso · August 24, 2020, 11:03pm

It now tells me that there is nothing wrong, or at least that is how it seems: the db files are all reported OK apart from orders.db, which I have been told here before isn’t a database and can be safely ignored when checking the database files for errors.

Any ideas? This is becoming a pain.

Alexey · August 25, 2020, 4:58am

@mastrip2 the “file is not a database” error cannot be fixed with the article “How to fix a malformed database”, but the article can help to figure out which database is unrecoverably corrupted.
@Rapturoso you found the broken database: orders.db.
Please:

Stop the storagenode
Remove the corrupted orders.db database
Execute (replace /path/to/orders.db with the actual path):

sqlite3 /path/to/orders.db

When you see a sqlite> prompt, execute the script:

CREATE TABLE unsent_order (
    satellite_id  BLOB NOT NULL,
    serial_number BLOB NOT NULL,
    order_limit_serialized BLOB      NOT NULL, -- serialized pb.OrderLimit
    order_serialized       BLOB      NOT NULL, -- serialized pb.Order
    order_limit_expiration TIMESTAMP NOT NULL, -- when is the deadline for sending it
    uplink_cert_id INTEGER NOT NULL,
    FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE TABLE order_archive_ (
    satellite_id  BLOB NOT NULL,
    serial_number BLOB NOT NULL,
    order_limit_serialized BLOB NOT NULL,
    order_serialized       BLOB NOT NULL,
    uplink_cert_id INTEGER NOT NULL,
    status      INTEGER   NOT NULL,
    archived_at TIMESTAMP NOT NULL,
    FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number);
CREATE TABLE versions (version int, commited_at text);
.exit

Start the storagenode
Check your logs
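The steps above can also be scripted. Here is a minimal sketch in Python, assuming the storagenode is already stopped. The function name recreate_orders_db and the .corrupt backup suffix are my own choices for illustration. Note that it renames the broken file instead of deleting it, which is a small departure from the steps above. The schema is copied from the script above.

```python
import sqlite3
from pathlib import Path

# Schema copied from the sqlite3 script in this thread.
ORDERS_SCHEMA = """
CREATE TABLE unsent_order (
    satellite_id  BLOB NOT NULL,
    serial_number BLOB NOT NULL,
    order_limit_serialized BLOB      NOT NULL, -- serialized pb.OrderLimit
    order_serialized       BLOB      NOT NULL, -- serialized pb.Order
    order_limit_expiration TIMESTAMP NOT NULL, -- when is the deadline for sending it
    uplink_cert_id INTEGER NOT NULL,
    FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE TABLE order_archive_ (
    satellite_id  BLOB NOT NULL,
    serial_number BLOB NOT NULL,
    order_limit_serialized BLOB NOT NULL,
    order_serialized       BLOB NOT NULL,
    uplink_cert_id INTEGER NOT NULL,
    status      INTEGER   NOT NULL,
    archived_at TIMESTAMP NOT NULL,
    FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number);
CREATE TABLE versions (version int, commited_at text);
"""


def recreate_orders_db(db_path):
    """Move a corrupted orders.db aside and create a fresh one with the schema."""
    path = Path(db_path)
    if path.exists():
        # Assumption: keep the broken file as a backup rather than deleting it.
        path.replace(path.with_name(path.name + ".corrupt"))
    conn = sqlite3.connect(str(path))
    try:
        conn.executescript(ORDERS_SCHEMA)
    finally:
        conn.close()
```

Call it with the actual path to orders.db, then start the storagenode and check the logs as in the last two steps.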

Rapturoso · August 25, 2020, 4:02pm

Thanks for that. It seems to have brought the node back online.

This node is running on server grade hardware with Intel dual processor main board, ECC Registered DRAM and is running 24/7. At no time did the server power down and at no time did I ask the node to stop running, it just went offline with no warning and I have no idea how long it was offline (maybe even a day before I noticed in the web UI?).

How can I avoid a corrupted database in the future?

Also, why is the node’s Suspension and Audit Score still at 100% when it’s been offline for more than a day? Is this accurate?

Alexey · August 25, 2020, 7:41pm

If you use Windows, please disable the write cache for the disk with the storagenode’s data unless you also have a managed UPS in place (one that can shut the server down properly if there is a power loss).

The suspension score is affected only by “unknown” errors during an audit.
The audit score is affected by known audit errors: a piece that is not found, a corrupted piece, or four timeouts in a row for the audit of the same piece.
If your server was offline, there is no way to perform an audit, so there is no reason for either score to change.

Disqualification for downtime is currently disabled. It is still in development.

Rapturoso · August 25, 2020, 10:23pm

Thanks for all the info, however, I did mention in my last post that power never failed so I don’t see how disabling write caching would prevent corruption whilst power is stable.

Alexey · August 26, 2020, 6:07am

I do not believe in magic. To become corrupted, the database file has to be damaged somehow. The known ways to corrupt an SQLite database: https://www.sqlite.org/howtocorrupt.html

On Windows it is a known issue: if the system reboots unexpectedly while the file is open and the write cache is enabled, there is a non-zero probability that the file will be corrupted.
I know that Windows 10 has bugs in every build. It could throw a BSOD and reboot at any time, so I can’t trust it with any serious load, except for Windows Server from a stable, conservative channel.

The other way to break SQLite is to use network-attached drives of any kind. NFS and SMB are not compatible with SQLite. SMB can work in some circumstances (Windows Server and Windows client locally), but even then, if any real networking is involved between client and server, it has several problems: https://forum.storj.io/tag/smb
The only working network protocol is iSCSI, but the network is still involved, and this leads to problems too: https://forum.storj.io/tag/iscsi

If you use a network-attached drive for the storage, please consider changing your setup to use only locally attached drives.

If you use Docker Desktop to run a storagenode, please consider migrating to the Windows GUI: https://documentation.storj.io/resources/faq/migrate-my-node/migrating-from-docker-cli-to-a-gui-install-on-windows
If that is not an option, then at least roll Docker Desktop back to 2.1.0.5, due to problems with all of the latest releases of this product.

If you use an alternative antivirus, add the storage location to the antivirus exceptions list. The data is encrypted and cannot be checked anyway. For the SNO and the antivirus it is just digital noise without knowledge of the encryption keys and of where all the other pieces are located (i.e. you would need to have an API key too).

Rapturoso · August 26, 2020, 11:38am

I thought about all this too.

At no point did the system reboot at the time the node went down.

No BSOD happened. The Windows logs show this and as another factor for confirmation, all other programmes that I had started manually and that are not set to automatically start at boot were still running when I returned to find the node offline.

There are no network attached storage devices. All drives are local and connected to the server’s SAS SATA connectors.

I have never used Docker and I have always used the Windows GUI install.

This problem isn’t due to hardware or OS failure, power interruption, using external/network attached storage or write caching. So what else could be the problem with Storj here?

Alexey · August 26, 2020, 6:22pm

I have no idea. This problem has not been seen before. You are the first one.

Usually database corruption is related either to the disk subsystem or to abrupt termination of the storagenode service while it is working.

Did you have any OOM kills for the storagenode service?

thepaul · August 26, 2020, 6:27pm

This might be a new case! Do you have logs for the storagenode back from before the error started occurring?

Also, are there any indications of disk write errors? One possibility is that the db ended up on a bad sector or the like.

thepaul · August 26, 2020, 6:30pm

Just to clarify, orders.db is a database. It’s possible someone got confused by the error “file is not a database”, which sqlite outputs when it thinks that a file is not a database, because of some sort of corruption.
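To illustrate what SQLite is looking at: a well-formed SQLite file begins with the 16-byte header string "SQLite format 3" followed by a NUL byte, and a file whose header does not match is one case where SQLite gives up with that error. Below is a minimal sketch in Python that checks only that header. The function name is made up for illustration, and a passing check does not prove that the rest of the file is healthy.

```python
# The first 16 bytes of every well-formed SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def has_sqlite_header(path):
    """Return True if the file starts with the SQLite 3 header string."""
    with open(path, "rb") as f:
        return f.read(16) == SQLITE_MAGIC
```

A zero-length file returns False here even though SQLite itself treats an empty file as a new, empty database, so this is only a quick first look and not a replacement for a real integrity check.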

Alexey · August 26, 2020, 6:59pm

Unfortunately there is only one fix for this corruption: delete the file and create a new one with half of the schema (the migration on the storagenode will finish the process).

BrightSilence · August 27, 2020, 3:49pm

Ironically after the next update it literally won’t be a database anymore.

thepaul · August 27, 2020, 5:39pm

Heh, that’s true, although it won’t be called “orders.db” anymore.

Rapturoso · September 2, 2020, 7:06pm

I have received an email in the last hour saying that my node is disqualified on the us-central-1 satellite, even though it has been fixed a few times for orders.db corruption and was quite happily online and working fine when the email was sent.

At no time has the node not had access to the storj location on the hard disc drive whilst the node has been online, yet it’s been disqualified after 7 months of pretty consistent uptime, barring the 3 times that the orders.db corruption problem took the node offline straight away pending recreation of the orders.db file, which took just a few hours each time whilst the node was completely offline. Those three times the node went offline consistently, not remaining online to be audited and checked for corruption. Why oh why is this reputable node disqualified? FFS. WTF is happening?

Alexey · September 2, 2020, 7:11pm

The disqualification has nothing to do with any database.
Disqualification happens when the audit score drops below 0.6 (60% on the dashboard).
This can happen only when your node is unable to provide a piece for audit: the piece is either lost or corrupted, or the node has lost access to it.

So, please search your logs for lines that contain both GET_AUDIT and failed to figure out what happened.
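Here is a minimal sketch of that search in Python, in case it helps. It only assumes that the log is a plain-text file with one entry per line. The function name and the log_path mentioned in the usage note are placeholders, not real Storj defaults.

```python
def find_failed_audits(lines):
    """Return the lines that mention both GET_AUDIT and failed."""
    return [
        line.rstrip("\n")
        for line in lines
        if "GET_AUDIT" in line and "failed" in line
    ]
```

Pass it an open file, for example open(log_path, encoding="utf-8", errors="replace"), where log_path is wherever your node writes its log. Each returned line is an audit request that the node could not satisfy.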

Rapturoso · September 2, 2020, 7:16pm

FFS. My hard disc drive lost physical connection whilst I was rearranging the server area. Data is still here, all complete. Found the problem immediately after checking SATA SAS connectors. I need to undisqualify this node. How do I go about doing this? Please help.

Edit: Actually no, that’s not what happened at all. The disc just lost connection for some reason without being physically disrupted. Goodness knows what’s happening here.

Alexey · September 2, 2020, 7:22pm

Unfortunately the disqualification is permanent and not reversible.
Neither the satellite nor the storagenode can be sure that the data is intact.
For the network it is much cheaper to simply recover the lost files sometime in the future (when the number of healthy pieces drops below the threshold) than to perform a full audit of your node (to do that, the satellite would have to download every single piece from your node, for the same cost as a recovery).

Rapturoso · September 2, 2020, 7:24pm

That’s absolutely frigging ridiculous. So if someone’s storage is temporarily disrupted whilst no data corruption has happened because there are measures in place to safely close files properly, you still disqualify people’s nodes immediately within less than an hour and not bother to give them reasonable opportunity to bring the node back online? That’s downright nasty.

So basically Storj have literally and unashamedly just stolen the equivalent of what’s in my wallet, tantamount to a mugging. Disgusting.

I’ve worked hard to keep this node online for 7 months and when there have been problems I have jumped on them straight away to resolve them and this is how the network and Storj repay me, with a literal F U ?