1.16.1 Dashboard Problem

Hello,
I guess I made a big mistake!
Yesterday I asked when to update. I just updated one of my nodes to 1.16.1 and it looked good so far.

This morning I updated all my other nodes.
Now I checked my dashboards and the node I updated first is not responding! The dashboard shows offline and no other info. The CLI script is not returning anything at all!

Stopping and removing also does not work!

Great Update!

So I forced a reboot and now the node shows "online". I think it is only a question of time until the other nodes run into problems.

Is there any tool to check whether the nodes are "online", other than checking the dashboard from time to time? UptimeRobot reports online either way, as the port is reachable…
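
One possible approach, as a minimal sketch: since a hung node still accepts TCP connections, the check has to wait for an actual HTTP answer from the dashboard API, with a timeout. The function name check_node, the 10-second default, and the assumption that the dashboard port 14002 is reachable from the host are illustrative and not part of the Storj tooling.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog helper (not part of the Storj tooling).
# Assumption: the dashboard API is reachable from the host, e.g. at
# http://localhost:14002/api/sno.

# check_node URL [TIMEOUT_SECONDS]
# Prints "online" and returns 0 if URL answers with an HTTP success status
# within the timeout; otherwise prints "unresponsive" and returns 1.
check_node() {
  local url="$1" timeout="${2:-10}"
  if curl --silent --fail --max-time "$timeout" --output /dev/null "$url"; then
    echo "online"
  else
    echo "unresponsive"
    return 1
  fi
}

# Example (run from cron every few minutes):
#   check_node http://localhost:14002/api/sno || logger "storagenode dashboard not answering"
```

Unlike a plain port check, this only reports "online" when the node process actually answers the request.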

And one more thing:
After the update the audit script reports this:
./successrate.sh storagenode
========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== DOWNLOAD ===========
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 1
Success Rate: 100.000%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 0
Fail Rate: 0.000%
Canceled: 16
Cancel Rate: 8.649%
Successful: 169
Success Rate: 91.351%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 29
Cancel Rate: 43.284%
Successful: 38
Success Rate: 56.716%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 1
Success Rate: 100.000%

While it had audits before the update…

What was the memory usage when your node stalled?

Well as I restarted already I cannot check now, can I?
Will check next time it happens (guess it will). The node that "crashed" is running on 2 GB RAM.

Another node crashed / unresponsive!
Mem:
       total   used   free   shared   buff/cache   available
Mem:     977    617     93        2          267         381
Swap:      0      0      0
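
To have numbers from before the next stall, one option is to log the available memory periodically. This is a minimal sketch, assuming the `free -m` layout shown above, where "available" is the seventh field on the data row starting with "Mem:". The function name available_mb and the log path are made up for illustration.

```shell
#!/usr/bin/env bash
# available_mb: read `free -m` output on stdin and print the "available" column
# of the "Mem:" data row (field 7 in the layout shown in this thread).
# Rows with fewer than 7 fields are ignored.
available_mb() {
  awk '$1 == "Mem:" && NF >= 7 { print $7 }'
}

# Example: append a timestamped sample once a minute (e.g. from cron):
#   echo "$(date -Is) $(free -m | available_mb) MB available" >> "$HOME/node-mem.log"
```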

This is another node. Updated this morning to 1.16.1
All commands related to the node return nothing!

This time it "crashed" when running:
for sat in $(docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq -r '.satellites[].id'); do docker exec -i storagenode wget -qO - "localhost:14002/api/sno/satellite/$sat" | jq .id,.audit; done

@Alexey Why does running the following return no audits after a restart?
docker stop -t 300 storagenode
docker rm storagenode

Also @Alexey another node just restarted itself for no apparent reason.

5 min later here we go again. Best update ever!

@Alexey
Now I am getting this and the node will not start at all!

2020-11-12T10:29:42.331Z ERROR piecestore failed to add bandwidth usage {"error": "bandwidthdb error: database disk image is malformed", "errorVerbose": "bandwidthdb error: database disk image is malformed\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).getSummary:171\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Summary:113\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:52\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:683\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:413\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:996\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Heeeeeeellllllppppp!!!

This did the trick:

sqlite3 bandwidth.db
CREATE TABLE bandwidth_usage (
satellite_id BLOB NOT NULL,
action INTEGER NOT NULL,
amount BIGINT NOT NULL,
created_at TIMESTAMP NOT NULL
);
CREATE TABLE bandwidth_usage_rollups (
interval_start TIMESTAMP NOT NULL,
satellite_id BLOB NOT NULL,
action INTEGER NOT NULL,
amount BIGINT NOT NULL,
PRIMARY KEY ( interval_start, satellite_id, action )
);
CREATE INDEX idx_bandwidth_usage_satellite ON bandwidth_usage(satellite_id);
CREATE INDEX idx_bandwidth_usage_created ON bandwidth_usage(created_at);
.exit
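
Before recreating tables, it may be worth confirming which database is actually damaged. This is a minimal sketch using SQLite's built-in `PRAGMA integrity_check`, which prints "ok" for a healthy file. The helper name db_is_healthy is hypothetical, and the node should be stopped while the databases are inspected.

```shell
#!/usr/bin/env bash
# db_is_healthy FILE
# Returns 0 only if FILE exists and `PRAGMA integrity_check` reports exactly "ok".
# Requires the sqlite3 command-line tool; any error counts as "not healthy".
db_is_healthy() {
  local db="$1" result
  [ -f "$db" ] || return 1          # avoid creating an empty database by accident
  result="$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>/dev/null)" || return 1
  [ "$result" = "ok" ]
}

# Example: check every database in the current directory.
#   for db in *.db; do db_is_healthy "$db" || echo "damaged: $db"; done
```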

Question: What about the vetting? Why is it set to 0 after my update? Using: ./successrate.sh storagenode to check on it…

docker rm also removes the Docker logs for the container, so your local log files get "reset". Your Storj data is fine because it is stored in a separate volume outside of the Docker container.

The crash could be caused by an out-of-memory error from the file walker at startup. You can run on 2 GB, but it can get tight, especially if your hardware is slower. Can you provide more details on your setup?

  1. What model of drives are you using (are they SMR)?
  2. How many nodes are you running on the device?
  3. What type of system is hosting this (Pi, netbook, etc.)?
  4. Is this device being used for anything else?

Thanks for responding!

This node runs on 1 GB of RAM and is a little VM in my basement: one core, 1 GB. My other nodes have 2 GB. The drives for this one are in my NAS, so they are network drives.

I guess the problem was that I ran sudo reboot without unmounting the drives first.

i would be careful about using the sudo reboot command, it tends to just … well, reboot
without shutting anything down at all… i'm kinda new to linux, and tho my system hasn't been hurt by this command, that's more likely due to my setup having PLP storage and all written data being forced to sync, basically making the system mostly unaffected by cuts in power or similar…

ofc it loses power… so not totally unaffected, but i don't really have a need for a UPS, so i did a more affordable version that would have basically the same end result.

but yeah don't use sudo reboot, use sudo shutdown so that everything gets shut down correctly…

then everything will boot up on its own on startup, which also means in case of a power outage… if you have to tinker with the device to make it boot or reboot, then it's just less likely it will recover from whatever problems it may encounter

using sudo shutdown you can even see in the storagenode logs that the storagenode gets the termination signal from the OS and shuts down, and the OS will even wait for the storagenode / docker to shut down if need be…

so really you shouldn't have to use anything but the run command when you start the node and then sudo shutdown to stop the system… the node shuts down and when you boot the node simply starts back up because the run command contains the start unless stopped parameter… i forget how it's defined in the run command tho… something like that… you get the idea.
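
The parameter meant here is presumably Docker's restart policy, `--restart unless-stopped`, passed to `docker run`. A minimal sketch for checking or changing it on an existing container follows. The function names are made up for illustration, while `docker inspect --format` and `docker update --restart` are standard Docker CLI commands.

```shell
#!/usr/bin/env bash
# show_restart_policy CONTAINER: print the container's restart policy name
# (e.g. "no", "always", "unless-stopped", "on-failure").
show_restart_policy() {
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}

# set_unless_stopped CONTAINER: make the container come back after a reboot
# unless it was stopped by hand.
set_unless_stopped() {
  docker update --restart unless-stopped "$1"
}

# Example:
#   show_restart_policy storagenode
#   set_unless_stopped storagenode
```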

network drives are not recommended, it might work… but usually it won't go well long term.

and yeah REALLY REALLY don't use sudo reboot if your system isn't protected against random power loss, and even then you shouldn't use it for convenience, even tho you in theory could… it's a bad habit, and it has a very good chance of causing damage or loss of data in the system… sure not much… but enough to make it act all weird.

What a nice surprise when you have just updated and now the saltlake satellite is reporting:
saltlake.tardigrade.io:7777
Online: 75 %

Niiiiceeee!

Hello? @Alexey
Moooore characters…

Can you check that your server load looks okay?
And make sure your disk isn't about to fail?

If your setup cannot keep up (I had this issue with an SMR drive), the node software keeps stacking requests in RAM until the OOM killer shuts down the node… Dunno if it could be a similar issue in your case, I have no experience with network drives.
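
Whether the OOM killer was involved can usually be seen in the kernel log. This is a minimal sketch that counts matching lines. The helper name count_oom_events is hypothetical, and the pattern is an assumption about typical kernel wording (such as "Out of memory: Killed process ..."), which varies between kernel versions.

```shell
#!/usr/bin/env bash
# count_oom_events: read kernel log text on stdin and print how many lines
# look like OOM-killer activity. The pattern is a heuristic, not an exact format.
count_oom_events() {
  # grep -c prints the count even when it is 0, but then exits non-zero;
  # "|| true" keeps the function's own exit status at 0 in that case.
  grep -ciE 'out of memory|oom-kill|oom_reaper' || true
}

# Example (may need root):
#   dmesg | count_oom_events
#   journalctl -k --no-pager | count_oom_events
```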

and

are incompatible.

Network-attached drives are not supported, even if they can work.
Any network-attached drive will always have a higher latency than a locally connected one.
If this setup is working for you, you should provide 2-4 times as much RAM as you would with locally connected drives (because the storagenode will cache more data in RAM when the storage is slow).
However, you will have other problems as well, and the corrupted database is the smallest of them:
https://forum.storj.io/tag/nfs
https://forum.storj.io/tag/smb
https://forum.storj.io/tag/iscsi

Please try to avoid any network-attached drives; even external USB is not reliable enough: Search results for 'external USB error' - Storj Community Forum (official)

Thanks! Will look into it. Maybe I can connect the NAS directly to the node for starters.

Where is the 75% online score coming from, and how can I improve it? Just wait?

You are welcome!
If your NAS supports Docker, you can run the storagenode directly on your NAS.

The online score measures how much of the time your node was online. At 60% your node will be suspended until you resolve the issue (you will have 7 days for that); after that your node will be under review for a month. If it gets suspended again, it will be disqualified.
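
As a small illustration of the threshold, here is a hedged sketch that classifies an online score. The function name online_status is hypothetical, and treating scores strictly below 60% as the risky range is an assumption about how the limit is applied.

```shell
#!/usr/bin/env bash
# online_status SCORE_PERCENT [THRESHOLD_PERCENT]
# Prints "ok" if the score is at or above the threshold, else "suspension risk".
# Assumption: suspension applies below the 60% quoted above; adjust the
# comparison if the actual rule is "at or below".
online_status() {
  local score="$1" threshold="${2:-60}"
  # awk handles decimal scores; exit status 0 means score < threshold.
  if awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 < t + 0) }'; then
    echo "suspension risk"
  else
    echo "ok"
  fi
}

# Example:
#   online_status 75
```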

At the moment this should not be enabled, but it will be soon. However, the measurement is already in place and, as you can see, it is working. The score should recover during the next 30 days online.

I'd like to point out that with the binary you can run a storagenode on pretty much any NAS with a terminal.
Fairly easy to set up as well. I tested it on a cheap NAS I got for some backups and it worked, and it had 512 MB of RAM. So you could probably do it as well.

I would add we even have at least one storagenode working on a router:

lol that's epic…

next up smart watches, toasters, fridges… actually that might make some pretty good promotional stuff… a competition for who can run a storagenode on the most unexpected device. or something lol
