Nodes offline for 3-4 days - is it possible to recover?

I had no choice but to take my nodes offline while I ironed out some hardware issues.

I’ve tried to bring them all back online, and I’ve been having a lot of issues: things hanging, the dashboard websites not loading, Docker becoming unresponsive, etc.

Before I get too much further into the weeds trying to get these things working, should I just give it up? Is being offline for that long unrecoverable?

Thanks folks.

4 days should be OK. It is below all the DQ thresholds.

ok then, time to continue my troubleshooting.

My main issue appears to be this: according to docker logs, the nodes are functioning.

However, the dashboards don’t load, and docker commands against the nodes never complete. Eventually I need to restart the containers.

Where should I start troubleshooting what seems to be more of a Docker issue than a node issue?

I’m running on Mac, btw.

Can you post the output of docker ps -a from the command line? What version of Docker for Mac are you using? I remember that a while back a specific version was recommended, although this may not be related. Was the node working fine before the downtime?

docker ps -a results
(I know, I’m not running Watchtower right now. I’ll put it back soon once everything’s working):

CONTAINER ID   IMAGE                          COMMAND         CREATED       STATUS       PORTS                   NAMES
9cb1f00a9f62   storjlabs/storagenode:latest   "/entrypoint"   4 hours ago   Up 3 hours   >14002/tcp,>28967/tcp   storagenode2
326d3d2e3ad5   storjlabs/storagenode:latest   "/entrypoint"   4 hours ago   Up 3 hours   >14002/tcp,>28967/tcp   storagenode1
b5cb4fdd1898   storjlabs/storagenode:latest   "/entrypoint"   5 hours ago   Up 3 hours   >28967/tcp,>14002/tcp   storagenode3

Docker version:

Client: Docker Engine - Community
 Cloud integration: 1.0.2
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 16:58:31 2020
 OS/Arch:           darwin/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:07:04 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

You could try using one terminal window to watch the logs in real time and another to try loading the web and cli dashboards. For example:

Terminal 1:
docker logs --tail 20 --follow storagenode1

Terminal 2:
docker exec -it storagenode1 /app/

And try navigating to the web dashboard page for that node.
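Since the symptom here is commands that never return, it may help to time-box each check so a hung command fails fast instead of tying up the terminal. The following is only a minimal bash sketch, not a Docker feature: run_with_timeout is a hypothetical helper name, and the 15-second limit in the usage comments is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the background and kills it if it is
# still running after SECONDS. Returns 124 when the command had to be killed,
# otherwise the command's own exit status.
run_with_timeout() {
    local limit=$1
    shift

    # Start the command in the background and remember its PID.
    "$@" &
    local cmd_pid=$!

    # Watchdog: after the limit, send SIGTERM to the command if it still exists.
    ( sleep "$limit"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog_pid=$!

    # Wait for the command and capture its exit status.
    local status=0
    wait "$cmd_pid" || status=$?

    # The command is done, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true

    # 143 = 128 + SIGTERM, i.e. the watchdog fired.
    if [ "$status" -eq 143 ]; then
        return 124
    fi
    return "$status"
}

# Example usage (container name taken from the docker ps output above;
# the 15-second limit is an assumption):
#   run_with_timeout 15 docker ps -a
#   run_with_timeout 15 docker logs --tail 20 storagenode1
```

A return code of 124 means the watchdog had to kill the command. If even docker ps -a times out, that would suggest the Docker daemon itself is stuck rather than one particular node.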

I can view logs via the first command, but neither the web dashboard nor the CLI dashboard ever loads.

Here are some of the recent log lines which I believe indicate that it’s working…

2021-02-10T21:02:43.022Z	INFO	piecestore	download started	{"Piece ID": "WZFIEKJ4YGN6MMXPDSZQS7RNTMAIGKRMOTJ45KKWDUM3H3SFXY2A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-02-10T21:02:59.318Z	INFO	piecestore	download started	{"Piece ID": "Q6S3ERXFDJSDFPIWEZI4JH56WQTGODRPJCSB6W2YH4AAMN6NFXGQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2021-02-10T21:03:33.189Z	INFO	piecestore	download started	{"Piece ID": "O2LBDXQSKLGO7YONW5M5GT4P4VCWCPQSIAXP32R5BC3GZZG3JNTA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-02-10T21:04:37.010Z	INFO	piecestore	download started	{"Piece ID": "IT46WQEP5HWM5IRFVWYKQBTW5K6BDZGNGA2JPLT7R6VWMK2F46NQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}
2021-02-10T21:05:17.795Z	INFO	piecestore	download started	{"Piece ID": "IT46WQEP5HWM5IRFVWYKQBTW5K6BDZGNGA2JPLT7R6VWMK2F46NQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}
2021-02-10T21:06:40.277Z	INFO	piecestore	download started	{"Piece ID": "W3MX7OCKL5QPVX5M4ISJ3E6746P7UQZXVQPL5QF2TFCN4W4JBY4A", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET"}
2021-02-10T21:06:46.135Z	INFO	piecestore	download started	{"Piece ID": "65W3TUDAC36SVCSGXS2YAN6HJPBWE5WISVXKCZ4ZKUMQGZHLXN5Q", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}
2021-02-10T21:06:51.158Z	INFO	piecestore	download started	{"Piece ID": "IB4TL5APGXQQITU5CS772KS7JLPQ3XNWL6VNFNC45BZ47RRBMLHA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-02-10T21:07:39.236Z	INFO	piecestore	download started	{"Piece ID": "E3A3AREBBRMXW3ZYPNDFTOUR2YM7GYB2MPJ2Z7PPWC3PU6UJDWCQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}

Interesting. When I launch the CLI dashboard I get this in the logs:

2021-02-10T21:10:39.123Z        INFO    Configuration loaded    {"Location": "/app/config/config.yaml"}
2021-02-10T21:10:39.168Z        INFO    Identity loaded.        {"Node ID": "removed"}

Have you made sure you are running the latest storagenode version? You can also modify the config.yaml file and change the log level to debug. There might be more info available. You will need to restart the node for this to take effect.

# the minimum log level to log
log.level: debug
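If you would rather script that change than edit the file by hand, something like the following could work. It is a sketch only: set_log_level is a made-up helper, the config path in the usage comment is a placeholder for wherever your node's config.yaml lives on the host, and the 300-second stop timeout is an assumption.

```shell
#!/usr/bin/env bash
# set_log_level CONFIG_FILE LEVEL
# Hypothetical helper: sets an active "log.level:" line in CONFIG_FILE to LEVEL,
# or appends one if no uncommented line exists yet.
set_log_level() {
    local config=$1 level=$2 tmp
    tmp=$(mktemp) || return 1

    if grep -q '^log\.level:' "$config"; then
        # Replace the existing, uncommented setting.
        sed "s/^log\.level:.*/log.level: $level/" "$config" > "$tmp"
    else
        # No active setting found: keep the file as-is and append one.
        cat "$config" > "$tmp"
        printf 'log.level: %s\n' "$level" >> "$tmp"
    fi

    # Write back in place so the file keeps its ownership and permissions.
    cat "$tmp" > "$config" && rm -f "$tmp"
}

# Example usage (path is a placeholder, not a real location):
#   set_log_level /path/to/storagenode/config.yaml debug
#   docker restart -t 300 storagenode1   # restart so the new level takes effect
```

Remember to set the level back once you are done, since debug logging is much noisier.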

Debug logs, good idea. Doing that now; will report back.

OK, turned on debug logs and restarted. Definitely seeing some errors, but again, it does still seem to be working somewhat, except that the dashboards don’t load.

A lot of I/O errors. My next step would be to stop the node and check this disk/filesystem for errors.
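To get a rough feel for how many errors there are compared with normal traffic, the log can be summarized by level. This is a hedged sketch: count_log_levels and show_level are hypothetical helper names, and they assume every log line looks like the ones quoted earlier in this thread (timestamp first, level second). Lines in any other shape, such as stack traces, will be counted under whatever their second word happens to be.

```shell
#!/usr/bin/env bash
# count_log_levels: reads storagenode log lines on stdin and prints one
# "LEVEL COUNT" line per level found in the second whitespace-separated field.
count_log_levels() {
    awk 'NF >= 2 { counts[$2]++ }
         END { for (level in counts) print level, counts[level] }' | sort
}

# show_level LEVEL: prints only the lines whose second field equals LEVEL.
show_level() {
    awk -v level="$1" '$2 == level'
}

# Example usage (docker logs can write to stderr, hence the 2>&1):
#   docker logs storagenode1 2>&1 | count_log_levels
#   docker logs storagenode1 2>&1 | show_level ERROR | tail -n 20
```

If the ERROR lines all name the same path or the same database, that narrows down which disk or file to check first.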

OK, that’s what I was suspecting/worried about.

I ran fsck_hfs already on the 3 drives I’ve got running nodes, and didn’t uncover any issues.

Anything else I can try?

Also, I’m running the drives from a USB3 enclosure, in case that matters.

Here’s the result of scanning the disk that this storage node uses:

% sudo fsck_hfs -fl /dev/disk4s2
** /dev/rdisk4s2 (NO WRITE)
Executing fsck_hfs (version hfs-522.100.5).
** Performing live verification.
** Checking Journaled HFS Plus volume.
The volume name is WD_1TB
** Checking extents overflow file.
** Checking catalog file.
** Checking multi-linked files.
** Checking catalog hierarchy.
** Checking extended attributes file.
** Checking volume bitmap.
** Checking volume information.
** The volume WD_1TB appears to be OK.

Unfortunately, I am running low on ideas. You could try checking your databases for errors (if you haven’t done so already), although I would be surprised if all of your nodes had that same issue. It seems more likely to be a Docker or configuration issue. You could try rolling back your Docker Desktop version to an earlier one.
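For the database check, one possible approach is to run SQLite's built-in integrity check over every .db file in the node's storage directory. The sketch below makes several assumptions: check_databases is a hypothetical helper, the sqlite3 command-line tool is installed, the databases are the *.db files directly inside the directory you pass in, and the path in the usage comment is a placeholder. It is generally safest to stop the node first so nothing is writing to the files while they are checked.

```shell
#!/usr/bin/env bash
# check_databases DIR
# Hypothetical helper: runs "PRAGMA integrity_check;" on every *.db file in DIR.
# Prints one OK/FAIL line per file and returns 1 if any file failed.
check_databases() {
    local dir=$1 db result failed=0

    for db in "$dir"/*.db; do
        # Skip the unexpanded pattern when the directory has no .db files.
        [ -e "$db" ] || continue

        # A healthy database answers with the single word "ok".
        result=$(sqlite3 "$db" "PRAGMA integrity_check;" 2>&1) || true

        if [ "$result" = "ok" ]; then
            echo "OK   $db"
        else
            echo "FAIL $db: $result"
            failed=1
        fi
    done

    return "$failed"
}

# Example usage (path is a placeholder, not a real location):
#   docker stop -t 300 storagenode1
#   check_databases /path/to/storagenode/storage
```

Running it once per node would show whether all three nodes report the same problem, which would be surprising for genuine database corruption.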

You are running Docker. On Windows, there were problems with versions above a certain release.


Well, there was nothing mentioned in the official install docs about which version of Docker to use, so at first, yes, I just installed the latest, which was 3.x something. I rolled it back to the last 2.x version, and then, after searching the forum some more, I found that thread. I’ve since installed the version recommended there, and everything has been smooth sailing for ~24 hours. :slight_smile:

I was migrating my nodes from a Linux box to a Mac, and on the Linux box everything “just worked” (haha), so I assumed that on the Mac I could just install Docker and run with it. Bad assumption! :slight_smile:

Might be a good idea to mention Docker version numbers in the setup documentation. @Alexey, is that something you could look at?

It is mentioned for Windows

Added almost the same for Mac:
