Node having issues after operating for long time without problems

So either you actually lost data, or you used an identity that does not belong to this data, so its own data is missing.
It is possible that when you ran with SETUP=true and a deleted config.yaml, you provided either a wrong path to the data (it must not end with the storage subfolder) or a wrong path to the identity. Or you really lost data.
There is no other way to be disqualified than this:

  1. Your node was online
  2. The node answered audit requests
  3. It provided wrong data (a corrupted piece), returned a “file not found” error, or did not return the piece within the 5-minute timeout, after 3 attempts.

So please check what the reason was in your case.

It’s also possible that you ran your node with the already disqualified identity from the mentioned thread, so you likely have another identity somewhere (we suggest moving the identity to the disk with the data to avoid such confusion), and that other identity may not be disqualified yet.
In the latter case you need to use the procedure with SETUP=true and a renamed config.yaml to configure your node to use the correct identity.
You would also need to copy pieces from the trash back to their original locations (the wrong identity could have moved some unrecognized data there).
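
Roughly, and assuming a docker node with paths like in the docs (the /mnt/storj paths below are placeholders; the data path must not end with the storage subfolder), that procedure would look like this:

# stop and remove the current container first
docker stop -t 300 storagenode
docker rm storagenode

# set the old config aside so the setup step can write a fresh one
mv /mnt/storj/config.yaml /mnt/storj/config.yaml.bak

# one-time setup run, pointing at the CORRECT identity and data locations
docker run --rm -e SETUP=true \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest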

Hello @Alexey,
thanks for pointing out possible reasons that could have caused this. Well, I checked whether there is some mismatch in config.yaml, but the identity definitely was the correct one, as I only have that one node on this machine, and the storage volume was still intact after recreating the container that went missing after the volume resize for docker.

Maybe it was stupid to run with SETUP=true; it was kind of trying what’s possible, and somewhere in the storj forums this was pointed out - good to know now that it’s a bad idea at all.

I’ll think about next steps if the node identity is really burned: either quit completely or maybe generate a new identity. Don’t know yet. I still think the STORJ node software needs better code-hardening to assist an operator with problems like these, which currently require a lot of manual research and workarounds to resolve and cost a lot of time.

If I had known beforehand that my node would be disqualified, I wouldn’t have spent so much time on this and would either have quit or generated a new identity… but then at least I wouldn’t have wasted hours trying to do my best as an operator to keep the current one up.

In situations like this I keep in mind that the node is part of a distributed storage network, and simply flushing the storage and losing all data feels wrong to me! But with this result I will definitely not spend so much time on fixing things again and will just do whatever is simplest without wasting time.

I think the STORJ developers should consider whether that is the result they want from such issues. To me, a distributed storage network should be stable and not push operators towards flushing data. If too many operators do so, the network/distributed storage may become faulty, causing unrecoverable data loss in the worst case.

Just a small result from digging into this after it happened. The config.yaml is fine. I had just increased this attribute to 560 GB (after resizing the storj-storage volume as well), hoping to “trick out” the piececake monitor, which reported a wrong size for some reason (the storj-storage filesystem and storage were untouched the whole time until I tried that).

# was 550 GB before
storage.allocated-disk-space: 560 GB

In that case please search your logs for errors related to GET_AUDIT or GET_REPAIR; I think you will find “file not found” errors, which would mean your node lost data.
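
For example, assuming a docker node with a container named storagenode (adjust the name to yours; the exact error wording can differ slightly):

# list failed audit and repair downloads from the node's log
docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -i "failed"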
Maybe it’s related to how you bind volumes in combination with SETUP=true: if the disk is missing, a bind of the form path/to/data:/app/config will create an empty directory instead of failing with an error. The subsequent SETUP=true then just confirmed to storagenode that it should use this empty volume instead of failing.
The same is true for a path to a mount point: if the mount did not happen, the node will use the provided empty mount point to store data (because SETUP=true was invoked). Without SETUP=true, storagenode will fail if data appears to be missing.
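
That difference is easy to see with docker itself; a small sketch using the stock alpine image and a deliberately missing path (both are just placeholders for the demonstration):

# --mount syntax: docker refuses to start when the source path does not exist
# ("bind source path does not exist"), which is the failure you want if a disk did not mount
docker run --rm --mount type=bind,source=/mnt/not-mounted,destination=/data alpine ls /data

# short -v syntax: docker silently creates an empty /mnt/not-mounted on the host and starts anyway
docker run --rm -v /mnt/not-mounted:/data alpine ls /data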

Well, thank you anyway… even if you are sure the audit failure was caused by file loss.

You said SETUP=true may have caused it, so my question back is: why doesn’t the STORJ node refuse to do that if it obviously causes problems? Why is there even such a flag if the STORJ node could figure out on its own whether it is initialized or not?

I’m also a programmer, and software should be “idiot-proof”. One should not give users (or operators) flags or arguments that can only lead to stupid outcomes if the software can solve it without them.

So my conclusion:
I just dropped the lvm volume for the storj node storage and quit. As said… I’ve already wasted hours trying to fix this, and being punished with an unrecoverable node now leads me to the only right thing… not wasting any more time.

We specifically separated the setup procedure to avoid such issues. There is a precaution - it will fail if a config.yaml already exists. But you removed that too, so there was no way left to check whether this was a valid invocation or the user simply made a mistake.
If you had not re-run the setup step with wrong paths, the node would either fail, because storage-dir-verification is missing or contains a wrong NodeID, or continue to work, if this file exists and contains the correct NodeID.
It also fails if the storage dir is not writeable or not readable.
This file is created only during the setup step. This is why we ask to never run setup on a working node.
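
A minimal check you can run yourself after a reboot or re-mount, before starting the container (the path is only an example; the file lives in the storage subfolder of your data location):

# written only by the setup step; if the disk did not mount, this file is missing
# and the node should not be started against that path
ls -l /mnt/storj/storage/storage-dir-verification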

@Alexey if STORJ can identify/punish nodes with a misconfiguration and a wrong node ID, it could also prevent such nodes from going into an online state instead of enforcing such a punishment.

Nice to know there are at least some steps trying to avoid such hassle. But to be honest - if STORJ can punish, it can also prevent operators from running into such issues!

Well… doesn’t matter. I gave STORJ one last chance with a newly created identity, but obviously not only my identity but my IP is burned too?! It now shows as misconfigured with a wrong node ID. So the quit seems to be final.

Hope you and your team are happy with this result, causing operators to quit frustrated. That’s not how community works.

There is no way to detect a wrong identity or data location if you overwrite it with the setup command.

The IP is not tied to the identity, so it seems you did not generate a new identity and tried to run the node with the old one.
If you want to start from scratch, you need to generate a new identity, sign it with a new authorization token and run your node with clean storage and this new identity.
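
A sketch of that sequence, assuming the identity binary from the docs and its default output location on Linux (e-mail and token are placeholders):

# create a brand new identity (this can take quite a while)
identity create storagenode

# sign it with a NEW authorization token
identity authorize storagenode user@example.com:authorizationtoken

# sanity check from the docs: ca.cert should contain 2 certificates, identity.cert 3
grep -c BEGIN ~/.local/share/storj/identity/storagenode/ca.cert        # expect: 2
grep -c BEGIN ~/.local/share/storj/identity/storagenode/identity.cert  # expect: 3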
Please do not use the binding in the form you have used before, use this one:
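
For reference, a sketch of the --mount style run command from the docs (all values below are placeholders, other options omitted):

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
    -e WALLET="0x0000000000000000000000000000000000000000" \
    -e EMAIL="user@example.com" \
    -e ADDRESS="my-node-hostname:28967" \
    -e STORAGE="550GB" \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest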

Exactly what I did! Even more… I deleted the lvm volume, created a new one for storage, created a new identity & authorized it, deleted the storj container and re-created everything from scratch.

Could you please elaborate, what is the issue?

WARN contact:service failed PingMe request to satellite {“Process”: “storagenode”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “error”: “ping satellite: check-in network: failed to ping node (ID: 1ZbQyH865u9q4U5Ey5zEcwYTA2dKHkoozHzM39BLKtjW3gQtv6) at address: my-node-hostname:28967, err: contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity”, “errorVerbose”: “ping satellite: check-in network: failed to ping node (ID: 1ZbQyH865u9q4U5Ey5zEcwYTA2dKHkoozHzM39BLKtjW3gQtv6) at address: my-node-hostname:28967, err: contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity\n\tstorj.io/storj/storagenode/contact.(*Service).requestPingMeOnce:194\n\tstorj.io/storj/storagenode/contact.(*Service).RequestPingMeQUIC:167\n\tstorj.io/storj/storagenode.(*Peer).addConsoleService:868\n\tstorj.io/storj/storagenode.(*Peer).Run:907\n\tmain.cmdRun:251\n\tstorj.io/private/process.cleanup.func1.4:377\n\tstorj.io/private/process.cleanup.func1:395\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:92\n\tmain.main:478\n\truntime.main:255”}

Nothing changed at the firewall, and trying to connect using “nc -vz my-node-hostname:28967” works from my local machine!

If you ping my-node-hostname from the local PC, does it show a NAT IP (10.x.x.x or 192.168.x.x) or an external public IP? my-node-hostname should resolve to a public IP from anywhere on the internet.

I’ve replaced the original hostname with this stub-name ‘my-node-hostname’

I understand. You also wouldn’t be the first person to use an internally accessible hostname/IP for what should be a publicly accessible node.

Please also make sure that you have TCP and UDP port mappings in your docker-compose.yaml file, i.e.:

services:
  storagenode2:
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp
      - 14002:14002

I’ve completely removed docker, reinstalled docker and everything else to make sure there’s nothing screwed up.

Reason:
nc -vz my-public-ip:28967
nc: connect to 91.210.xxx.xxx port 28967 (tcp) failed: Connection refused

No matter whether from the docker host or from my workstation!

But …
nc -vz 127.0.0.1 28967
Connection to 127.0.0.1 28967 port [tcp/*] succeeded!

ufw/iptables is configured to forward port 28967 udp/tcp

# iptables -L | grep 28967           
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:28967
ACCEPT     udp  --  anywhere             anywhere             udp dpt:28967

I’m totally confused … what the hell is going on here?

Other services running on docker work fine, for example portainer, nginx-ingress, …
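
A few host-side checks that can separate “docker published the port” from “something inside the container is actually listening” (the container name storagenode is an assumption):

docker port storagenode                    # published port mappings of the container
ss -tlnp | grep 28967                      # what is listening on the host side
iptables -t nat -nL DOCKER | grep 28967    # docker's port publishing uses DNAT rules in the nat table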

In fact… something is screwed up for some reason. By the way, this time I also cleaned /var/lib/docker and started totally from scratch.

hmm… I just installed netstat inside the storj container and checked. There’s no port 28967 anymore inside the storj-node container!!!

What the hell… is the port that should be mapped externally to 28967 now 7778??!

# netstat -pant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.11:39891        0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:37149         0.0.0.0:*               LISTEN      7/storagenode       
tcp        0      0 127.0.0.1:7778          0.0.0.0:*               LISTEN      7/storagenode       
...
tcp6       0      0 :::7777                 :::*                    LISTEN      7/storagenode       
tcp6       0      0 :::14002                :::*                    LISTEN      7/storagenode  
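
Those two listeners match storagenode’s built-in defaults for server.address (:7777) and server.private-address (127.0.0.1:7778), which would be consistent with the node running without the config.yaml the setup step normally generates. A quick way to check what the config actually sets (the path is an example):

# check which listen addresses are configured, if any
grep -E "server\.(address|private-address)" /mnt/storj/config.yaml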

Okay… I don’t know why, but in my docker-compose.yaml the SETUP=“true” env in services.storj.environment did not trigger setup to create a new config.yaml for my new identity …

I ran ./storagenode setup --config-dir ./config/ --identity-dir ./identity/
and then it started working with the new identity …

really crappy

Is your node online?

By the way, to make this command work from the same network, your router has to support hairpin NAT (which allows connections to its own external interface from within the local network).
But in this case, your storagenode was most likely not running, so the port was closed.

Hi @Alexey, yes, the fresh node is now online (and still is). For some reason the initial setup (using the SETUP=“true” env) didn’t fire, but executing it manually helped.

By the way, I just noticed I published the IP. I’m not a fan of doing that; could you do me a favor and mask the IP like this: “91.210.xxx.xxx”? I already did that in my original post you quoted :slight_smile: