Node having issues after operating for long time without problems

So either you actually lost data, or you used an identity that does not belong to this data, so its own data is missing.
It is possible that when you ran with SETUP=true and a deleted config.yaml, you provided either a wrong path to the data (it must not end with the storage subfolder) or a wrong path to the identity. Or you really lost data.
There is no other way to be disqualified than this:

  1. Your node was online
  2. The node answered audit requests
  3. It provided wrong data (a corrupted piece), returned a “file not found” error, or did not return the piece within the 5-minute timeout, after 3 attempts.

So please check what the reason was in your case.

It’s also possible that you ran your node with the already disqualified identity from the mentioned thread, so you likely have another identity somewhere (we suggest moving the identity to the disk with the data to avoid such confusion), and that other identity may not be disqualified yet.
In the latter case you need to use the procedure with SETUP=true and a renamed config.yaml to configure your node to use the correct identity.
You would also need to copy pieces from the trash back to their original locations (the wrong identity could have moved some unrecognized data there).
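
Roughly, and assuming a docker node with paths like in the docs (the /mnt/storj paths below are placeholders; the data path must not end with the storage subfolder), that procedure would look like this:

# stop and remove the current container first
docker stop -t 300 storagenode
docker rm storagenode

# set the old config aside so the setup step can write a fresh one
mv /mnt/storj/config.yaml /mnt/storj/config.yaml.bak

# one-time setup run, pointing at the CORRECT identity and data locations
docker run --rm -e SETUP=true \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest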

Hello @Alexey,
thanks for pointing out possible reasons that could have caused this. Well, I checked whether there is some mismatch in config.yaml, but the identity definitely was the correct one, as I only have that one node on this machine, and the storage volume was still intact after recreating the container that went missing after the volume resize for docker.

Maybe it was stupid to run with SETUP=true; it was kind of trying what’s possible, and somewhere in the storj forums this was pointed out - good to know now that it’s a bad idea at all.

I’ll think about next steps if the node identity is really burned: either quit completely or maybe generate a new identity. Don’t know yet. I still think the STORJ node software needs better code-hardening to assist an operator with problems like these, which currently require a lot of manual research and workarounds to resolve and cost a lot of time.

If I had known beforehand that my node would be disqualified, I wouldn’t have spent so much time on this and would either have quit or generated a new identity… but then at least I wouldn’t have wasted hours trying to do my best as an operator to keep the current one up.

In situations like this I keep in mind that the node is part of a distributed storage network, and simply flushing the storage and losing all data feels wrong to me! But with this result I will definitely not spend so much time on fixing things again and will just do whatever is simplest without wasting time.

I think the STORJ developers should consider whether that is the result they want from such issues. To me, a distributed storage network should be stable and not push operators towards flushing data. If too many operators do so, the network/distributed storage may become faulty, causing unrecoverable data loss in the worst case.

Just a small result from digging into this after it happened. The config.yaml is fine. I had just increased this attribute to 560 GB (after resizing the storj-storage volume as well), hoping to “trick out” the piececake monitor, which reported a wrong size for some reason (the storj-storage filesystem and storage were untouched the whole time until I tried that).

# was 550 GB before
storage.allocated-disk-space: 560 GB

In that case please search your logs for errors related to GET_AUDIT or GET_REPAIR; I think you will find “file not found” errors, which would mean your node lost data.
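
For example, assuming a docker node with a container named storagenode (adjust the name to yours; the exact error wording can differ slightly):

# list failed audit and repair downloads from the node's log
docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -i "failed"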
Maybe it’s related to how you bind volumes in combination with SETUP=true: if the disk is missing, a bind of the form path/to/data:/app/config will create an empty directory instead of failing with an error. The subsequent SETUP=true then just confirmed to storagenode that it should use this empty volume instead of failing.
The same is true for a path to a mount point: if the mount did not happen, the node will use the provided empty mount point to store data (because SETUP=true was invoked). Without SETUP=true, storagenode will fail if data appears to be missing.
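
That difference is easy to see with docker itself; a small sketch using the stock alpine image and a deliberately missing path (both are just placeholders for the demonstration):

# --mount syntax: docker refuses to start when the source path does not exist
# ("bind source path does not exist"), which is the failure you want if a disk did not mount
docker run --rm --mount type=bind,source=/mnt/not-mounted,destination=/data alpine ls /data

# short -v syntax: docker silently creates an empty /mnt/not-mounted on the host and starts anyway
docker run --rm -v /mnt/not-mounted:/data alpine ls /data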

Well, thank you anyway… even if you are sure the audit failure was caused by file loss.

You said SETUP=true may have caused it, so my question back is: why doesn’t the STORJ node refuse to do that if it obviously causes problems? Why is there even such a flag if the STORJ node could figure out on its own whether it is initialized or not?

I’m also a programmer, and software should be “idiot-proof”. One should not give users (or operators) flags or arguments that can only lead to stupid outcomes if the software can solve it without them.

So my conclusion:
I just dropped the lvm volume for the storj node storage and quit. As said… I’ve already wasted hours trying to fix this, and being punished with an unrecoverable node now leads me to the only right thing… not wasting any more time.

We specifically separated the setup procedure to avoid such issues. There is a precaution - it will fail if a config.yaml already exists. But you removed that too, so there was no way left to check whether this was a valid invocation or the user simply made a mistake.
If you had not re-run the setup step with wrong paths, the node would either fail, because storage-dir-verification is missing or contains a wrong NodeID, or continue to work, if this file exists and contains the correct NodeID.
It also fails if the storage dir is not writeable or not readable.
This file is created only during the setup step. This is why we ask to never run setup on a working node.
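
A minimal check you can run yourself after a reboot or re-mount, before starting the container (the path is only an example; the file lives in the storage subfolder of your data location):

# written only by the setup step; if the disk did not mount, this file is missing
# and the node should not be started against that path
ls -l /mnt/storj/storage/storage-dir-verification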

@Alexey if STORJ can identify/punish nodes with a misconfiguration and a wrong node ID, it could also prevent such nodes from going into an online state instead of enforcing such a punishment.

Nice to know there are at least some steps trying to avoid such hassle. But to be honest - if STORJ can punish, it can also prevent operators from running into such issues!

Well… doesn’t matter. I gave STORJ one last chance with a newly created identity, but obviously not only my identity but my IP is burned too?! It now shows as misconfigured with a wrong node ID. So the quit seems to be final.

Hope you and your team are happy with this result, causing operators to quit frustrated. That’s not how community works.

There is no way to detect a wrong identity or data location if you overwrite it with the setup command.

The IP is not tied to the identity, so it seems you did not generate a new identity and tried to run the node with the old one.
If you want to start from scratch, you need to generate a new identity, sign it with a new authorization token and run your node with clean storage and this new identity.
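
A sketch of that sequence, assuming the identity binary from the docs and its default output location on Linux (e-mail and token are placeholders):

# create a brand new identity (this can take quite a while)
identity create storagenode

# sign it with a NEW authorization token
identity authorize storagenode user@example.com:authorizationtoken

# sanity check from the docs: ca.cert should contain 2 certificates, identity.cert 3
grep -c BEGIN ~/.local/share/storj/identity/storagenode/ca.cert        # expect: 2
grep -c BEGIN ~/.local/share/storj/identity/storagenode/identity.cert  # expect: 3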
Please do not use the binding in the form you have used before, use this one:
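
For reference, a sketch of the --mount style run command from the docs (all values below are placeholders, other options omitted):

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
    -e WALLET="0x0000000000000000000000000000000000000000" \
    -e EMAIL="user@example.com" \
    -e ADDRESS="my-node-hostname:28967" \
    -e STORAGE="550GB" \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest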

Exactly what I did! Even more… I deleted the lvm volume, created a new one for storage, created a new identity & authorized it, deleted the storj container and re-created everything from scratch.

Could you please elaborate, what is the issue?

WARN contact:service failed PingMe request to satellite {“Process”: “storagenode”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “error”: “ping satellite: check-in network: failed to ping node (ID: 1ZbQyH865u9q4U5Ey5zEcwYTA2dKHkoozHzM39BLKtjW3gQtv6) at address: my-node-hostname:28967, err: contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity”, “errorVerbose”: “ping satellite: check-in network: failed to ping node (ID: 1ZbQyH865u9q4U5Ey5zEcwYTA2dKHkoozHzM39BLKtjW3gQtv6) at address: my-node-hostname:28967, err: contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity\n\tstorj.io/storj/storagenode/contact.(*Service).requestPingMeOnce:194\n\tstorj.io/storj/storagenode/contact.(*Service).RequestPingMeQUIC:167\n\tstorj.io/storj/storagenode.(*Peer).addConsoleService:868\n\tstorj.io/storj/storagenode.(*Peer).Run:907\n\tmain.cmdRun:251\n\tstorj.io/private/process.cleanup.func1.4:377\n\tstorj.io/private/process.cleanup.func1:395\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:92\n\tmain.main:478\n\truntime.main:255”}

Nothing changed at the firewall, and trying to connect using “nc -vz my-node-hostname:28967” works from my local machine!

If you ping my-node-hostname from the local PC, does it show a NAT IP (10.x.x.x or 192.168.x.x) or an external public IP? my-node-hostname should resolve to a public IP from anywhere on the internet.

I’ve replaced the original hostname with this stub-name ‘my-node-hostname’

I understand. You also wouldn’t be the first person to use an internally accessible hostname/IP for what should be a publicly accessible node.

Please also make sure that you have TCP and UDP port mappings in your docker-compose.yaml file, i.e.:

services:
  storagenode2:
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp
      - 14002:14002

I’ve completely removed docker, reinstalled docker and everything else to make sure there’s nothing screwed up.

Reason:
nc -vz my-public-ip:28967
nc: connect to 91.210.xxx.xxx port 28967 (tcp) failed: Connection refused

No matter whether from the docker host or from my workstation!

But …
nc -vz 127.0.0.1 28967
Connection to 127.0.0.1 28967 port [tcp/*] succeeded!

ufw/iptables is configured to forward port 28967 udp/tcp

# iptables -L | grep 28967           
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:28967
ACCEPT     udp  --  anywhere             anywhere             udp dpt:28967

I’m totally confused … what the hell is going on here?

Other services running on docker work fine, for example portainer, nginx-ingress, …
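
A few host-side checks that can separate “docker published the port” from “something inside the container is actually listening” (the container name storagenode is an assumption):

docker port storagenode                    # published port mappings of the container
ss -tlnp | grep 28967                      # what is listening on the host side
iptables -t nat -nL DOCKER | grep 28967    # docker's port publishing uses DNAT rules in the nat table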

In fact… something is screwed up for some reason. By the way, this time I also cleaned /var/lib/docker and started totally from scratch.

hmm… I just installed netstat inside the storj container and checked. There’s no port 28967 anymore inside the storj-node container!!!

What the hell… is the port that should be mapped externally to 28967 now 7778??!

# netstat -pant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.11:39891        0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:37149         0.0.0.0:*               LISTEN      7/storagenode       
tcp        0      0 127.0.0.1:7778          0.0.0.0:*               LISTEN      7/storagenode       
...
tcp6       0      0 :::7777                 :::*                    LISTEN      7/storagenode       
tcp6       0      0 :::14002                :::*                    LISTEN      7/storagenode  
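
Those two listeners match storagenode’s built-in defaults for server.address (:7777) and server.private-address (127.0.0.1:7778), which would be consistent with the node running without the config.yaml the setup step normally generates. A quick way to check what the config actually sets (the path is an example):

# check which listen addresses are configured, if any
grep -E "server\.(address|private-address)" /mnt/storj/config.yaml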

Okay… I don’t know why, but in my docker-compose.yaml the SETUP=“true” env in services.storj.environment did not trigger setup to create a new config.yaml for my new identity …

I ran ./storagenode setup --config-dir ./config/ --identity-dir ./identity/
and then it started working with the new identity …

really crappy

Is your node online?

By the way, to make this command work from the same network, your router has to support hairpin NAT (which allows connections to its own external interface from within the local network).
But in this case, your storagenode was most likely not running, so the port was closed.

Hi @Alexey, yes, the fresh node is now online (and still is). For some reason the initial setup (using the SETUP=“true” env) didn’t fire, but executing it manually helped.

By the way, I just noticed I published the IP. I’m not a fan of doing that; could you do me a favor and mask the IP like this: “91.210.xxx.xxx”? I already did that in my original post you quoted :slight_smile: