Graceful-exit complete on all sats but one?

Thanks @heunland, I appreciate your fast feedback :slight_smile:

I’ll leave my node online for now. Keep me posted ^^

I’m entirely baffled so far. Everything looks like it’s working, yet it’s clearly been more than 30 days and your node hasn’t been marked as successfully exited yet. It’s also not marked disqualified, suspended, or failed. And it’s definitely still checking in.

Would you mind terribly turning on debug logging on your node? It won’t require a restart, and it’s not much more verbose than info level. Assuming you haven’t changed the configuration for the debug port, you can accomplish this using the following clunky command:

docker exec -it storagenode wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "debug"}' -O - http://localhost:5999/logging

Getting debug logs for only a few minutes ought to be enough to tell whether your node has a running graceful exit worker for the EU1 satellite, and whether that worker is actually trying to contact the satellite’s graceful exit endpoint.
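
Once we have what we need, the same endpoint should accept another PUT to turn the verbosity back down, assuming it takes the other standard levels the same way:

docker exec -it storagenode wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "info"}' -O - http://localhost:5999/logging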

Edit 2023-11-08T22:01:42Z: from what I can see, your node is not trying to contact the graceful exit endpoint. It is checking in every hour as it should, but not asking if it’s done with GE. It has to do so in order for the satellite to say “yes, you’re done” and mark it as complete. I can’t yet tell why it isn’t. My best theory is that something went wrong with the node’s satellite db, and it doesn’t remember that it’s supposed to be gracefully exiting. The debug logs might explain why, or maybe a copy of your satellites.db.
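
If you want to take a look at it yourself first, the sqlite3 CLI on the host should be enough. The path and table name below are only my guess at the usual layout, so adjust them to wherever your storage directory actually lives and to whatever .tables reports:

sqlite3 /path/to/storagenode/storage/satellites.db '.tables'
sqlite3 /path/to/storagenode/storage/satellites.db 'SELECT * FROM satellites;'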

3 Likes

I’m not sure if I did something wrong, but the provided command doesn’t seem to work, although I did not change anything regarding the debug port, AFAIK…

root@raspberrypi:/home/pac# docker exec -it storj_node_3 wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "debug"}' -O - http://localhost:5999/logging
--2023-11-08 23:02:41--  http://localhost:5999/logging
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:5999... failed: Connection refused.
Connecting to localhost (localhost)|::1|:5999... failed: Cannot assign requested address.
Retrying.

--2023-11-08 23:02:42--  (try: 2)  http://localhost:5999/logging
Connecting to localhost (localhost)|127.0.0.1|:5999... failed: Connection refused.
Connecting to localhost (localhost)|::1|:5999... failed: Cannot assign requested address.
Retrying.

...

@thepaul I’ll send you the satellites.db file in a PM though, as I haven’t checked whether it contains sensitive data or not… ^^

Received! Thank you. Everything looks correct in the satellites.db, unfortunately (unfortunately because it would have been a good clue :D). I’m now very curious to see what is going on in your debug logs. Unfortunately, after doing a whole lot of digging, I think that our storagenode docker image has a flaw, where the updater process opens a debug port and the node itself does not. Yours might not have a debug port open at all.

Let’s do just a small amount of extra checking to be sure:

If you try docker exec -it storagenode cat config/config.yaml, you should see a short config file that I think will end with the line

debug.addr: ":5999"

If it is set to a different number, we can retry the wget command with that port, and maybe it will help, or maybe not.

If it is set to 5999, there might be nothing we can do aside from restarting the node. I expect that, at least, will cause it to restart the graceful exit worker and complete the exit. I understand if you don’t want to do that and incur the extra disk load from the filewalker. If you’d prefer, I can just mark your exit as completed manually (because of course it should be).

If you do choose to restart the node, could you run it with LOG_LEVEL=debug?
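
For example, with a typical docker run setup that would look something like this (just a sketch; substitute your real volumes, ports, and other options for the placeholder, and I’m assuming the image picks up LOG_LEVEL from the environment):

docker run -d --name storj_node_3 \
    -e LOG_LEVEL=debug \
    ... your usual volumes, ports, and other options ... \
    storjlabs/storagenode:latest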

3 Likes

Looks like it should be the reverse: the updater shouldn’t use a debug port, and the node should.

2 Likes

Here are all entries related to “debug.” in my config.yaml file:

root@raspberrypi:/home/pac# docker exec -it storj_node_3 cat config/config.yaml | grep "debug\."
# debug.addr: 127.0.0.1:0
# debug.control: false
# debug.profilername: ""
# debug.trace-out: ""

@thepaul I’m okay with trying out a few things to help you investigate. I can restart the node, but just to be sure: what should I do before and after? I suppose something along these lines:

  • Reconfigure config.yaml with debug.* properties? (Please let me know what values to set.)
  • Enable “debug” log level by setting log.level: debug in config.yaml.
  • Restart the node.
  • Try the docker exec -it storj_node_3 wget [...] -O - http://localhost:5999/logging command again, and capture its output for a few minutes.

Does that sound right? (Please let me know what properties I should uncomment and set to what values; my current guess is below.)
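
For reference, here’s what I would try in config.yaml (just my assumption, to be confirmed):

debug.addr: ":5999"
log.level: debug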

Oh interesting. If it has 127.0.0.1:0, then it will have chosen any available port for its debug port; we just have to figure out which one it is. We probably don’t need to restart it after all. I must have done something funny to my own node to have 5999 hardcoded in the yaml.

I think this bunch of awk will suss out the right debug port (or at least give us a few options):

docker exec -i storagenode awk '$4=="0A" {split($2,a,":");print 0+("0x"a[2])}' /proc/net/tcp{,6}

That probably prints out 14002, 7778, 28967, plus one more number, and the “one more number” should be the right port. Then you just plug that into the wget command:

docker exec -it storagenode wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "debug"}' -O - http://localhost:$DEBUGPORT/logging

THEN! (hah.) We should get some useful debug logging.
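
And assuming your node logs to the container’s stdout (the default), grabbing the last few minutes of it is just:

docker logs --since 10m storj_node_3 2>&1 | grep -i debug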

1 Like

My node has these ports:

7778
39245
7777
28967
14002

However, the only debug port is 7777, and it belongs to the updater. Port 39245 does not respond at all, 7778 is a local port for the CLI dashboard (and graceful exit, …), 14002 is the console port, and 28967 is the node’s port.
So I do not believe that debug is enabled on the storagenode without an explicit option.

1 Like

Dang. I’m going to get some sleep (it’s late here) and maybe the problem will be obvious in the morning.

1 Like

I have these ports open:

41817
38069
7778   # CLI Dashboard
34451
14002  # Console
28967  # Node's port

So I tried the command with the 3 remaining ports 41817, 38069 & 34451:

root@raspberrypi:/home/pac# docker exec -it storj_node_3 wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "debug"}' -O - http://localhost:41817/logging
--2023-11-09 09:39:14--  http://localhost:41817/logging
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:41817... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18 [text/plain]
Saving to: 'STDOUT'

-                                                      0%[                                                                                                                      ]       0  --.-KB/s               {"level":"debug"}
-                                                    100%[=====================================================================================================================>]      18  --.-KB/s    in 0s      

2023-11-09 09:39:14 (914 KB/s) - written to stdout [18/18]

root@raspberrypi:/home/pac# docker exec -it storj_node_3 wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "debug"}' -O - http://localhost:38069/logging
--2023-11-09 09:39:22--  http://localhost:38069/logging
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:38069... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18 [text/plain]
Saving to: 'STDOUT'

-                                                      0%[                                                                                                                      ]       0  --.-KB/s               {"level":"debug"}
-                                                    100%[=====================================================================================================================>]      18  --.-KB/s    in 0s      

2023-11-09 09:39:22 (101 KB/s) - written to stdout [18/18]

root@raspberrypi:/home/pac# docker exec -it storj_node_3 wget --method=PUT --header='Content-Type: application/json' --body-data='{"level": "debug"}' -O - http://localhost:34451/logging
--2023-11-09 09:39:28--  http://localhost:34451/logging
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:34451... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18 [text/plain]
Saving to: 'STDOUT'

-                     0%[                    ]       0  --.-KB/s               {"level":"debug"}
-                                                    100%[=====================================================================================================================>]      18  --.-KB/s    in 0s      

2023-11-09 09:39:28 (85.4 KB/s) - written to stdout [18/18]

Doesn’t look like what you’re expecting ^^’

In fact that’s exactly what I was expecting! I don’t know why there are three ports, but they all responded with {"level":"debug"} which means they successfully set the debug level.

Is there any DEBUG logging from your node now?
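
If you want to double-check which level is active, the same endpoint should also answer a plain GET with the current level, assuming it behaves like the standard zap level handler:

docker exec -it storj_node_3 wget -q -O - http://localhost:41817/logging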

After working with @Pac over PM, what we know is:

  • the node was working on deleting pieces after completing its graceful exit from US1 and AP1
  • graceful exit for AP1 completed at 2023-11-05T01:19Z
  • it was deleting AP1 pieces from 2023-11-05T01:19Z to 2023-11-05T13:47Z
  • graceful exit for US1 completed at 2023-11-05T13:51Z
  • it was deleting US1 pieces from 2023-11-05T13:51Z to 2023-11-09T09:05Z
  • graceful exit for EU1 completed at 2023-11-09T09:04Z
  • it has been deleting EU1 pieces since then

So clearly, the node is waiting until cleanup for one satellite is done before allowing the next graceful exit to complete. Which is good and fine, because it should reduce thrashing on the disk. The only problem is, I don’t see where in the code we’ve asked it to wait like that! From what I can see, it’s supposed to allow GE to complete on all the satellites at the same time if that’s what the satellites say, and not worry about cleanup contention. But I guess it’s not worth a lot more investigation right now.

TL;DR: everything was fine and we just needed to wait for cleanup of another graceful exit to complete.
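
For anyone who wants to reconstruct a similar timeline on their own node: assuming the node logs to the container’s stdout and that the relevant lines mention graceful exit (the exact wording may vary between versions), something like this should surface them:

docker logs storagenode 2>&1 | grep -i graceful | less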

3 Likes

Thank you @thepaul & @heunland & @Alexey for your fast answers and detailed investigations.
It’s nice to see you guys are still carefully looking after your software.

Well, I’d say that even though this behavior may be good, it would be a pretty good reason to display the progress (or at least the status) on nodes’ dashboards, so we’re aware when the graceful exit is actually finished, as suggested there:

Now that we know a node can take more than 30 days to exit (on its side at least?), I’m wondering if @arrogantrabbit’s comment there is still valid:

Could a process be stalling because a disk is too busy cleaning up some files? Most of my disks were SMR, which caused incredibly long operations sometimes (the filewalker could take more than 36 hours to complete!).

After the 30-day period, the exit would have either succeeded or failed, depending on your node’s uptime during that time and nothing else:

I mean, yes in theory, but this thread is all about one of my nodes that did not finish on time: it took an extra 4+ days to fully finish its graceful-exit.

So it might not be as simple as waiting for 30 days to shut down a graceful-exiting node.

But maybe on the satellites’ side it’s all done after 30 days, and nodes can safely be shut down at that point even though they don’t report the exit as finished… I thought it was worth investigating, though.

That is my understanding, yes: as long as you started the exit and kept your node online, after 30 days it’s an automatic success.

Alright, but I’d still be interested to have a Storjling’s view on this, because my node still had activity and egress with the “late” satellite (EU1), which makes me think that satellite was still using my node although the graceful-exit procedure was supposed to be completed.

Activity on EU1:

Activity on another satellite that finished on time (Saltlake):

1 Like

You need to confirm the progress for sure

and you may shut down the node when you see 100% or disqualified for all satellites.
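
For a Docker node, the check looks something like this (treat it as a sketch; the binary path inside the container and the config/identity directories can differ between image versions and setups):

docker exec -it storagenode /app/storagenode exit-status --config-dir /app/config --identity-dir /app/identity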

2 Likes

It’s important to note that if you leave your node running after it has been disqualified or gracefully exited, the network will keep on using your node opportunistically. You just won’t be paid for it!

2 Likes

Yeah, it’s not entirely true, apparently, until we get this patiently-exiting-node problem fixed.

3 Likes