I started a graceful exit on one of my nodes on the 5th of October.
At that point, the new graceful-exit system was not live yet; it went live roughly 10 to 15 days later.
Today, 33 days after the start of the procedure, all satellites are marked as completed except one (eu1.storj.io):
root@raspberrypi:/home/pac# docker exec -it storj_node_3 /app/storagenode exit-status --config-dir /app/config
2023-11-07T09:30:58Z INFO Configuration loaded {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-11-07T09:30:58Z INFO Anonymized tracing enabled {"process": "storagenode"}
2023-11-07T09:30:58Z INFO Identity loaded. {"process": "storagenode", "Node ID": "12WaY38WSMX14u7zshBdTvyvK7pwvr57dKdHsQPyBb3gtMvmJoE"}
Domain Name Node ID Percent Complete Successful Completion Receipt
saltlake.tardigrade.io:7777 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE 100.00% Y 0a47304502202a2f4a80070d43c1af0f2c8f60bec56e10fe59c7f45d6331847590378a39d784022100dcfc274a84a62bf1e555b20867e47794cef4b63cb7eb698cdf3879d63edeed0a12207b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd0000000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c0894ac9aaa0610a0eabd8702
ap1.storj.io:7777 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 100.00% Y 0a483046022100e994e35866b3c8782855761aeafffb2b556a1eef13f42cf1008afa684824bde7022100b49e28a8002ff2b2f42da1a06332c42c188c963e5681042a9506cd1026ee3b59122084a74c2cd43c5ba76535e1f42f5df7c287ed68d33522782f4afabfdb400000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c08badb9baa0610909da7aa02
us1.storj.io:7777 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S 100.00% Y 0a473045022100a7fdbb55ab661550788ed2da3dcc6b7bfc96a7320da89c5c37d37b70d0b83f6e02204a7b6a658f333db7967271b262eac03d62043d2b18c06a34a3ebbd99ee4d7b121220a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec00000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220b08cbbb9eaa06108894ed3f
eu1.storj.io:7777 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs 8.71% N N/A
I know the percentage does not mean anything (other than the fact that the node had transferred 8.71% of its data while the old graceful-exit system was still in place).
Is this expected? I thought the new graceful-exit mechanism was supposed to complete automatically after 30 days.
So I’m not sure whether I can safely shut down this node yet. On the dashboard, every satellite has an activity chart that looks like this, which clearly shows that all transfers have ended:
What information regarding exit status do you see when you run the command storagenode exit-status?
Does the exit status percentage still increase for EU1?
You can see the result of /app/storagenode exit-status in my previous post.
If I run it again right now (5+ hours later), here is the output:
root@raspberrypi:/home/pac# docker exec -it storj_node_3 /app/storagenode exit-status --config-dir /app/config
2023-11-07T14:54:57Z INFO Configuration loaded {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-11-07T14:54:57Z INFO Anonymized tracing enabled {"process": "storagenode"}
2023-11-07T14:54:57Z INFO Identity loaded. {"process": "storagenode", "Node ID": "12WaY38WSMX14u7zshBdTvyvK7pwvr57dKdHsQPyBb3gtMvmJoE"}
Domain Name Node ID Percent Complete Successful Completion Receipt
saltlake.tardigrade.io:7777 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE 100.00% Y 0a47304502202a2f4a80070d43c1af0f2c8f60bec56e10fe59c7f45d6331847590378a39d784022100dcfc274a84a62bf1e555b20867e47794cef4b63cb7eb698cdf3879d63edeed0a12207b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd0000000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c0894ac9aaa0610a0eabd8702
ap1.storj.io:7777 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 100.00% Y 0a483046022100e994e35866b3c8782855761aeafffb2b556a1eef13f42cf1008afa684824bde7022100b49e28a8002ff2b2f42da1a06332c42c188c963e5681042a9506cd1026ee3b59122084a74c2cd43c5ba76535e1f42f5df7c287ed68d33522782f4afabfdb400000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c08badb9baa0610909da7aa02
us1.storj.io:7777 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S 100.00% Y 0a473045022100a7fdbb55ab661550788ed2da3dcc6b7bfc96a7320da89c5c37d37b70d0b83f6e02204a7b6a658f333db7967271b262eac03d62043d2b18c06a34a3ebbd99ee4d7b121220a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec00000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220b08cbbb9eaa06108894ed3f
eu1.storj.io:7777 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs 8.71% N N/A
=> The percentage did not move: still at 8.71%.
But my understanding is that it’s not supposed to change anymore with the new graceful-exit system anyway, so why would it change?
Our devs are looking into this issue now. Please keep the node online in the meantime. And you are right, the percent completed in the new Graceful Exit is no longer meaningful.
I’m entirely baffled so far. Everything looks like it’s working, yet it’s clearly been more than 30 days and your node hasn’t been marked as successfully exited yet. It’s also not marked disqualified, suspended, or failed. And it’s definitely still checking in.
Would you mind terribly turning on debug logging on your node? It won’t require a restart, and it’s not much more verbose than info level. Assuming you haven’t changed the configuration for the debug port, you can accomplish this using the following clunky command:
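Treat this as a sketch rather than the exact incantation, as I’m assuming the debug port is still on the default 5999, that the container has GNU wget, and that the /logging endpoint accepts a PUT with a JSON body to change the log level:
# Sketch: flip the node's log level to debug at runtime (assumes the default
# debug port 5999 and GNU wget inside the container; busybox wget lacks --method).
docker exec -it storj_node_3 wget -q -O - --method=PUT \
  --header='Content-Type: application/json' \
  --body-data='{"level":"debug"}' \
  http://localhost:5999/logging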
Getting debug logs for only a few minutes ought to be enough to say whether your node has a running gracefulexit worker for the eu1 satellite, and whether the worker is actually trying to contact the satellite gracefulexit endpoint.
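For example, something like this would do (a sketch; it assumes the node logs to the container’s stdout/stderr, which is the default, and the grep pattern is my guess at the relevant logger names):
# Grab the last few minutes of logs and keep only the graceful-exit lines
# (2>&1 because the node logs to stderr).
docker logs --since 10m storj_node_3 2>&1 | grep -iE 'gracefulexit|graceful exit'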
Edit 2023-11-08T22:01:42Z: from what I can see, your node is not trying to contact the graceful exit endpoint. It is checking in every hour as it should, but not asking if it’s done with GE. It has to do so in order for the satellite to say “yes, you’re done” and mark it as complete. I can’t yet tell why it isn’t. My best theory is that something went wrong with the node’s satellite db, and it doesn’t remember that it’s supposed to be gracefully exiting. The debug logs might explain why, or maybe a copy of your satellites.db.
I’m not sure if I did something wrong, but the provided command doesn’t seem to work, although I did not change anything with regard to the debug port, AFAIK…
Received! Thank you. Everything looks correct in the satellites.db, unfortunately (unfortunately because it would have been a good clue :D). I’m now very curious to see what is going on in your debug logs. Unfortunately, after doing a whole lot of digging, I think that our storagenode docker image has a flaw, where the updater process opens a debug port and the node itself does not. Yours might not have a debug port open at all.
We can do just a small amount of extra checking to be sure:
If you try docker exec -it storagenode cat config/config.yaml, you should see a short config file that I think will end with the line
debug.addr: ":5999"
If it is set to a different number, we can retry the wget command and maybe it will help, or maybe not.
If it is set to 5999, there might be nothing we can do aside from restarting the node. I expect that, at least, will cause it to restart the graceful exit worker and complete the exit. I understand if you don’t want to do that and incur the extra disk load from the filewalker. If you’d prefer, I can just mark your exit as completed manually (because of course it should be).
If you do choose to restart the node, could you run it with LOG_LEVEL=debug?
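If the node is started with docker run, that would just mean recreating the container with the extra environment variable (a sketch; keep all of your usual mounts, ports, and other options):
# Stop the node gracefully, remove the container, and recreate it with debug logging.
docker stop -t 300 storj_node_3
docker rm storj_node_3
docker run -d --name storj_node_3 \
  -e LOG_LEVEL=debug \
  <your existing mounts, ports, and other -e flags> \
  storjlabs/storagenode:latest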
@thepaul I’m okay with trying out a few things to help you investigate. I can restart the node, but to be sure: what should I do before and after? I suppose something along these lines:
Reconfigure config.yaml with the debug.* properties? (Please let me know what values to set; see my guess after this list.)
Enable “debug” log level by setting log.level: debug in config.yaml.
Restart the node.
Try the docker exec -it storj_node_3 wget [...] -O - http://localhost:5999/logging command again, and capture its output for a few minutes.
Does that sound right? (Please let me know what properties I should uncomment and set to what values).
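Something like this in config.yaml, if I understand correctly (just my guess based on what was quoted above; the debug.addr value is the part I’m unsure about):
# my guess at the relevant lines in /app/config/config.yaml
log.level: debug
# debug.addr: ":5999"    <- uncomment and set to... what, exactly?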
Oh interesting. If it has 127.0.0.1:0, then it will have chosen any available port for its debug port; we just have to figure out what it is. We probably don’t need to restart it after all. I must have done something funny to my own node to have 5999 hardcoded in the yaml.
I think this bunch of awk will suss out the right debug port (or at least give us a few options):
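Roughly this, as a sketch (assuming ss is available inside the container; the awk just pulls the port number off each listening socket owned by a storagenode process):
# List the TCP ports the storagenode processes are listening on.
docker exec -it storj_node_3 sh -c \
  'ss -tlnp | awk "/storagenode/ { split(\$4, a, \":\"); print a[length(a)] }" | sort -un'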
That probably prints out 14002, 7778, 28967, plus one more number, and the “one more number” should be the right port. Then you just plug that into the wget command:
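Substituting the extra number for PORT (again a sketch, with the same GNU wget and /logging assumptions as before):
# Same log-level request as before, aimed at the discovered debug port.
docker exec -it storj_node_3 wget -q -O - --method=PUT \
  --header='Content-Type: application/json' \
  --body-data='{"level":"debug"}' \
  http://localhost:PORT/logging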
However, the only debug port is 7777, and it belongs to the updater. Port 39245 does not respond at all, 7778 is a local port for the CLI dashboard (and graceful exit, …), 14002 is the console port, and 28967 is the node’s port.
So I do not believe that debug is enabled on the storagenode without an explicit option.
In fact, that’s exactly what I was expecting! I don’t know why there are three ports, but they all responded with {"level":"debug"}, which means they successfully set the debug level.
the node was working on deleting pieces after completing its graceful exit from US1 and AP1
graceful exit for AP1 completed at 2023-11-05T01:19Z
it was deleting AP1 pieces from 2023-11-05T01:19Z to 2023-11-05T13:47Z
graceful exit for US1 completed at 2023-11-05T13:51Z
it was deleting US1 pieces from 2023-11-05T13:51Z to 2023-11-09T09:05Z
graceful exit for EU1 completed at 2023-11-09T09:04Z
it has been deleting EU1 pieces since then
So clearly, the node is waiting until cleanup for one satellite is done before allowing the next graceful exit to complete. Which is good and fine, because it should reduce thrashing on the disk. The only problem is, I don’t see where in the code we’ve asked it to wait like that! From what I can see, it’s supposed to allow GE to complete on all the satellites at the same time if that’s what the satellites say, and not worry about cleanup contention. But I guess it’s not worth a lot more investigation right now.
TL;DR: everything was fine and we just needed to wait for cleanup of another graceful exit to complete.
Thank you @thepaul & @heunland & @Alexey for your fast answers and detailed investigations.
It’s nice to see you guys are still carefully looking after your software.
Well, I’d say that even if this behavior is a good thing, it’s a pretty good reason to display the graceful-exit progress (or at least its status) on nodes’ dashboards, so we know when the graceful exit is actually finished, as suggested there:
Now that we know a node can take more than 30 days to exit (on its side at least?), I’m wondering if @arrogantrabbit’s comment there is still valid:
Could a process be stalling because a disk is too busy cleaning up some files? Most of my disks were SMR, which caused incredibly long operations sometimes (the filewalker could take more than 36 hours to complete!).
I mean, yes, in theory. But this thread is all about one of my nodes that did not finish on time: it took an extra 4+ days to fully complete its graceful exit.
So it might not be as simple as waiting 30 days before shutting down a gracefully exiting node.
But maybe on the satellites’ side it’s all done after 30 days, and nodes can be safely shut down at that point even though they do not report the exit as finished… I thought it was worth investigating, though.