I started a graceful exit on one of my nodes on the 5th of October.
At that point, the new graceful-exit system was not live yet; it went live roughly 10 to 15 days later.
Today, 33 days after the start of the procedure, all satellites are marked as completed except one (eu1.storj.io):
root@raspberrypi:/home/pac# docker exec -it storj_node_3 /app/storagenode exit-status --config-dir /app/config
2023-11-07T09:30:58Z INFO Configuration loaded {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-11-07T09:30:58Z INFO Anonymized tracing enabled {"process": "storagenode"}
2023-11-07T09:30:58Z INFO Identity loaded. {"process": "storagenode", "Node ID": "12WaY38WSMX14u7zshBdTvyvK7pwvr57dKdHsQPyBb3gtMvmJoE"}
Domain Name Node ID Percent Complete Successful Completion Receipt
saltlake.tardigrade.io:7777 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE 100.00% Y 0a47304502202a2f4a80070d43c1af0f2c8f60bec56e10fe59c7f45d6331847590378a39d784022100dcfc274a84a62bf1e555b20867e47794cef4b63cb7eb698cdf3879d63edeed0a12207b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd0000000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c0894ac9aaa0610a0eabd8702
ap1.storj.io:7777 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 100.00% Y 0a483046022100e994e35866b3c8782855761aeafffb2b556a1eef13f42cf1008afa684824bde7022100b49e28a8002ff2b2f42da1a06332c42c188c963e5681042a9506cd1026ee3b59122084a74c2cd43c5ba76535e1f42f5df7c287ed68d33522782f4afabfdb400000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c08badb9baa0610909da7aa02
us1.storj.io:7777 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S 100.00% Y 0a473045022100a7fdbb55ab661550788ed2da3dcc6b7bfc96a7320da89c5c37d37b70d0b83f6e02204a7b6a658f333db7967271b262eac03d62043d2b18c06a34a3ebbd99ee4d7b121220a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec00000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220b08cbbb9eaa06108894ed3f
eu1.storj.io:7777 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs 8.71% N N/A
I know the percentage does not mean anything (other than the fact that the node had transferred 8.71% of its data while the old graceful-exit system was still in place).
Is this expected? I thought the new graceful-exit mechanism was supposed to complete automatically after 30 days.
So I’m not sure whether I can safely shut down this node yet. On the dashboard, every satellite has an activity chart that looks like this, which clearly shows that all transfers have ended:
What information regarding exit status do you see when you run the command storagenode exit-status?
Does the exit status percentage still increase for EU1?
You can see the result of /app/storagenode exit-status in my previous post.
If I run it again right now (5+ hours later), here is the output:
root@raspberrypi:/home/pac# docker exec -it storj_node_3 /app/storagenode exit-status --config-dir /app/config
2023-11-07T14:54:57Z INFO Configuration loaded {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-11-07T14:54:57Z INFO Anonymized tracing enabled {"process": "storagenode"}
2023-11-07T14:54:57Z INFO Identity loaded. {"process": "storagenode", "Node ID": "12WaY38WSMX14u7zshBdTvyvK7pwvr57dKdHsQPyBb3gtMvmJoE"}
Domain Name Node ID Percent Complete Successful Completion Receipt
saltlake.tardigrade.io:7777 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE 100.00% Y 0a47304502202a2f4a80070d43c1af0f2c8f60bec56e10fe59c7f45d6331847590378a39d784022100dcfc274a84a62bf1e555b20867e47794cef4b63cb7eb698cdf3879d63edeed0a12207b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd0000000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c0894ac9aaa0610a0eabd8702
ap1.storj.io:7777 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 100.00% Y 0a483046022100e994e35866b3c8782855761aeafffb2b556a1eef13f42cf1008afa684824bde7022100b49e28a8002ff2b2f42da1a06332c42c188c963e5681042a9506cd1026ee3b59122084a74c2cd43c5ba76535e1f42f5df7c287ed68d33522782f4afabfdb400000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220c08badb9baa0610909da7aa02
us1.storj.io:7777 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S 100.00% Y 0a473045022100a7fdbb55ab661550788ed2da3dcc6b7bfc96a7320da89c5c37d37b70d0b83f6e02204a7b6a658f333db7967271b262eac03d62043d2b18c06a34a3ebbd99ee4d7b121220a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec00000001a20c6db64360bcbf6872f459b8e2e74f44b84d112ae3a67dd50f783400000000000220b08cbbb9eaa06108894ed3f
eu1.storj.io:7777 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs 8.71% N N/A
=> The percentage did not move: still at 8.71%.
But my understanding is that it’s not supposed to change anymore with the new graceful-exit system anyway, so why would it change?
Our devs are looking into this issue now. Please keep the node online in the meantime. And you are right, the percent completed in the new Graceful Exit is no longer meaningful.
I’m entirely baffled so far. Everything looks like it’s working, yet it’s clearly been more than 30 days and your node hasn’t been marked as successfully exited yet. It’s also not marked disqualified, suspended, or failed. And it’s definitely still checking in.
Would you mind terribly turning on debug logging on your node? It won’t require a restart, and it’s not much more verbose than info level. Assuming you haven’t changed the configuration for the debug port, you can accomplish this using the following clunky command:
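Treat this as a sketch rather than the exact incantation, as I’m assuming the debug port is still on the default 5999, that the container has GNU wget, and that the /logging endpoint accepts a PUT with a JSON body to change the log level:
# Sketch: flip the node's log level to debug at runtime (assumes the default
# debug port 5999 and GNU wget inside the container; busybox wget lacks --method).
docker exec -it storj_node_3 wget -q -O - --method=PUT \
  --header='Content-Type: application/json' \
  --body-data='{"level":"debug"}' \
  http://localhost:5999/logging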
Getting debug logs for only a few minutes ought to be enough to say whether your node has a running gracefulexit worker for the eu1 satellite, and whether the worker is actually trying to contact the satellite gracefulexit endpoint.
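For example, something like this would do (a sketch; it assumes the node logs to the container’s stdout/stderr, which is the default, and the grep pattern is my guess at the relevant logger names):
# Grab the last few minutes of logs and keep only the graceful-exit lines
# (2>&1 because the node logs to stderr).
docker logs --since 10m storj_node_3 2>&1 | grep -iE 'gracefulexit|graceful exit'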
Edit 2023-11-08T22:01:42Z: from what I can see, your node is not trying to contact the graceful exit endpoint. It is checking in every hour as it should, but not asking if it’s done with GE. It has to do so in order for the satellite to say “yes, you’re done” and mark it as complete. I can’t yet tell why it isn’t. My best theory is that something went wrong with the node’s satellite db, and it doesn’t remember that it’s supposed to be gracefully exiting. The debug logs might explain why, or maybe a copy of your satellites.db.
I’m not sure if I did something wrong, but the provided command doesn’t seem to work, although I did not change anything with regard to the debug port, AFAIK…
Received! Thank you. Everything looks correct in the satellites.db, unfortunately (unfortunately because it would have been a good clue :D). I’m now very curious to see what is going on in your debug logs. Unfortunately, after doing a whole lot of digging, I think that our storagenode docker image has a flaw, where the updater process opens a debug port and the node itself does not. Yours might not have a debug port open at all.
We can do just a small amount of extra checking to be sure:
If you try docker exec -it storagenode cat config/config.yaml, you should see a short config file that I think will end with the line
debug.addr: ":5999"
If it is set to a different number, we can retry the wget command and maybe it will help, or maybe not.
If it is set to 5999, there might be nothing we can do aside from restarting the node. I expect that, at least, will cause it to restart the graceful exit worker and complete the exit. I understand if you don’t want to do that and incur the extra disk load from the filewalker. If you’d prefer, I can just mark your exit as completed manually (because of course it should be).
If you do choose to restart the node, could you run it with LOG_LEVEL=debug?
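If the node is started with docker run, that would just mean recreating the container with the extra environment variable (a sketch; keep all of your usual mounts, ports, and other options):
# Stop the node gracefully, remove the container, and recreate it with debug logging.
docker stop -t 300 storj_node_3
docker rm storj_node_3
docker run -d --name storj_node_3 \
  -e LOG_LEVEL=debug \
  <your existing mounts, ports, and other -e flags> \
  storjlabs/storagenode:latest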
@thepaul I’m okay with trying out a few things to help you investigate. I can restart the node, but to be sure: what should I do before and after? I suppose something along these lines:
Reconfigure config.yaml with the debug.* properties? (Please let me know what values to set; see my guess after this list.)
Enable “debug” log level by setting log.level: debug in config.yaml.
Restart the node.
Try the docker exec -it storj_node_3 wget [...] -O - http://localhost:5999/logging command again, and capture its output for a few minutes.
Does that sound right? (Please let me know what properties I should uncomment and set to what values).
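Something like this in config.yaml, if I understand correctly (just my guess based on what was quoted above; the debug.addr value is the part I’m unsure about):
# my guess at the relevant lines in /app/config/config.yaml
log.level: debug
# debug.addr: ":5999"    <- uncomment and set to... what, exactly?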
Oh interesting. If it has 127.0.0.1:0, then it will have chosen any available port for its debug port; we just have to figure out what it is. We probably don’t need to restart it after all. I must have done something funny to my own node to have 5999 hardcoded in the yaml.
I think this bunch of awk will suss out the right debug port (or at least give us a few options):
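Roughly this, as a sketch (assuming ss is available inside the container; the awk just pulls the port number off each listening socket owned by a storagenode process):
# List the TCP ports the storagenode processes are listening on.
docker exec -it storj_node_3 sh -c \
  'ss -tlnp | awk "/storagenode/ { split(\$4, a, \":\"); print a[length(a)] }" | sort -un'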
That probably prints out 14002, 7778, 28967, plus one more number, and the “one more number” should be the right port. Then you just plug that into the wget command:
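Substituting the extra number for PORT (again a sketch, with the same GNU wget and /logging assumptions as before):
# Same log-level request as before, aimed at the discovered debug port.
docker exec -it storj_node_3 wget -q -O - --method=PUT \
  --header='Content-Type: application/json' \
  --body-data='{"level":"debug"}' \
  http://localhost:PORT/logging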
However, the only debug port is 7777, and it belongs to the updater. Port 39245 does not respond at all, 7778 is a local port for the CLI dashboard (and graceful exit, …), 14002 is the console port, and 28967 is the node’s port.
So I do not believe that debug is enabled on the storagenode without an explicit option.
In fact, that’s exactly what I was expecting! I don’t know why there are three ports, but they all responded with {"level":"debug"}, which means they successfully set the debug level.
the node was working on deleting pieces after completing its graceful exit from US1 and AP1
graceful exit for AP1 completed at 2023-11-05T01:19Z
it was deleting AP1 pieces from 2023-11-05T01:19Z to 2023-11-05T13:47Z
graceful exit for US1 completed at 2023-11-05T13:51Z
it was deleting US1 pieces from 2023-11-05T13:51Z to 2023-11-09T09:05Z
graceful exit for EU1 completed at 2023-11-09T09:04Z
it has been deleting EU1 pieces since then
So clearly, the node is waiting until cleanup for one satellite is done before allowing the next graceful exit to complete. Which is good and fine, because it should reduce thrashing on the disk. The only problem is, I don’t see where in the code we’ve asked it to wait like that! From what I can see, it’s supposed to allow GE to complete on all the satellites at the same time if that’s what the satellites say, and not worry about cleanup contention. But I guess it’s not worth a lot more investigation right now.
TL;DR: everything was fine and we just needed to wait for cleanup of another graceful exit to complete.
Thank you @thepaul & @heunland & @Alexey for your fast answers and detailed investigations.
It’s nice to see you guys are still carefully looking after your software.
Well, I’d say that even if this behavior is a good thing, it’s a pretty good reason to display the graceful-exit progress (or at least its status) on nodes’ dashboards, so we know when the graceful exit is actually finished, as suggested there:
Now that we know a node can take more than 30 days to exit (on its side at least?), I’m wondering if @arrogantrabbit’s comment there is still valid:
Could a process be stalling because a disk is too busy cleaning up some files? Most of my disks were SMR, which caused incredibly long operations sometimes (the filewalker could take more than 36 hours to complete!).
I mean, yes, in theory. But this thread is all about one of my nodes that did not finish on time: it took an extra 4+ days to fully complete its graceful exit.
So it might not be as simple as waiting 30 days before shutting down a gracefully exiting node.
But maybe on the satellites’ side it’s all done after 30 days, and nodes can be safely shut down at that point even though they do not report the exit as finished… I thought it was worth investigating, though.