My node is causing 100% workload on my disk without having any traffic; not a single piece has been uploaded or downloaded for two days (February 28th-29th). And yes, the node seems to be online (according to the CLI dashboard and uptimerobot.com).
According to the system monitor, the workload is mainly caused by read operations.
I also noticed that there were a lot of "upload failed" errors (12,004 over two days (February 26th-27th), i.e. a success rate of 35%) and suspect it is related to the high workload.
And how could it be otherwise: after hours of thinking "maybe it will end soon", and finally writing this post, the workload suddenly stopped. Now there's no workload (and no traffic either).
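In case someone wants to reproduce the check: this is roughly how I separated read load from write load (assuming the iostat tool from the sysstat package is available on your box; device names will differ on your system):

```
# Extended per-device IO statistics, refreshed every second:
iostat -x 1
# rkB/s vs wkB/s separates read from write throughput;
# %util near 100 means the disk is saturated.
```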
I think it is not only your node. If you look at the space used, you will probably find that you are getting more and more free space. Today I saw that my node has 50 GB more free space.
So there are just a lot of delete operations.
I did not see that part in the code, but inferred it from the behavior: IO at 100% (write) and the trash directory filling up at about the same rate as the IO write load, with the total used space staying about the same.
There is no freed space; it's still 4 TB used, same as before (according to the CLI dashboard).
And so far I have not had a single byte of traffic this month.
@Pentium100 As I mentioned, the workload came mainly from read operations; there was almost no writing.
And I also couldn't understand why they would move the whole contents of a disk into another directory. IMHO this would be a really unfortunate design decision, for the following reasons:
- knocking out nodes with overwhelming workload decreases the performance of the entire network;
- it wears the hardware out earlier by doubling the read/write cycles.
Why would they do this if they could simply mark the pieces as deleted (in the database)?
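To make it concrete, what I have in mind is a kind of soft delete; a purely hypothetical sketch (the database file, table and column names are made up, this is not Storj's actual schema):

```
# Flip a flag in the node's database instead of touching the piece files:
sqlite3 /storage/pieceinfo.db \
  "UPDATE pieceinfo SET deleted = 1 WHERE piece_id = 'piece123';"
```

A background job could then remove the flagged pieces physically, throttled so it doesn't saturate the disk.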
That would not create a lot of IO if it was done using the equivalent of mv old new instead of cp old new && rm old. Again, I have not looked at the code; I think it works like that from the way I saw the node behave.
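To illustrate what I mean (the paths are made up):

```
# A rename within the same filesystem only updates metadata, so it is cheap:
mv /storage/blobs/ab/piece123.sj1 /storage/trash/ab/piece123.sj1

# A copy followed by a delete re-reads and re-writes the whole file:
cp /storage/blobs/ab/piece123.sj1 /storage/trash/ab/piece123.sj1 \
  && rm /storage/blobs/ab/piece123.sj1
```

As long as the trash directory is on the same disk as the blobs, the first variant should barely register in IO.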
I missed that, sorry. The only time my node gets a lot of read IO is when it is restarted (after the whole VM is restarted, so the cache is empty), as it is reading the database.
I guess the restart is causing this behaviour. Thanks for your answer.
In a desperate move I deleted my node and set it up again. This helped before, but not now: now the node is offline, and I can't see why. Port forwarding works on another port, but the SN port appears closed from outside.
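This is how I checked the port from a machine outside my network (28967 is the default storagenode port; substitute your own external address):

```
# -z: only scan, don't send data; -v: verbose output
nc -zv my.external.address 28967
```

Keep in mind that an open port only proves something is listening, not that the node behind it is healthy.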
Well, as I read in another post, it can take a while until it gets back online. I'll take a look again after my afternoon study session… (in 6-10 hours)
I removed the docker container and set it up again, but this time I added -p 14002:14002 in order to access the GUI dashboard. And on the GUI dashboard I could see that my node got disqualified on all satellites. That was a hard insight after running it successfully for 9 months.
My conclusion was that I can't do anything once it's disqualified and that I have to take the loss of the withheld amount. So I set up a new node with a new identity, and I'm deleting the old data now. I did make a copy of the log file, but I actually have no time to really investigate what happened.
I guess it went partially offline about a week ago when I had to change subnets in my network to get my mining farm running again. But I had the feeling that I had updated the port forwarding correctly. I also restarted the router and the node (Synology DS218+), and uptimerobot.com did not report the node as down (or rather, the port as unresponsive)…
Not sure how this happened, but if I remember correctly I had a similar situation before where the port stayed open and reachable from the internet but the node was (at least partially) unresponsive…
But it might have been a bad decision to set up the new identity before deleting the old data. Deleting the data is again causing 100% disk workload and, as a side effect, a pretty low upload success rate.
No, the only possible way to get disqualified right now is failing too many audits, i.e. lost or inaccessible data.
What does your docker run command look like? Have you replaced the -v options with --mount?
Yes, that must have been sometime last year. My setup script uses the --mount option.
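It looks roughly like this (wallet, email, address, storage size and paths are placeholders for my actual values):

```
docker run -d --restart unless-stopped \
  -p 28967:28967 \
  -p 14002:14002 \
  -e WALLET="0x..." \
  -e EMAIL="me@example.com" \
  -e ADDRESS="my.external.address:28967" \
  -e STORAGE="4TB" \
  --mount type=bind,source=/volume1/storj/identity,destination=/app/identity \
  --mount type=bind,source=/volume1/storj/data,destination=/app/config \
  --name storagenode storjlabs/storagenode:latest
```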
I am pretty sure there were no failed audits. Although I did not take a screenshot of successrate.sh, I did check it, and there was no sign of an error, apart from the really low upload success rate (as I mentioned in the initial post).
I also checked the last day with traffic in the copy of the log file, and it looks to me as if there was not a single failed audit, just plenty of successful ones (one "download started" entry followed by one "downloaded" entry, category "GET_AUDIT").
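This is roughly how I went through the copy (node.log is just where I saved it; the exact log wording may differ between versions):

```
# Count started vs completed audit downloads:
grep GET_AUDIT node.log | grep -c "download started"
grep GET_AUDIT node.log | grep -c "downloaded"
# Look for anything that went wrong with audits:
grep GET_AUDIT node.log | grep -iE "failed|error"
```

In my case the started and downloaded counts matched, and the last command found nothing, which is why I believe no audits failed.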
But maybe there were failed audits that for some reason did not get saved in the logs.
Node ID was: 1MgCzp5ByecEfxUVm3rXpVX5aGXkyoSjBnS7kiBfeduuuLzink