100% Disk workload without any traffic

My node is causing 100% workload on my disk without having any traffic; not a single piece has been uploaded or downloaded for 2 days (February 28th-29th). And yes, the node seems to be online (according to the CLI dashboard and uptimemonitor.com).
According to the system monitor, the workload is mainly caused by reading processes.

I also noticed that there were a lot of “upload failed” errors (12004 in 2 days (February 26th-27th), i.e. a success rate of 35%) and suspect that this is related to the high workload.

And how could it be otherwise: after hours of thinking “maybe it will end soon” and finally writing this post, the workload suddenly stopped. Now there’s no workload (and no traffic either).

Does anybody have an idea what’s going on with my node?

I think it is not only your node. If you look at the space used, I think you will find that you are getting more and more free space; today I saw that my node had 50GB more free space.
So there are just a lot of delete operations going on.

2 Likes

Probably it’s just the “house cleaning” in preparation for Production :hugs:
It should be about to “hit the fan” :sunglasses:

It may also be the “move” operation to trash, because the node for some reason copies the file and then deletes the original instead of just moving it.

2 Likes

This is incomprehensible overhead. Have you seen that place in the code?

1 Like

I did not see that part in the code, but saw it from the behavior: IO at 100% (write) and the trash directory filling up at about the same rate as the IO write load, with the total used space staying about the same.

1 Like

Then it is a point to check, because it is really undesired behavior.

1 Like

There is no freed space, still 4TB used as before (according to the CLI dashboard).
And so far I have not had a single byte of traffic this month.

@Pentium100 As I mentioned, the workload came mainly from reading processes; there was almost no writing.
And I also couldn’t understand why they would move the whole content of a disk into another directory. IMHO this would be a really unfortunate design decision for the following reasons:

  • knocking out nodes with overwhelming workload decreases the performance of the entire network.
  • it forces hardware to break earlier by doubling the read/write cycles.

Why would they do this if they could also simply mark the pieces as deleted (in the database)?

That would not create a lot of IO if it were done with the equivalent of “mv old new” instead of “cp old new && rm old”. Again, I have not looked at the code; I am just inferring this from the way I saw the node behave.
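As a rough illustration of the difference (just a sketch on a scratch directory, not the storagenode code itself):

```bash
# 1 GiB test file on the same filesystem as a "trash" directory
dd if=/dev/zero of=piece.test bs=1M count=1024
mkdir -p trash

# A move within one filesystem only rewrites directory metadata,
# so it finishes near-instantly with almost no disk IO.
time mv piece.test trash/piece.test

# Copy-then-delete reads and rewrites the full 1 GiB,
# roughly doubling the IO for the same end result.
time { cp trash/piece.test piece.test && rm trash/piece.test; }
```

If the trash directory sits on a different filesystem, even mv degrades to copy+delete, which would also explain the write load I saw.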

I missed that, sorry. The only time my node gets a lot of read IO is when it is restarted (after the whole VM is restarted so the cache is empty) as it is reading the database.

Guess the restart is causing this behaviour. Thanks for your answer.

In a desperate move I deleted my node and set it up again. This helped before, but not now - now the node is offline, and I can’t see why. Port forwarding works on another port, but the SN port appears closed from outside.
Well, as I read in another post, it can take a while until it gets back online. I’ll take another look after my afternoon study session… (in 6-10 hours)
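For reference, this is roughly how I check whether the port is reachable from outside (the address and 28967 are only placeholders for my actual public IP/DDNS name and node port):

```bash
# Run this from a machine outside your LAN (e.g. a phone hotspot or a VPS).
# Replace the address and port with your own public address and node port.
nc -zv 203.0.113.10 28967
```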

Oh, by the way, I put an invalid link in the initial post. It should be uptimerobot.com.
…sorry for that…

How did you do that?
Did you completely remove the data and identity, or only the docker container or the Windows GUI?

I removed the docker container and set it up again, but this time I added -p 14002:14002 in order to access the GUI dashboard. And on the GUI dashboard I could see that my node got disqualified on all satellites. That was a hard insight after running it successfully for 9 months.

My conclusion was that I can’t do anything once it’s disqualified and that I have to take the loss of the withheld amount. So I set up a new node with a new identity and I’m deleting the old data now. I did make a copy of the log file, but I actually have no time to really investigate what happened.
I guess it went partially offline about a week ago when I had to change subnets in my network to make my mining farm run again. But I was under the impression that I had updated port forwarding correctly. I also restarted the router and the node (Synology DS218+), and uptimerobot.com did not report the node to be down (or rather, the port to be unresponsive)…

Not sure how this happened, but if I remember correctly I had a similar situation before where the port stayed open and reachable from the internet but the node was (at least partially) unresponsive…

But it might have been a bad decision to set up the new identity before deleting the old data. Deleting the data again causes 100% disk workload and, as a side effect, a pretty low upload success rate.
[screenshot: successrate.sh output]

Please tell me how you get the info shown in that image. What’s the command for it? Thank you.

This could also be your antivirus software; if you are on Windows, once a month it does a complete system scan for malware, etc.

There’s no antivirus on my Synology NAS. And the workload also stopped immediately when I stopped the container.

It’s called successrate.sh… you have to download it. Take a look here. But I’m not sure if there is something like it for Windows…
https://forum.storj.io/t/can-t-find-the-successrate-sh-file/4889
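Roughly like this once you have downloaded it (the container name is whatever you used in your docker run; “storagenode” is just the common default):

```bash
# Make the script executable and run it against your node's docker container.
chmod +x successrate.sh
./successrate.sh storagenode
```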

1 Like

There is:

2 Likes

No, the only possible way right now is failing too many audits, i.e. lost or inaccessible data.
What does your docker run look like? Have you replaced the -v options with --mount?
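For reference, a docker run in the --mount style looks roughly like this (paths, wallet, address and image tag are placeholders here; please check the official setup documentation for the exact flags of your version):

```bash
docker run -d --restart unless-stopped \
    -p 28967:28967 \
    -p 14002:14002 \
    -e WALLET="0xYOUR_WALLET_ADDRESS" \
    -e EMAIL="you@example.com" \
    -e ADDRESS="your.ddns.example.com:28967" \
    -e STORAGE="4TB" \
    --mount type=bind,source=/path/to/identity/storagenode,destination=/app/identity \
    --mount type=bind,source=/path/to/storage,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest
```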

2 Likes

Yes, it must have been sometime last year. It’s with the --mount option in my setup script.

I am pretty sure there were no failed audits. Although I did not make a screenshot of successrate.sh, I did check it, and there was no sign of an error apart from the really low upload success rate (as I mentioned in the initial post).
I also checked the last day with traffic in the copy of the log file, and it looks to me as if there was not a single failed audit - just plenty of successful audits (one “download started” entry followed by one “downloaded” entry, category “GET_AUDIT”).
But maybe there were failed audits and for some reason they did not get written to the logs.
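For what it’s worth, this is roughly how I counted them in the saved log copy (the filename is just a placeholder, and the grep patterns assume the current log wording):

```bash
# Successful audit downloads in the saved node log
grep GET_AUDIT storagenode.log | grep -c "downloaded"
# Failed audit downloads (none found in my case)
grep GET_AUDIT storagenode.log | grep -c "download failed"
```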

Node ID was: 1MgCzp5ByecEfxUVm3rXpVX5aGXkyoSjBnS7kiBfeduuuLzink