Suggestion: Watchtower storagenode updates on RAID/ZFS arrays

Like the majority of people here, I'm using watchtower to manage updates and the subsequent container restarts. I have a RAID 6 array with multiple disks and 3 nodes, and have noticed (as many others have) that the filewalker task has a considerable impact on the disks when it runs across all nodes at the same time. One solution would be to sequentially update and restart each storage node when the filewalker task finishes, or after some delay.

We almost have something that can be used to accomplish this in watchtower: rolling restarts and lifecycle hooks.

From watchtower documentation:

Rolling restart
Restart one image at a time instead of stopping and starting all at once. Useful in conjunction with lifecycle hooks to implement zero-downtime deploys.

Lifecycle hooks
Executing commands before and after updating
It is possible to execute pre/post-check and pre/post-update commands inside every container updated by watchtower.

  • The pre-check command is executed for each container prior to every update cycle.
  • The pre-update command is executed before stopping the container when an update is about to start.
  • The post-update command is executed after restarting the updated container.
  • The post-check command is executed for each container post every update cycle.

(…)

Timeouts
The timeout for all lifecycle commands is 60 seconds. After that, a timeout will occur, forcing Watchtower to continue the update loop.
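
For context, here is a minimal sketch of how those two features are wired together (the container names and the /app/wait-for-filewalker.sh hook script are hypothetical; the label and environment variables are from the watchtower docs):

```bash
# Storagenode container carrying a post-update lifecycle hook label.
# /app/wait-for-filewalker.sh is a hypothetical script, sketched further down.
# (Other required storagenode flags/mounts omitted for brevity.)
docker run -d --name storagenode1 \
  --label com.centurylinklabs.watchtower.lifecycle.post-update="/app/wait-for-filewalker.sh" \
  storjlabs/storagenode:latest

# Watchtower with rolling restarts and lifecycle hooks enabled,
# scoped to the three node containers by name.
docker run -d --name watchtower \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e WATCHTOWER_ROLLING_RESTART=true \
  -e WATCHTOWER_LIFECYCLE_HOOKS=true \
  containrrr/watchtower storagenode1 storagenode2 storagenode3
```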

With the above, we could have a script that gets a time-delay variable from the environment or periodically checks whether the filewalker task has finished. I have created the following ticket in the watchtower repo: Rolling restart delay · Issue #675 · containrrr/watchtower · GitHub.

If the above gets approved, we could have a script in the storagenode container that periodically checks whether the filewalker task has finished; once it has, the post-update script exits with 0, allowing watchtower to update the next container.
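
Something like this is what I have in mind for the hook, assuming the 60-second timeout becomes configurable (which is what the linked issue asks for). The log path and the "finished" pattern here are assumptions and would need to match whatever the node actually logs:

```bash
#!/bin/sh
# Hypothetical post-update hook: block until the filewalker looks
# finished, then exit 0 so watchtower can move on to the next node.
LOG=/app/config/node.log              # assumed log location
PATTERN="filewalker completed"        # assumed completion marker
DEADLINE=$(( $(date +%s) + 3600 ))    # give up after 1 hour

while [ "$(date +%s)" -lt "$DEADLINE" ]; do
    if grep -q "$PATTERN" "$LOG"; then
        exit 0                        # filewalker done, continue the update loop
    fi
    sleep 30
done
exit 0  # timed out: continue anyway instead of blocking updates forever
```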

I would like to hear your suggestions. Thanks in advance!

EDIT: English is not my first language; I will gladly fix any grammar errors or typos. Let me know if that's the case.

2 Likes

I have the exact same problem.

Do you have 1 watchtower to update all nodes?

1 Like

Yes…

IIRC there was 1 SNO who had 3 nodes under 1 watchtower command. 2 nodes were updated at the same time while 1 was updated 2 days later. I think having 3 separate watchtowers would help update the nodes at random times.

IMO it makes more sense to have a single watchtower instance manage all updates than multiple instances. @nerdatwork, to be clear, I don't doubt that could work (you could still have overlapping storagenodes running the filewalker at the same time); I just think that spawning multiple watchtower instances to manage updates is not a good solution.

1 Like

I agree, 1 watchtower should be used for all updates. My comment was a workaround.

Also, you labeled your issue as low priority.

1 Like

Hmm, nice catch. I'm not able to change it; I think only repo mods can.

Later today I will submit a PR; the code seems pretty straightforward if I take the pre-update timeout code as an example.

According to Storj Labs, having multiple nodes on one medium isn't supposed to be allowed under the terms of service, since all the nodes are essentially the same node: if the array fails, the data from all nodes cannot be recovered.

Or that was the statement given by JT.O during the last Townhall… or at least how I understood what was being said.

Disregarding that point though… yes, the filewalker is a fairly demanding task, especially on RAID arrays because they have such poor IOPS.
Essentially, the larger your RAID array becomes, the bigger this problem will be, which is why most RAID solutions that require performance end up being some sort of mirror, or like in ZFS, where your data is load-balanced across multiple raidz vdevs once you have expanded the pool…

Whereas most conventional RAID setups will have to do something like RAID 60 to start seeing useful performance benefits compared to the IOPS of even a single drive…
So basically, with conventional RAID 6 at the minimum size for 33% capacity loss (6 drives, 2 of them parity), you will need 12 drives in your RAID 6+0 for it to perform better than a single drive… (rule of thumb: RAID 6 carries a random-write penalty of about 6, so a 6-drive group delivers roughly single-drive write IOPS, and striping two such groups roughly doubles it).

And to finish it off… I think watchtower is being phased out in favor of the Linux updater, or whatever they are calling it… but I haven't really looked into that much.

this might be the most recent stuff about that…
https://forum.storj.io/t/draft-tech-preview-linux-storage-node-updater/10018

Thus far Storj seems to have been pretty lenient on people not following the “rules”, which almost always end up being called recommendations… but I doubt that will last forever…
So I'm not saying people should trash their setups, but that they should keep in mind when expanding their capacity that there are “semi-new rules” which might be enforced more strictly at some point.

Aside from that, RAID can be a lot of work, especially if it's not configured correctly for what one is doing.
Stripe sizes are also extremely important for optimizing IOPS, though opinions on stripe-size configuration don't always agree. From my own testing on conventional RAID using an LSI 9280, I believe my best-performing stripe size was 256k, but back when I tested that, Storj was still on V2, so it may not be applicable any more. Most people recommend about 128k because it fits well with most file sizes used today… at least for the best average IOPS across a wide range of workloads.

For capacity I'm not sure… I would expect around the size of the smallest files, but I'm sure there is some math for that… RAID is difficult if one doesn't have a guide to follow or experience in how to run and configure it optimally.

Sorry for the rant, but check out the Linux updater stuff; it might be very relevant…
I also felt it was useful to outline why your setup isn't working well and the future problems of such a setup.

1 Like

I agree with this. You really shouldn't run more than one node on a single RAID array; 1 node is fine, but with more than 1 node you're really raising the chance of losing 3 nodes instead of 1. It's recommended to run each node on its own media; using RAID isn't even recommended, but of course it's the best option to max out the size of a node. Plus, there's no benefit to running more nodes on a single RAID array, so you're really going to hurt yourself in the end.

  • 4.1.4. You will provide and maintain the Storage Node so that, at all times, it will meet the following minimum requirements (“Minimum Storage Node Requirements”):
    • 4.1.4.1. Have a minimum of one (1) hard drive and one (1) processor core dedicated to each Storage Node;
2 Likes

I disagree. The only reason I can see for not running multiple nodes on a raid system is if it slows down your system when the filewalker is running.
For me personally, I run a raidz1 with 3 drives, and the filewalker process for 3 nodes, even run simultaneously, does not have any noticeable impact on node or system operations (the system itself, however, runs from an SSD, not the raidz1).

Apart from that, the remaining arguments are imho not correct.

While the argument itself is of course true (you could lose 3 nodes instead of 1), it is irrelevant, because the amount of data you lose is the same. Since all 3 nodes are in the same subnet, no duplicate pieces are stored on those nodes, so 3 nodes on the same array in the same subnet are exactly the same as 1 node with the same amount of data.

I can't see where you are hurting yourself, but I can see a couple of benefits:

  • If I need some space, I can shut down a smaller node.
  • If I want to expand, I can migrate a node to my new HDD/array and start with a node that already gives full profit.
  • Should the unfortunate scenario happen that someone else runs a node in my subnet, I have more “power” and will get a bigger share of the ingress than I would with only one node.

(Sorry for jumping in on this off-topic discussion)

I can propose using 3 watchtower instances at the same time; each instance would have a different update-check period for its node. That would be the easiest solution for your problem.
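
For example (container names and intervals are just illustrative):

```bash
# One watchtower per node, each scoped to a single container by name
# and polling on a different schedule so the update windows don't coincide.
for i in 1 2 3; do
  docker run -d --name watchtower$i \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e WATCHTOWER_POLL_INTERVAL=$(( 21600 * i )) \
    containrrr/watchtower storagenode$i
done
```

The staggered poll intervals (6h, 12h, 18h here) make simultaneous updates unlikely, though they don't strictly guarantee that two filewalkers never overlap.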

While this is true, not everyone intends to run all 3 nodes on the same “subnet”. Someone could, let's say, have 3 ISPs, or run a VPN for each node to get their nodes onto different subnets and, of course, try to get more data.

I'm not saying that is what is happening here; it is against the terms of service to do so anyway. So if your node does fail, it's all on you.

I don't believe this is correct. It is my understanding that they allot ingress based on the IPv4 address, so all your nodes would be treated like one giant node by the satellite (unless of course you have multiple external IP addresses).

You can read up on that in some other threads.
The short version: ingress is split by /24 subnets, and within each subnet a random node is selected. So if you hold 3 of the 4 nodes in your subnet, you get 75% of all ingress and your neighbour gets 25%.
(Basically, all nodes within a /24 subnet are considered “one node” for the initial ingress distribution. It doesn't matter whether the node is in your neighbour's house or 3 are in your own.)

That’s indeed my understanding too.
Feels a bit unfair to SNOs by the way, to be honest…

1 Like

Fascinating, file that under the TIL category. I guess it does simplify things on the satellite side; I do feel like treating each IP as separate would be a little better for network robustness, though?

No, this is done to keep data geographically distributed. You don't want your neighbourhood storing 10 out of 40 remaining pieces, because if your neighbourhood goes down due to a problem in the electrical grid, the file will be inaccessible.
So Storj Labs uses a /24 subnet distribution. Not perfect, but so far the best system anyone has come up with.

3 Likes