Node getting killed by OS

Hello,

I started my node project with a new identity roughly 33 hours ago, and “docker ps -a” gives me the following:

5fa7d6d84a4f
storjlabs/storagenode:latest
“/entrypoint”
Created: 31 hours ago
Uptime: 13 hours

So something was definitely wrong with its uptime, so I went ahead and checked the logs. Here is the output I managed to gather using “docker logs” and “docker events”:

For some odd reason the OS sent a signal to terminate the node:

2022-03-10T04:57:31.839Z INFO Got a signal from the OS: "terminated"
2022-03-10T04:57:46.846Z WARN servers service takes long to shutdown {"name": "server"}
2022-03-10T04:57:46.852Z INFO servers slow shutdown

I suppose this is what caused the upload failures below, though I'm not entirely sure:

2022-03-10T04:49:52.434Z ERROR piecestore failed to add bandwidth usage {"error": "bandwidthdb: database is locked", "errorVerbose": "bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Finally, the Docker event log shows the repeated kills:

2022-03-10T06:57:31.843806200+02:00 container kill 5fa7d6d84a4f29e9f7c057780571fda9a0937c79c49820ee36a97781dbfe9957
2022-03-10T06:58:14.601658100+02:00 container kill 5fa7d6d84a4f29e9f7c057780571fda9a0937c79c49820ee36a97781dbfe9957
2022-03-10T06:59:24.011262100+02:00 container kill 5fa7d6d84a4f29e9f7c057780571fda9a0937c79c49820ee36a97781dbfe9957
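
If it helps, I believe the container's final state (including whether it was OOM-killed) can be checked with standard docker inspect fields, something like:

docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.FinishedAt}}' 5fa7d6d84a4f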

I have kept the node stopped for now, and I would really appreciate it if someone could suggest how to troubleshoot this in depth, as it has happened more than twice and I can't seem to keep the node running normally.

Following the setup steps, watchtower was also started using the command from the manual. I'm not sure whether the difference in uptime below should bother me:

docker run -d --restart=always --name watchtower -v /var/run/docker.sock:/var/run/docker.sock storjlabs/watchtower NODE1 watchtower --stop-timeout 300s

89b7933297f5
storjlabs/watchtower
“/watchtower NODE1 w…”
15 hours ago
Up 13 hours

Looking forward to your kind response!
Regards!

Windows Docker, Synology, Linux or RPI?

Hello @naxbc,

I’m running Docker on Windows via the GUI - I work primarily in PowerShell 7, and my Docker version is 20.10.12, build e91ed57.

Let me know if you need anything else!

I’m not really an expert on this, but the first thing that came to mind was whether your HDDs are SMR rather than CMR, as that could trigger an I/O overload that kills the service for not responding… maybe nonsense.
Let’s wait for expert @Alexey :wink:

Hey @naxbc,

Thank you for the swift reply on this one.

I have configured RAID-0 with four 3.5" HDDs working in sync, for a total of 2.4TB, and honestly I have never thought about how they record data. I might need to check that, perhaps on their physical labels or online.

Could it be that my RAID setup is the problem here?

In the meantime, if anything else comes to mind - please shoot straight away.

RAID-0 itself is not really the issue; I operate 5 nodes totalling over 40TB, all on RAID-0, without any issue so far.
Maybe you can check if it’s SMR by the Part Number.
Also, are you aware if you have bad sectors on them?
Bear in mind the RAID-0 risk: if one HDD dies, the whole node is lost.
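
If you can get at the raw disks from a Linux shell (WSL may need extra setup to see USB drives), smartmontools will print the model string and rotation rate to look up - a sketch, assuming the disk shows up as /dev/sdb:

sudo smartctl -i /dev/sdb | grep -i -e "model" -e "rotation"   # /dev/sdb is just an example; check lsblk first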

Another issue with RAID is that the IO is capped at roughly what a single HDD can do; of course there are some bandwidth advantages and such.
But since every disk needs to perform an IO for the array to store one data stripe, it becomes a pretty simple calculation.

Adding to that, the array always follows the slowest disk.
So basically, if 3 of your 4 HDDs are 5400 rpm, the fourth will effectively “be” 5400 rpm too.


I haven’t used RAID0 in a very long time, but this doesn’t sound right. Depending on how the RAID is created and managed, and on how files are written, it could be possible to write to all the drives concurrently, though that would take serious tweaking. Even so, if you write to a 10k rpm drive instead of a 5.4k rpm drive, the speed should be bottlenecked by the RAID ‘controller’ or by the drive actually being written to or read from, not by another drive in the same RAID0.

I will admit this is the case for RAID1: since you’re mirroring data across drives, you are limited by the slowest.

Storj recommends using Docker Desktop CE v2.1.0.5 on Windows if your Windows doesn’t support WSL2 or it is not enabled. Not sure if this is your issue or not, so check if WSL2 is enabled.
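
A quick way to check from PowerShell (the VERSION column should read 2 for a WSL2-backed distro):

wsl --list --verbose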

Any reason you don’t want to use the native Storagenode GUI for windows?

I second the idea that RAID-0 is not ideal as one bad drive will kill the entire node. In my opinion it would be better to run independent nodes on each independent disk. If you did decide to go with the Storagenode GUI, you can only install one node with it and would need to use a workaround like Vadim’s Windows Toolbox to install additional nodes. However the GUI is still the most stable option for a node on Windows.

Hey,

Once more, I’m amazed by the community here:

@naxbc

I really don’t know if the RAID-0 can be the issue here. I will check the drive type later; I’m not at home at the moment. As for bad sectors, I’m not sure if it’s relevant, but I ran a paid tool to low-level format all 4 of my HDDs. They were previously used in a desktop PC, and they all have different write speeds. Otherwise, yes, I chose RAID-0 because I wanted to combine all 4 of my drives and run them through a single port of my PC. I use an Orico HDD hub - too bad the data transfer is over USB 3.0.

@SGC

Can you clarify a bit about the I/O when running RAID (or specifically RAID-0)?

@Stob

Thank you for this explanation. Perhaps I need to switch to a more suitable RAID mode, or simply run 4 nodes, each using one of the 4 HDDs - hopefully I can just use 4 identities and keep the same network for all 4.

@baker

Yes, I remember that I ran an upgrade - the Docker application on my Windows machine currently has WSL2 enabled, and I use Ubuntu 20.x as my distro.

Can you tell me a bit more about this “Storagenode GUI for Windows” or any guide for it?

Thanks guys!

Of course. The documentation can be found here:

If you do decide to migrate, note that the docker/linux versions require the path to include the folder called storage whereas the GUI on a new install does not, so getting paths correct is important. There is a guide for this here:
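
As a rough illustration of the difference (example paths only, not your actual layout):

# docker node:   D:\NODE1\storage\   <- data must sit under a folder named "storage"
# GUI install:   D:\NODE1\           <- the chosen folder is used directly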

In terms of IO, all RAIDs basically act the same: the data is striped across all drives of the array, and thus each disk performs one or more IOs for each RAID stripe written or read.

The stripe size also plays a role, which is why I said one or more IOs per stripe.
Let’s use your disks as an example: a sector on most HDDs today is 4k, and it takes 1 IO to write a sector.
So a stripe size of, say, 64k means writing 16k to each of your 4 disks, for a total of 64k per full-stripe write… of course those 16k writes are 4 IOs on each disk, but sequential IO is pretty fast.
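
To put those numbers together, a quick back-of-the-envelope check in shell arithmetic (example figures from above):

echo $(( 64 / 4 ))   # 16 KiB written to each of the 4 disks per 64 KiB stripe
echo $(( 16 / 4 ))   # 4 sequential 4 KiB-sector IOs per disk per stripe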

Storage nodes do a lot of writes, so mitigating that helps a lot; RAM on your RAID controller or a write cache will do wonders for performance.

Splitting the disks up would be a much better approach, as you get the full read/write IO from each disk. RAID isn’t great for storage nodes, especially if you don’t have a write cache.

What stripe size are you running?

Basically, RAID makes HDDs act as one… which comes with some disadvantages, especially on the IO side of things.

Heya, apologies for the delay.

@baker - I may consider setting up the node that way and try the GUI instead. I previously shut down my node, so I will just start anew and see how that goes.

@SGC

Thank you for the information - I believe the stripe size is 64 KB. Following your suggestion about splitting the disks up: would another RAID mode be an option, or do you suggest simply running nodes independently on separate disks?

One question about that: can I have 4 nodes, one for each of my disks, running on the same computer? Is that recommended? In another thread I believe I was told that I would need a separate network IP for each node.

Kind regards,
Rumen A.

You can have as many nodes as you like; their behavior / ingress will depend on their global IPs, since the network distributes data by /24 subnet.

So basically, if you set nodes up on one or more IPs within the same /24 subnet, they will all get data as if they were one node. If each node is routed through an IP in a different /24, they will behave as multiple independent nodes.

Running 1 node on 1 disk is the recommended approach, and should be fine.
Of course RAID can give you one-disk redundancy to protect against disk failures, but that is really only worthwhile for older nodes, since it can take years for a node to grow to a decent size.

At that point it’s nice to know the node won’t die just because an HDD dies,
but for new nodes, running RAID is impractical.

And since your disks seem to have trouble keeping up when running in RAID, I would recommend that you create 4 nodes, one for each disk. You don’t have to create them all right away;
you can simply add new nodes as the existing ones fill up.

Given the size of your disks, you would be fine with 1 or 2 IP addresses.
At the current pace, 2.4TB will most likely take about a year to fill, maybe even less.

I have nodes from December that are at 400GB stored, so that’s roughly 100GB per month, though ingress is reduced early on while a node is being vetted.
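
For what it’s worth, adding a second node later is basically the standard docker run command with a new identity folder, its own disk, and a different external port - a rough sketch with placeholder values (not your actual wallet, address, or paths):

docker run -d --restart unless-stopped --stop-timeout 300 -p 28968:28967/tcp -p 28968:28967/udp -p 127.0.0.1:14003:14002 -e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" -e EMAIL="user@example.com" -e ADDRESS="domain.ddns.net:28968" -e STORAGE="900GB" --mount type=bind,source="<identity-dir-2>",destination=/app/identity --mount type=bind,source="<disk-2-dir>",destination=/app/config --name storagenode2 storjlabs/storagenode:latest

Note the changed host ports (28968 and 14003) and the matching ADDRESS; the container-internal ports stay the same.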

Hey @SGC

Appreciate the swiftness on this one!

And what exactly happens when the assigned capacity of a given node is filled?

Also, if I run 4 nodes on 4 disks, would the traffic be handled properly by only 1 IP address? I’m on a 100/100 Mbps network plan.

4 nodes on 4 different disks behind one public IP will act like a RAID at the network level - they will spread the shared traffic between them. In simple words, the traffic would be the same as if there were only one node.
Each node will be used only a small amount if you start them all at once.
So the best approach is to run only one node; when it is filled up, or at least vetted, start the next one.

When the allocated space is fully used, ingress traffic to this node will stop, but egress will continue. Customers can decide to remove their data; your node would then have free space again, and ingress would resume.


Hello @Alexey,

Understood, thank you for all your help!

Just a quick follow up:

I started anew with my node once again, but upon running the initial setup I get:

docker run --rm -e SETUP="true" --mount type=bind,source="/mnt/c/Users/User/Desktop/Identity-Host1/NODE1",destination=/app/identity --mount type=bind,source="/mnt/d/NODE1",destination=/app/config --name NODE1 storjlabs/storagenode:latest
2022-03-20T10:48:57.428Z        INFO    Configuration loaded    {"Location": "/app/config/config.yaml"}

Error: storagenode configuration already exists (/app/config)

How do I go about removing this (/app/config) so I can deploy the node?

I went through several threads on the matter; nothing seems to work for me. I’m starting the node through WSL 2. After formatting my drives I had to mount the new drive manually (for some strange reason WSL was not detecting it). Having done that, listing it shows:

D:              932G  132M  932G   1% /mnt/d
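
For anyone hitting the same thing, the manual mount was along these lines (drvfs is WSL’s driver for Windows drives):

sudo mkdir -p /mnt/d
sudo mount -t drvfs D: /mnt/d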

I’m using SSH from my MacBook to connect to my Windows home PC, which will be running the node - I’m away from home sometimes, so this is a good way to access the machine. I’m not sure whether the node can only be set up from the machine itself, but it shouldn’t matter, right?

Let me know if anything else as information is required.

Regards,
Rumen A.

The SETUP="true" step is a once-only thing… it creates a new storage node.
Basically it’s there to ensure you cannot overwrite an old storage node by accident.

If you moved your storage node, you simply start it from the new location, using the command from the “running the storage node” section in the URL below.

It looks something like this:
docker run -d --restart unless-stopped --stop-timeout 300 -p 28967:28967/tcp -p 28967:28967/udp -p 127.0.0.1:14002:14002 -e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" -e EMAIL="user@example.com" -e ADDRESS="domain.ddns.net:28967" -e STORAGE="2TB" --mount type=bind,source="<identity-dir>",destination=/app/identity --mount type=bind,source="<storage-dir>",destination=/app/config --name storagenode storjlabs/storagenode:latest

So basically you have to decide whether to keep the old one… or delete it before reusing the same location.
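
If you decide to start over, clearing the old location first would look something like this - destructive, so only run it if you’re sure you don’t want the old node (paths taken from your setup command above):

docker rm -f NODE1        # remove the old container if it still exists
rm -rf /mnt/d/NODE1/*     # wipes the old config AND any stored data - unrecoverable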

If the node already has some decent age, I would recommend trying to keep it;
otherwise it doesn’t really matter too much.
It should work just fine even after days of downtime…
as long as you have the identity and the stored data.
