Node DQ on satellite - upload rejected, too many requests

Hello,

I can't find anything about this error: upload rejected, too many requests

I'm trying to find out why it's happening, and why my node was DQed. The node wasn't reacting to the stop command…

I'm running Docker on Windows 10 Pro.

Thanks

upload rejected, too many requests
happens when the number of concurrent requests exceeds the limit set in config.yaml.

limiting max concurrent requests shouldn't make the node unresponsive, nor get it DQed
you might have some other issue with the node…
but you could try increasing the max concurrent limit, or just set it to 0 for unlimited
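For reference, the setting being discussed lives in the node's config.yaml. A minimal sketch — the key name `storage2.max-concurrent-requests` is what recent storagenode builds use, but verify it against the comments in your own config file:

```yaml
# config.yaml (storagenode) — assumed key name, check your own file
# Maximum number of simultaneous uploads the node will accept.
# Set to 0 for unlimited; requests beyond the limit are rejected
# with "upload rejected, too many requests".
storage2.max-concurrent-requests: 10
```

After changing it, the node has to be restarted for the value to take effect.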

there were some stuck pieces, I think. It's not possible that the hardware couldn't handle it, and I never touched the config…

2020-06-10T22:02:17.274Z ERROR piecestore upload rejected, too many requests {"live requests": 6678, "requestLimit": 10}

yeah, getting 6678 live requests is pretty insane… the node / system / HDD or so must have stalled, making the node unresponsive…

I've had my node become unresponsive at times, usually when I do crazy stuff to my system, like removing my L2ARC drive with the wrong command… everything slows to a crawl, and though I can execute commands against the node and use the docker logs command, the node refuses to shut down… it ended with me forcing a hard reboot, or maybe I pulled the plug… I forget…

not really relevant in your case, but it seems similar… you should check that the node's HDD isn't having problems.

You should remove this option from the config or comment it out.
Disqualification is currently possible only by failing too many audits.
To fail an audit, your node must be online and answer the audit request, but be unable to provide the requested piece, either because it doesn't exist on your node (deleted by the operator, deliberately or accidentally) or because the node doesn't have access to the piece.
In case of timeouts, the node will be placed into containment mode: it will stop being selected for uploads and will be asked for the same piece three more times. If it's still unable to answer, the audit is considered failed.
Too many failed audits in a row will lead to disqualification.
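The containment flow described above can be sketched roughly like this. This is a simplified model of the behavior, not the satellite's actual code; the names (`audit_piece`, `has_piece`, `MAX_RETRIES`) are illustrative:

```python
# Simplified model of the audit containment flow described above.
# `has_piece(piece_id)` returns "ok", "missing", or "timeout" and stands in
# for whatever the node actually does; all names here are illustrative.

MAX_RETRIES = 3  # the contained node is re-asked for the same piece 3 more times

def audit_piece(has_piece, piece_id):
    """Return 'pass' or 'fail' for one audited piece."""
    result = has_piece(piece_id)
    if result == "ok":
        return "pass"
    if result == "missing":
        # piece deleted or inaccessible: immediate audit failure
        return "fail"
    # result == "timeout": containment mode — retry the same piece
    for _ in range(MAX_RETRIES):
        result = has_piece(piece_id)
        if result == "ok":
            return "pass"
        if result == "missing":
            return "fail"
    return "fail"  # still timing out after all retries

# A stalled node that always times out ends up failing the audit:
print(audit_piece(lambda _: "timeout", "piece-1"))  # fail
```

This is why a stalled system is so dangerous: the node stays online and keeps answering audit requests, but times out on every piece, burning through the retries and racking up failed audits.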

Could you please share your full docker run command?

actually, when the computer stalls, the node keeps running and just reports that it doesn't have access to anything… no amount of settings will fix that, since it's a coding issue… and it's most likely a hardware or SMR HDD issue that caused the system to stall… it usually is, when I stall mine anyway, lol

should be pretty easy to replicate… you just take a ZFS system and remove the cache drives without telling ZFS they're cache devices, and that will stall the entire system…

then the node will not respond to shutdown commands, even though docker will still show the logs… and it's just screaming in horror in the logs xD

rebooting the system also stalls out… (by that I mean the Debian Linux I run will still let me use the terminal in this state, but a reboot will never complete, or takes like an hour from when the shutdown or reboot command is given)

it's a pretty easy way to kill a node… because the node essentially keeps running, but doesn't have access to anything, so it just keeps failing at everything

my docker run command:

docker run -d --restart unless-stopped -p 28966:28967 -p 127.0.0.1:14003:14002 -e WALLET="0xXXXX" -e EMAIL="XXX@gmail.com" -e ADDRESS="XXX" -e BANDWIDTH="100TB" -e STORAGE="2.5TB" --mount type=bind,source="R:\storagenode_00\00_Identity\storagenode",destination=/app/identity --mount type=bind,source="R:\storagenode_00\00_Storj",destination=/app/config --name storagenode_00 storjlabs/storagenode:beta

I think it basically looks like mine, aside from a few minor, irrelevant differences…

what kind of rig are you running… host system / storage + interface?

I doubt the behavior you have seen is anything other than a system stall, a hardware stall or fault, or a similar iowait/IO-failure cause

meanwhile, two other satellites were disqualified: one on the same drive, the second on another drive/node

I've got a Celeron J4105 (4×1.5 GHz, 2.4 GHz burst), 16 GB RAM, and NOT-SMR drives on SATA

Why do you run 2 nodes on the same hard disk?

generally you only want one storage node per drive… a storage node takes up a good deal of IOPS, which can put real strain on an HDD… of course this supposedly becomes quite different when running multiple nodes behind one IP, because the data allotment by satellites is based on IP… and thus the load should be distributed among the nodes, which in most cases should ease the pressure on each drive…

I would recommend you limit yourself to one node per drive… and then see what happens… I'm sure others would like to take a look at some of the logs to try to identify the problem…

but it wouldn't surprise me if it was due to something like this… node restarts in particular can be rough on the drives, taking vast amounts of IO for extended periods… so rebooting two nodes at the same time which live on the same drive might cause excessive latency…

if this happens on a regular basis, it could kill your nodes… so maybe you want to search through some logs and see if you get the

upload rejected, too many requests

log entry often. If so, does it often come with those immensely high numbers? Because that could indicate an IO issue… this should mainly happen on nodes sharing a drive with other nodes.
that's how iowait can stall your entire system: the CPU doesn't like it when it loses, or almost loses, contact with a drive on an internal bus… the SATA bus is a critical component, and if it says wait, the entire system might wait for it, depending on how many SATA controllers you have and which controller your OS is running from.
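One way to check whether iowait is the culprit is to compare the iowait counter in /proc/stat between two samples. A minimal sketch — the field order follows the Linux proc(5) man page, and the sample lines below are invented for illustration:

```python
# Compute the iowait share of CPU time between two /proc/stat "cpu" samples.
# Field order after "cpu" is: user nice system idle iowait irq softirq ...
# (per proc(5)); iowait is therefore index 4 of the numeric fields.

def iowait_percent(before_line, after_line):
    """Percentage of elapsed CPU ticks spent in iowait between two samples."""
    before = [int(x) for x in before_line.split()[1:]]
    after = [int(x) for x in after_line.split()[1:]]
    deltas = [a - b for b, a in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

# Invented samples: 100 ticks elapsed, 50 of them spent in iowait.
s1 = "cpu 1000 0 500 2000 300 0 0 0 0 0"
s2 = "cpu 1020 0 510 2020 350 0 0 0 0 0"
print(round(iowait_percent(s1, s2), 1))  # 50.0
```

On a live Linux box you would read the first line of /proc/stat twice, a second or so apart; a sustained high percentage while the node misbehaves points at exactly the kind of IO stall described above. (On the Windows/Docker setup in this thread, Task Manager's per-disk "Active time" tells a similar story.)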

edited: satellites; I've got 3 nodes, each on its own drive

sorry for the satellite/node misunderstanding. As I corrected myself: each of the 3 nodes (it was 4 before, but one died from a database problem) runs on a separate drive. I had no problems before, even when CPU usage was higher. I've got 2 SATA controllers; 4 drives run from the onboard controller, including the OS drive.

do you have any sort of system monitoring?
I'm using a program called netdata… it's not the greatest, but it's very detailed and easy to use…
it can be a great help in finding problems…

if all the nodes are on separate hard drives and they all seem to have problems… then that seems to indicate it's not related to the hard drives themselves, nor the cables… which kind of leaves us with memory or general system instability / latency

try installing netdata; that should help you a lot in finding problems, now or in the future.

seeing what it is doing now, I think I can blame Docker for all the trouble. It seems really unstable. My problems started just about 10 days ago, after almost a year of trouble-free service.

running a command now, I get this, for the second time this evening:
Error response from daemon: open \\.\pipe\docker_engine_linux: The system cannot find the file specified.

that may be it… have you updated it recently? Maybe it's a bad version, or you need to update…
it might also be worth running through whatever else needs to be updated, just to be sure it's not some weird version-mismatch thing…
of course, start with Docker and see if that helps

I would like to suggest you downgrade Docker Desktop for Windows:


I'll try to continue in this thread…

the satellites on my node were disqualified almost 10 days ago and their data is still on the HDD. How can I get rid of the dead satellites' data? I want to keep the last remaining satellite, so I can't just delete all the data.

Thanks for the help
