Your Node is Suspended, but nothing obviously wrong?

Your Storage Node on the us-central-1 Satellite was suspended because it produced errors too often during audits.

You were suspended on 2022-02-28 at 00:26 UTC.

Your Storage Node on the europe-west-1 Satellite was suspended because it produced errors too often during audits.

You were suspended on 2022-02-27 at 22:35 UTC.

I can’t imagine what the reason could be. My node has been running without problems for years.

Thank you for your help.

your suspension score looks fine now, if that is a current node dashboard.
might need to check your logs if you want to find out exactly what happened…

but i do know there has been a lot of brief suspensions going around lately…

however i doubt it happened without some sort of issue… but it does seem gone now

you can find a lot of good suggestions for how to check your logs here.

I got a suspension too on US1, then I updated Linux and restarted the machine. After that I got suspension notifications for 2 additional nodes. But until the restart everything looked fine on the dashboard. So the dashboard sometimes doesn’t show suspension information.

In my logs there was only one error: “ERROR piecestore upload rejected, too many requests”. So I changed the config.yaml (first stop the node, then change it, then restart the node):

# Maximum number of simultaneous transfers
storage2.max-concurrent-requests: 200

My node was set to 40. I think that after an update the max requests was automatically set to 40 and caused this error. Everything worked fine for years for me too. I think a lot of people are facing this issue. You have to watch the CPU load, but in my case my PC has enough HDD and CPU power to handle a lot of requests.
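The stop / edit / restart procedure above can be sketched as follows. The container name `storagenode` and the config path are assumptions; adjust them to your own setup. The edit step is demonstrated on a scratch copy so it can be tried safely:

```shell
# 1. Stop the node first so the config change is picked up cleanly
#    (assumed container name "storagenode"):
# docker stop -t 300 storagenode

# 2. Edit config.yaml -- demonstrated here on a scratch copy with the
#    old value of 40:
cat > /tmp/config.yaml <<'EOF'
# Maximum number of simultaneous transfers
storage2.max-concurrent-requests: 40
EOF

# Raise the limit from 40 to 200 (0 would mean unlimited):
sed -i 's/^storage2.max-concurrent-requests:.*/storage2.max-concurrent-requests: 200/' /tmp/config.yaml

# Confirm the change took effect:
grep 'max-concurrent-requests' /tmp/config.yaml

# 3. Restart the node:
# docker start storagenode
```

For a real node, run the `sed` against the actual config.yaml in your storage node's config directory instead of `/tmp/config.yaml`.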

And now, after 10 minutes of the node running, my suspension level went from 95 % back up to 100 %. Very strange behaviour for the dashboard. I think the dashboard is not showing the correct information.

Where do you find this information? I use Storj in Docker.

generally i think config.yaml
storage2.max-concurrent-requests: 0
should be at 0 for unlimited…
dunno if they changed that default / recommendation… it’s possible
config.yaml is in the storj folder

the max concurrent setting can be used to help a node run better, but it does also come with some limitations.

you should rarely run into the limit… and 200 should be more than plenty, but there has been a lot of traffic recently, so who knows…

i used max concurrent early on because my storage couldn’t keep up, but these days i just run at 0

This entry is commented out

# storage2.max-concurrent-requests: 6

seems a lot of people have had suspension issues in recent days… so long as your suspension score jumps back to 90%+ fast, i doubt there’s anything really wrong…

maybe the internet is messed up because of the whole ukraine war thing, and that’s what is causing it…

seems fine, keep an eye on if it keeps happening or keeps getting worse…
but i wouldn’t worry.

i have UptimeRobot but there was no warning there, and on storjnet.info nothing was down either…

Same thing here… strange

My node was also suspended for a while on US1.
Now it’s 100%.

Ok, both satellites are now back to 100%.



Something happened again?

now, after 4h, it’s back up again?


is it a storj problem or on my side?

The suspension score and audit score can be affected not only by GET_AUDIT, but also by GET_REPAIR.
Please search your logs for errors related to GET_AUDIT or GET_REPAIR.
See https://support.storj.io/hc/en-us/articles/360042257912-Suspension-mode

There are a few differences between GET_AUDIT and GET_REPAIR. GET_AUDIT requests only a small amount of data, just enough to check the node. GET_REPAIR is a full download of the requested piece. If your node returns some unknown error instead of the piece, the suspension score will be affected. If the error is a known one, i.e. “file not found”, “disk i/o”, etc., the audit score will be affected.
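The log search Alexey suggests might look like the sketch below. The `docker` command assumes a container named `storagenode` (an assumption; use your own name); the sample log file is only there so the filter can be demonstrated offline:

```shell
# On a live docker node (assumed container name "storagenode"):
# docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep ERROR

# Offline demonstration on sample log lines -- the first line should
# match (a GET_REPAIR error), the second should not (successful GET):
cat > /tmp/node.log <<'EOF'
2022-03-06T20:58:50.273Z ERROR piecestore download failed {"Action": "GET_REPAIR", "error": "pieces error"}
2022-03-06T20:59:01.000Z INFO piecestore downloaded {"Action": "GET"}
EOF

# Keep only audit/repair traffic, then keep only the errors:
grep -E "GET_AUDIT|GET_REPAIR" /tmp/node.log | grep ERROR
```

Unknown errors surfaced this way are the ones that move the suspension score; known errors like “file not found” move the audit score instead, per the explanation above.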

Hello. Today I received an email saying: Your Storage Node on the europe-west-1 Satellite was suspended because it produced errors too often during audits.

On the node’s dashboard I saw the same. But now (a few hours later), I don’t see this warning on my node dashboard. It’s gone, and it looks like I don’t see a europe-west-1 satellite :face_with_raised_eyebrow:

feels like I’ve missed something :smiley:

It’s the eu1 satellite. And quick suspensions and quick unsuspensions are happening quite often these days - we increased the repair threshold, so if your node answers with an unknown error instead of the piece requested by GET_REPAIR, your node could be suspended.
As soon as your node is able to answer GET_AUDIT and GET_REPAIR without errors, it will come out of suspension with each successful transfer.


I just faced the same problem I think:

  • Got the email saying “Your Storage Node on the europe-west-1 Satellite was suspended because it produced errors too often during audits”, during last night.
    How are we supposed to know it corresponds to "eu1.storj.io"? Shouldn’t the notification tell us the satellite URL so we can find it on our nodes’ dashboards?

  • I woke up this morning, 4h later, and the node’s dashboard shows no sign of suspension:

  • Checked the log, but it contains absolutely nothing for April (logs are redirected to files):

    $ grep -E "GET_AUDIT|GET_REPAIR" node.log | grep  failed
    $
    

I do have errors for March though, that happened a few times:

2022-03-06T20:58:50.273Z	ERROR	piecestore	download failed	{"Piece ID": "GAND7A7SG6KUR7JOFJEHXJKCUO6PZOY36XO3UXRC6RO4JY7RQX2Q", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_REPAIR", "error": "pieces error: PieceHeader framing field claims impossible size of 9242 bytes", "errorVerbose": "pieces error: PieceHeader framing field claims impossible size of 9242 bytes\n\tstorj.io/storj/storagenode/pieces.(*Reader).GetPieceHeader:275\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:474\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
[...]
2022-03-23T16:34:12.971Z	ERROR	piecestore	download failed	{"Piece ID": "5FQXOQD5MWSENXSABDOQ4JAIAO4LSXOFPEMUPZ5EJQNSSM5TXYBA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR", "error": "used serial already exists in store", "errorVerbose": "used serial already exists in store\n\tstorj.io/storj/storagenode/piecestore/usedserials.insertSerial:263\n\tstorj.io/storj/storagenode/piecestore/usedserials.(*Table).Add:117\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

The sure thing is that my HDD is having a hard time these days: its I/O is constantly around 100%, so it doesn’t always respond correctly or in time to requests. I did stop the node to run a full check with fsck and badblocks on it, and everything seems fine.
It is an SMR drive, so maybe it simply cannot keep up with these days’ Storj activity, I’m not sure.

Am I right to consider that my Node is (kinda) fine?

Hi @Pac
The dashboard screenshot looks as if everything is fine now. As you can’t see the issue in the node logs it could be worth checking the node API which pulls the audit history from the satellites - http://localhost:14002/api/sno/satellite/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs


This kinda looks like the same issue that I am having, but it recovers before it gets suspended. Topic is here.


@Stob Thx!

It does return some missing online counts in the audit history (which makes sense, as I did take it offline a few times to run disk checks), but apart from that, I don’t see any evident issue:

{"id":"12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs","storageDaily":[{"atRestTotal":8372248384572.656,"intervalStart":"2022-04-01T00:00:00Z"},{"atRestTotal":7388102468921.201,"intervalStart":"2022-04-02T00:00:00Z"},{"atRestTotal":8112510497488.024,"intervalStart":"2022-04-03T00:00:00Z"},{"atRestTotal":7143899182297.52,"intervalStart":"2022-04-04T00:00:00Z"},{"atRestTotal":5022002692049.323,"intervalStart":"2022-04-05T00:00:00Z"},{"atRestTotal":6502203348307.8125,"intervalStart":"2022-04-06T00:00:00Z"}],"bandwidthDaily":[{"egress":{"repair":1616619264,"audit":7936,"usage":482670336},"ingress":{"repair":852674560,"usage":1289124608},"delete":0,"intervalStart":"2022-04-01T00:00:00Z"},{"egress":{"repair":1825188352,"audit":9216,"usage":299407360},"ingress":{"repair":1137851904,"usage":1455997952},"delete":0,"intervalStart":"2022-04-02T00:00:00Z"},{"egress":{"repair":978925824,"audit":6144,"usage":123651328},"ingress":{"repair":574622976,"usage":969064704},"delete":0,"intervalStart":"2022-04-03T00:00:00Z"},{"egress":{"repair":361095424,"audit":2048,"usage":88493824},"ingress":{"repair":340603392,"usage":603735552},"delete":0,"intervalStart":"2022-04-04T00:00:00Z"},{"egress":{"repair":666686976,"audit":7424,"usage":153439488},"ingress":{"repair":571004160,"usage":1116277248},"delete":0,"intervalStart":"2022-04-05T00:00:00Z"},{"egress":{"repair":381833472,"audit":1792,"usage":51812352},"ingress":{"repair":269208064,"usage":453984768},"delete":0,"intervalStart":"2022-04-06T00:00:00Z"}],"storageSummary":42540966573636.54,"bandwidthSummary":16664008448,"egressSummary":7029858560,"ingressSummary":9634149888,"currentStorageUsed":312524718592,"audits":{"auditScore":1,"suspensionScore":0.9999888372680281,"onlineScore":0.9679163340182094,"satelliteName":"eu1.storj.io:7777"},"auditHistory":{"score":0.9679163340182094,"windows":[{"windowStart":"2022-03-07T00:00:00Z","totalCount":800,"onlineCount":800},{"windowStart":"2022-03-07T12:00:00Z","totalCount":992,"onlineCount":992},{
"windowStart":"2022-03-08T00:00:00Z","totalCount":945,"onlineCount":945},{"windowStart":"2022-03-08T12:00:00Z","totalCount":1839,"onlineCount":1839},{"windowStart":"2022-03-09T00:00:00Z","totalCount":2101,"onlineCount":2101},{"windowStart":"2022-03-09T12:00:00Z","totalCount":2068,"onlineCount":2068},{"windowStart":"2022-03-10T00:00:00Z","totalCount":1980,"onlineCount":1980},{"windowStart":"2022-03-10T12:00:00Z","totalCount":1946,"onlineCount":1946},{"windowStart":"2022-03-11T00:00:00Z","totalCount":1519,"onlineCount":1519},{"windowStart":"2022-03-11T12:00:00Z","totalCount":1688,"onlineCount":1688},{"windowStart":"2022-03-12T00:00:00Z","totalCount":1783,"onlineCount":1783},{"windowStart":"2022-03-12T12:00:00Z","totalCount":1889,"onlineCount":1889},{"windowStart":"2022-03-13T00:00:00Z","totalCount":1832,"onlineCount":1832},{"windowStart":"2022-03-13T12:00:00Z","totalCount":1644,"onlineCount":1644},{"windowStart":"2022-03-14T00:00:00Z","totalCount":1506,"onlineCount":1506},{"windowStart":"2022-03-14T12:00:00Z","totalCount":1546,"onlineCount":1546},{"windowStart":"2022-03-15T00:00:00Z","totalCount":1939,"onlineCount":1939},{"windowStart":"2022-03-15T12:00:00Z","totalCount":1542,"onlineCount":1542},{"windowStart":"2022-03-16T00:00:00Z","totalCount":1847,"onlineCount":1847},{"windowStart":"2022-03-16T12:00:00Z","totalCount":1880,"onlineCount":1880},{"windowStart":"2022-03-17T00:00:00Z","totalCount":2134,"onlineCount":2134},{"windowStart":"2022-03-17T12:00:00Z","totalCount":1964,"onlineCount":1960},{"windowStart":"2022-03-18T00:00:00Z","totalCount":1725,"onlineCount":1725},{"windowStart":"2022-03-18T12:00:00Z","totalCount":1796,"onlineCount":1796},{"windowStart":"2022-03-19T00:00:00Z","totalCount":1854,"onlineCount":1854},{"windowStart":"2022-03-19T12:00:00Z","totalCount":1964,"onlineCount":1964},{"windowStart":"2022-03-20T00:00:00Z","totalCount":1867,"onlineCount":1867},{"windowStart":"2022-03-20T12:00:00Z","totalCount":1865,"onlineCount":1865},{"windowStart":"2022-03-21T
00:00:00Z","totalCount":1677,"onlineCount":1677},{"windowStart":"2022-03-21T12:00:00Z","totalCount":1840,"onlineCount":1840},{"windowStart":"2022-03-22T00:00:00Z","totalCount":1654,"onlineCount":1654},{"windowStart":"2022-03-22T12:00:00Z","totalCount":1709,"onlineCount":1709},{"windowStart":"2022-03-23T00:00:00Z","totalCount":1853,"onlineCount":1853},{"windowStart":"2022-03-23T12:00:00Z","totalCount":1956,"onlineCount":1956},{"windowStart":"2022-03-24T00:00:00Z","totalCount":1967,"onlineCount":1967},{"windowStart":"2022-03-24T12:00:00Z","totalCount":1848,"onlineCount":1848},{"windowStart":"2022-03-25T00:00:00Z","totalCount":1805,"onlineCount":1805},{"windowStart":"2022-03-25T12:00:00Z","totalCount":1844,"onlineCount":1844},{"windowStart":"2022-03-26T00:00:00Z","totalCount":1732,"onlineCount":1732},{"windowStart":"2022-03-26T12:00:00Z","totalCount":1745,"onlineCount":1745},{"windowStart":"2022-03-27T00:00:00Z","totalCount":1699,"onlineCount":1699},{"windowStart":"2022-03-27T12:00:00Z","totalCount":1612,"onlineCount":1612},{"windowStart":"2022-03-28T00:00:00Z","totalCount":1735,"onlineCount":1735},{"windowStart":"2022-03-28T12:00:00Z","totalCount":1539,"onlineCount":1539},{"windowStart":"2022-03-29T00:00:00Z","totalCount":1807,"onlineCount":1807},{"windowStart":"2022-03-29T12:00:00Z","totalCount":1910,"onlineCount":1910},{"windowStart":"2022-03-30T00:00:00Z","totalCount":1858,"onlineCount":1858},{"windowStart":"2022-03-30T12:00:00Z","totalCount":1546,"onlineCount":1546},{"windowStart":"2022-03-31T00:00:00Z","totalCount":1870,"onlineCount":1870},{"windowStart":"2022-03-31T12:00:00Z","totalCount":1954,"onlineCount":1954},{"windowStart":"2022-04-01T00:00:00Z","totalCount":1683,"onlineCount":1678},{"windowStart":"2022-04-01T12:00:00Z","totalCount":1727,"onlineCount":1727},{"windowStart":"2022-04-02T00:00:00Z","totalCount":1855,"onlineCount":1822},{"windowStart":"2022-04-02T12:00:00Z","totalCount":1932,"onlineCount":1832},{"windowStart":"2022-04-03T00:00:00Z","totalCount":
1876,"onlineCount":1876},{"windowStart":"2022-04-03T12:00:00Z","totalCount":614,"onlineCount":98},{"windowStart":"2022-04-04T00:00:00Z","totalCount":14,"onlineCount":0},{"windowStart":"2022-04-04T12:00:00Z","totalCount":695,"onlineCount":688},{"windowStart":"2022-04-05T00:00:00Z","totalCount":559,"onlineCount":559},{"windowStart":"2022-04-05T12:00:00Z","totalCount":804,"onlineCount":804},{"windowStart":"2022-04-06T00:00:00Z","totalCount":552,"onlineCount":552}]},"priceModel":{"EgressBandwidth":2000,"RepairBandwidth":1000,"AuditBandwidth":1000,"DiskSpace":150},"nodeJoinedAt":"2020-08-11T15:01:06.336434Z"}

Well, hopefully it was just a temporary suspension ^^