Ran out or Ram / Swapping - Almost 24 hours ago ... on Synology

tankmann · December 15, 2019, 7:57am

Hey guys,

I’ve submitted a ticket already but was wondering if I’m the only one having that issue.
Just came downstairs wanting to check the dashboard and it showed paused on one node but didnt open properly. Then checked the Synology and realized something is off, almost 24 hours ago Ram usage spiked and it began to swap plus traffic went down… took the whole night to fill until it swapped and system almost stalled.

I stopped the node even though almost not responding, after couple of minutes swap went down but some was still there so I rebooted the Synology, came back nicely and now everything is good again. Never had any mem issues before so could be the latest update … But I also have nothing else running on the Synology beside timemachine backups and none of my computers did it overnight…
Logfiles also look good and no audit issues or anything to see.
Anyone else seeing weird behaviour around that time?

As said opened a ticket to get unblocked on the us and asia node but wanted to see if I’m the only one seeing that behaviour during that time.

Thanks!

Alexey · December 15, 2019, 9:34am

Please, publish your audit score:

Also, please, search for errors in your logs.

Your node can fail audits in a few cases:

your node managed to lost customers data;
your node is unable to send an audit response three more times in the containment mode.

tankmann · December 15, 2019, 10:11am

Thanks for the quick reply. So used the query linked and for the last 3 days it shows no audit issues.

Fetching satellite audits stat information. Please wait…
Sat ID: :
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 126

As also replied on the ticket I think there’s a big difference between:

Loosing data due to hardware or these issues (actually loosing data)
Being offline (which for me is the equivalent of in my case the node was still running but due to swapping etc. it became more or less unresponsive).

That’s also why my monitoring script didnt send any alarm (as process was still running).
I’ve now setup alarms for swap etc.

Also - disqualification is permanent now - I do understand in general for data loss but not for ‘offline’, but I know this is being discussed in the other threads as well (why disqualify a node if data is fully there but home DSL connection as an example was off for some hours)…

tankmann · December 15, 2019, 10:17am

Dashboard says this for on satellite:

And the other satellite:

tankmann · December 15, 2019, 10:25am

Oh and now with the API calls:

curl -s 127.0.0.1:14002/api/dashboard | jq .data.satellites
[
  {
    "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
    "url": "europe-west-1.tardigrade.io:7777",
    "disqualified": null
  },
  {
    "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
    "url": "us-central-1.tardigrade.io:7777",
    "disqualified": "2019-12-15T06:13:43.091462Z"
  },
  {
    "id": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
    "url": "satellite.stefan-benten.de:7777",
    "disqualified": null
  },
  {
    "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
    "url": "asia-east-1.tardigrade.io:7777",
    "disqualified": "2019-12-15T02:12:28.208109Z"

And here in details for the two disqualified ones:

tankmann@Gynology:/volume1/Storj$ curl -s 127.0.0.1:14002/api/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S | jq .data.audit
{
  "totalCount": 14187,
  "successCount": 13669,
  "alpha": 11.974738784767581,
  "beta": 8.02526121523242,
  "score": 0.598736939238379
}
tankmann@Gynology:/volume1/Storj$ curl -s 127.0.0.1:14002/api/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S | jq .data.uptime
{
  "totalCount": 101404,
  "successCount": 100544,
  "alpha": 99.99999998313228,
  "beta": 1.6867607852660488e-08,
  "score": 0.9999999998313239
}
tankmann@Gynology:/volume1/Storj$ curl -s 127.0.0.1:14002/api/satellite/121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 | jq .data.audit
{
  "totalCount": 345,
  "successCount": 318,
  "alpha": 11.974770198760325,
  "beta": 8.02526121523242,
  "score": 0.5987375695011331
}
tankmann@Gynology:/volume1/Storj$ curl -s 127.0.0.1:14002/api/satellite/121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 | jq .data.uptime
{
  "totalCount": 48130,
  "successCount": 47364,
  "alpha": 99.99999999999734,
  "beta": 2.5259776519562494e-12,
  "score": 0.9999999999999747
}

Not sure I can read the stats correctly.

tankmann · December 15, 2019, 12:37pm

And now I read way more (I’m quite active here but really missed the details, not sure why) around the alpha and beta scores. Used the other script:

tankmann@Gynology:/volume1/Storj$ ./dashj.sh 
node id : 1255c3yrzm1F31j5RkWhHZMfntrRWsSBEvcRmduq4Lxsg9FFbwo 
version : 0.27.1 (latest) ping: 1548000 seconds ago

bandwidth avail: 9313.226GB used: 841.872GB 
     disk avail: 9313.226GB used: 1709.202GB 

sat-id audits    score    uptime    score    
118UWp 16298/16127 0.841390 93919/92551 0.991399 
       storage total: 607138.560GBhrs lastday: 2019-12-14  9252.892GBhrs 
     bandwidth total:   725.160GB    lastday: 2019-12-15  5698.358MB(in) 7730.343MB(out)  
121RTS  345/318  0.598738 48134/47368 1.000000 
       storage total:  2124.880GBhrs lastday: 2019-12-15   193.963GBhrs 
     bandwidth total:    34.295GB    lastday: 2019-12-14     0.000MB(in) 3674.734MB(out)  
12EayR 14187/13669 0.598737 101408/100548 1.000000 
       storage total:  5323.518GBhrs lastday: 2019-12-15   260.741GBhrs 
     bandwidth total:    47.347GB    lastday: 2019-12-15     0.000MB(in)   0.750MB(out)  
12L9ZF 20395/19852 0.789686 84714/82712 0.991485 
       storage total: 102977.578GBhrs lastday: 2019-12-15  2219.733GBhrs 
     bandwidth total:    35.069GB    lastday: 2019-12-15     0.000MB(in) 1409.295MB(out)

But I still don’t understand the scores vs the audits.
Foe example second satelite: 318 of 345 audit checks ‘ok’ which for me is more 92% … the score shows 0.59 (is that 59%?) - I read under 0.6 it disqualifies, ok got it… but as my process was hanging it wasnt communicating but data should be still there / and no ‘lost data’.

Also my past understanding was that audit is verifying if the file is there correctly, but the descriptions on the webinterface say:

Uptime Checks: Uptime checks occur to make sure your node is still online. This is the percentage of uptime checks you’ve passed.
Audit Checks: Percentage of successful pings/communication between the node & satellite.

So none of the checks have anything to do with file integrity right?

Sorry guys but I’m confused, apologies for posting my own progress here but after I switched from my raspberry Pi3 setup to the Synology I though I was doing GREAT…

Alexey · December 15, 2019, 1:29pm

Score calculated as described there: https://github.com/storj/storj/blob/cb894965698cea63dd64cde98bd6e1f188dc2c2b/docs/blueprints/node-selection.md
It’s falling fast, if node has missed data.
The check percentage it’s a percentage of successfully checks for node lifetime.
So the score is more precise. If your node managed to lost data, its audit score will fall within minutes. The percentage can even don’t notice the difference.
For disqualification we uses an audit score.

anon27637763 · December 15, 2019, 1:51pm

There’s a bit of background information as well as mathematical work required to extract the applicable equation from that document.

The uptime component is currently being ignored for calculating the score.
The score listed in the API is the Total Uplink Reputation

Given these two, non-obvious pieces of information… which require one to have followed along in this forum to understand… the applicable equation for manually calculating a node’s score from the numbers presented in the API is:

Total Uplink Reputation = uptime_uplink_weight · uptime R(n) + audit_uplink_weight · audit R(n)

However, the uptime R(n) is ignored, so the equation becomes simply:

audit_uplink_weight · audit R(n)

Substituting back in…

audit_uplink_weight * alpha / (alpha+beta)

The weight is 1.0 because the uptime component is 0.0

Thus the simple calculation extracted for the current reputation is:

Reputation = alpha / (alpha+beta)

For the numbers posted by @tankmann for the 121RTS satellite :

11.974770198760325 / (11.974770198760325+8.02526121523242) = 0.5987375695011331

Unfortunately, this is a disqualifying score. Strangely, if the --erroneously calculated (IMO)-- uptime measurement was included in the reputation calculation, the weight of missing audits would be less than 1.0 … such situation would help raise the reputation score of this particular node, since the uptime is more stable than the audit score.

The labels alpha and beta in the API output are extremely confusing due to the fact that there are alpha and beta releases of the SN software… It’s not clear at all that the alpha and beta numbers are variables in the score calculation.

Alexey · December 15, 2019, 1:54pm

Yes, this is technical document and you can see how this score is calculated. To retrieve this information you can request the API and get a calculated value.

tankmann · December 15, 2019, 2:31pm

Ok thanks for all the details - As I was using the script “successrate.sh” to check and never saw any issues in there in terms of audits failed etc. I thought I’m running perfectly.
Actually if there’s anything going wrong would be good to count whatever errors are there and show it in the dashboard - this has caught me by BIG suprise.

Anyway - I downloaded the logs just in case as in my setup they’re not staying there forever but what’s your recommendation to look through the logs? If I grep for error I get only these types (context cancelled which I thought is just me being too slow - hence ignored them)

|2019-12-15T14:12:35.501Z|INFO|piecestore|upload failed|{"Piece ID": "SEB5SJPRTGHAS3PWGELWDCZPFDB4V3HPVMDQB6JZSME6MBKN64FQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT", "error": "piecestore protocol: context canceled", "errorVerbose": "piecestore protocol: context canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:344\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:223\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func1:985\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}|
|---|---|---|---|---|
|2019-12-15T14:18:41.294Z|INFO|piecestore|upload failed|{"Piece ID": "X36WGTXT7OOT44RPQZB67USU6JM4OL6DDFN5A35MXZNYZ44YLQ7A", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT", "error": "piecestore protocol: context canceled", "errorVerbose": "piecestore protocol: context canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:344\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:223\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func1:985\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}|
|2019-12-15T14:19:18.879Z|INFO|piecestore|upload failed|{"Piece ID": "6UHRELW6GXSAQMKEB4RYYADSGHLBVM4ZYWPK3NAQTW3QNOABT24A", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT", "error": "piecestore protocol: context canceled", "errorVerbose": "piecestore protocol: context canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:344\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:223\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func1:985\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}|
|2019-12-15T14:29:06.571Z|INFO|piecestore|upload failed|{"Piece ID": "6CC4WYHWJR6ZXCST4TSJZH2XAQAKIRCDZIJB3VCH73DZYYPXIAMQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT", "error": "piecestore protocol: context canceled", "errorVerbose": "piecestore protocol: context canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:344\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:223\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func1:985\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}|

To download error logs I used the synology download csv option as it was by day (within the last 3 days it happened), uploaded them here:

Any idea what to do?
Thanks

Alexey · December 15, 2019, 3:46pm

Those are usual context canceled errors when your node was not fast enough and customer cancel upload, because competitors finished earlier.

Alexey · December 15, 2019, 3:52pm

Please, try to search such errors:

docker logs storagenode 2>&1 | grep AUDIT | grep failed | tail

tankmann · December 15, 2019, 7:33pm

Hi Alexey,

thanks - I’m really checking my logs from time to time, there’s nothing in there. Standard log level…
It doesn’t spit out anything… Or maybe I’m just stupid at the moment

tankmann@Gynology:/$ sudo docker logs storagenode 2>&1 | grep AUDIT | grep failed | tail
tankmann@Gynology:/$

tankmann · December 15, 2019, 7:45pm

So three things I don’t understand:

In the successrate.sh I’ve never saw any errors / failed audits (is that because the script doesnt work properly any more ?)
In the uptime / audit checks that the webinterface shows the numbers are ‘okayish’ no, except the asia one which is lower, how am supposed to see that something is broken here?!
satellite: uptime checks // audit checks
us-central: 99.2% // 96.3% (now paused)
stefan-benten: 98.5% // 99.0%
asia-east: 98.4% // 92.2% (now paused)
europe-west: 97.6% // 97.3%
I’ve randomly browsed (like once or twice a week) through the logfiles and never seen any other error than context cancelled so too slow…

And I’ve just ran the dashj.sh again but what’s striking is the score is now 1.000 (from 0.841390) for the first and 0.972972 (from 0.789686) for the last (both ones that are not paused…) just in a couple of hours? I dont get it

tankmann@Gynology:/$ ./volume1/Storj/dashj.sh 
node id : 1255c3yrzm1F31j5RkWhHZMfntrRWsSBEvcRmduq4Lxsg9FFbwo 
version : 0.27.1 (latest) ping: 982801 seconds ago

bandwidth avail: 9313.226GB used: 859.529GB 
     disk avail: 9313.226GB used: 1718.445GB 

sat-id audits    score    uptime    score    
118UWp 16318/16147 1.000000 93947/92579 0.993509 
       storage total: 646863.319GBhrs lastday: 2019-12-14 48977.651GBhrs 
     bandwidth total:   741.525GB    lastday: 2019-12-15 13905.325MB(in) 16125.785MB(out)  
121RTS  345/318  0.598738 48142/47376 1.000000 
       storage total:  2457.168GBhrs lastday: 2019-12-15   526.252GBhrs 
     bandwidth total:    34.295GB    lastday: 2019-12-14     0.000MB(in) 3674.734MB(out)  
12EayR 14187/13669 0.598737 101416/100556 1.000000 
       storage total:  5770.579GBhrs lastday: 2019-12-15   707.802GBhrs 
     bandwidth total:    47.347GB    lastday: 2019-12-15     0.000MB(in)   0.750MB(out)  
12L9ZF 20435/19892 0.972972 84762/82760 0.994744 
       storage total: 106828.690GBhrs lastday: 2019-12-15  6070.845GBhrs 
     bandwidth total:    36.362GB    lastday: 2019-12-15     0.000MB(in) 2697.411MB(out)

Alexey · December 16, 2019, 4:46am

As I said, the audit score react extremely fast on any move in data. If you lost noticeable amount of data, it will quickly reduced to below 0.6
The same is happening when one will run a clone of the node, it will be disqualified within minutes.

By the way, @tankmann did you stop your raspberry Pi3?

On the dashboard you can see a check percentage for the lifetime of node. This is just percentage of successfully checks from a total amount. You can see them in the API too. It’s successCount/totalCount*100. It’s not an audit score.
I saw suggestions to show the score here:

But it’s not taken into account yet. Better to vote for such idea there: https://ideas.storj.io

tankmann · December 16, 2019, 10:44am

Of course, raspberry Pi3 was stopped, copied over and then ‘killed’ But yes, good question - unfortunately not the answer.
So for my case as the process hung / swapped it probably just didn’t respond at all which then lead to the score below 0.6 … as I don’t see any error logs I assume data wise it should be good.
…

BrightSilence · December 16, 2019, 6:16pm

Is there still a time out in place for audits? I can imagine the situation @tankmann describes with the system becoming unresponsive might start the audit interaction, but isn’t able to respond in time or basically doing anything meaningful like writing to the logs. I have noticed before that Synology can completely hang when running out of memory. Noticed this on my old Synology (6GB) at least 2 times. New one luckily has much more RAM (16GB).

For what it’s worth, my RAM usage has been pretty much flat over the past week. I doubt the excessive RAM use was caused by Storj.

tankmann · December 16, 2019, 9:07pm

Thanks for the reply @BrightSilence, so in the task manager it showed docker.
But remember that there is also this weird thing in terms of docker showing used amount X where as the process shows low - see screenshot below: 905 MB ram usage vs. 4.51 GB under the container. I remember you mentioned it was a bug in docker for synology but the last docker update didnt fix it…

BrightSilence · December 16, 2019, 9:59pm

Unfortunately the last docker update was a pretty small one and Synology is still pretty far behind on updates. If you want to know how much RAM is actually in use by your node you can look at the specific process by opening the detailed info of the container and going to the process tab.

So it’s only using 42MB.

Meanwhile, the container claims it’s using more than 5.5 GB.

tankmann · December 16, 2019, 10:27pm

Yes that’s clear - thanks.
When I checked the RAM I didn’t use above, just checked top on ssh
But coming back to the real topic, my real question are like above:

Yes, I didn’t check RAM/Swap usage and now will get notified, ok
But in the meantime / before: How could I have seen that there might be some audits fail?
And to be honest I still don’t understand if there are real failures in the logs I uploaded, I don’t find any errors beside the too slow / context cancelled
Only when I use the dashj.sh script I now see that it shows for example 16374/16203 under audits (that 171 which is the difference) do fail. But is it really fail as in missing data or just no answer due to node being unresponsible

I’m still not clear on that … also because I don’t understand why my new setup should loose any data at all?