I’ve had a brief chance to go through part of the log file, and a few things are worth noting. First, a timeline:
13:51: Disk went offline according to Event Viewer. The Storj node log starts showing empty lines at this point; before that, it was successfully uploading and downloading pieces.
15:04: Received Email that AP1 was disqualified.
15:04: Received Email that EU1 was disqualified.
15:15: Received Email that Saltlake was disqualified.
15:50: Received Email that US1 was disqualified.
15:55-16:15: I saw the AP1 email, checked the server, and brought the disk back online. The dashboard was still showing 100% suspension and audit scores for all satellites (presumably because of the default value of nodestats.reputation-sync, which appears to be 4 hours).
19:03: I manually restarted the node after seeing that the H drive was at 0% utilization. The dashboard then started showing a warning that the affected satellites were disqualified, with audit scores of ~96%.
For reference, here is the reputation information from the log after the restart, when the satellites showed as disqualified:
2023-08-30T19:07:53-04:00 INFO reputation:service node scores updated {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Total Audits": 412461, "Successful Audits": 394329, "Audit Score": 1, "Online Score": 0.9981712725060272, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:53-04:00 INFO reputation:service node scores updated {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 953073, "Successful Audits": 927705, "Audit Score": 0.959809440525067, "Online Score": 0.9989558320629611, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:54-04:00 INFO reputation:service node scores updated {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 1297653, "Successful Audits": 1269795, "Audit Score": 0.9598094405250738, "Online Score": 0.9988364383003573, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:54-04:00 INFO reputation:service node scores updated {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 1765957, "Successful Audits": 1728037, "Audit Score": 0.9598094405249591, "Online Score": 0.9981104464364042, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:54-04:00 INFO reputation:service node scores updated {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 1386690, "Successful Audits": 1353832, "Audit Score": 0.9598094405250738, "Online Score": 0.9982045692256415, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:55-04:00 INFO reputation:service node scores updated {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Total Audits": 1979614, "Successful Audits": 1936861, "Audit Score": 1, "Online Score": 0.9958238630652381, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
And here is the same information from about 20 minutes before the disk went offline:
2023-08-30T13:32:07-04:00 INFO reputation:service node scores updated {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Total Audits": 412461, "Successful Audits": 394329, "Audit Score": 1, "Online Score": 0.9981712725060272, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:07-04:00 INFO reputation:service node scores updated {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 953023, "Successful Audits": 927696, "Audit Score": 0.9999999999999925, "Online Score": 0.9989558320629611, "Suspension Score": 1, "Audit Score Delta": 0.0000000000000008881784197001252, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:07-04:00 INFO reputation:service node scores updated {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 1297593, "Successful Audits": 1269777, "Audit Score": 1, "Online Score": 0.9988364383003573, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:08-04:00 INFO reputation:service node scores updated {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 1765905, "Successful Audits": 1728026, "Audit Score": 0.9999999999998772, "Online Score": 0.9981104464364042, "Suspension Score": 1, "Audit Score Delta": 0.000000000000012989609388114332, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:08-04:00 INFO reputation:service node scores updated {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 1386632, "Successful Audits": 1353815, "Audit Score": 1, "Online Score": 0.9982045692256415, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:08-04:00 INFO reputation:service node scores updated {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Total Audits": 1979614, "Successful Audits": 1936861, "Audit Score": 1, "Online Score": 0.9958238630652381, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
So it looks like the node failed about 41 audits on each of the four disqualified satellites (presumably in a row; total audits were 50-60 per satellite during that timeframe) over the course of an hour or so before being disqualified. Is that a reasonable amount of time to react to and fix a correctable issue that doesn’t compromise any data? Why shouldn’t this result in a suspension and a notification first? Or, at a minimum, why not have the node take itself offline when it fails multiple audits in a row, so that it isn’t disqualified before the operator can address the problem?
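For anyone who wants to check the numbers, here is a minimal Python sketch that diffs two of the "node scores updated" entries quoted above (the 1wFT... satellite, copied verbatim) and counts the failed audits in between. The second half is more speculative: it assumes the satellite uses a beta-reputation style update with a forgetting factor of 0.999 and a disqualification threshold of 0.96. Neither value appears in my logs, and the function and constant names are just mine for illustration, so treat that part as an assumption rather than the actual satellite code.

```python
import json
import re

# Captures the JSON payload at the end of a "node scores updated" log line.
PAYLOAD = re.compile(r"node scores updated\s+(\{.*\})\s*$")

# Assumed satellite settings; neither value appears in the logs above.
ASSUMED_LAMBDA = 0.999       # forgetting factor applied on every audit
ASSUMED_DQ_THRESHOLD = 0.96  # audit score below which a node is disqualified


def parse_scores(log_text):
    """Map satellite ID -> score payload for each 'node scores updated' line."""
    scores = {}
    for line in log_text.splitlines():
        match = PAYLOAD.search(line)
        if match:
            payload = json.loads(match.group(1))
            scores[payload["Satellite ID"]] = payload
    return scores


def audit_deltas(before, after):
    """Per satellite: (new audits, new failed audits, audit score afterwards)."""
    deltas = {}
    for sat, old in before.items():
        new = after.get(sat)
        if new is None:
            continue
        total = new["Total Audits"] - old["Total Audits"]
        passed = new["Successful Audits"] - old["Successful Audits"]
        deltas[sat] = (total, total - passed, new["Audit Score"])
    return deltas


def score_after_failures(n, lam=ASSUMED_LAMBDA):
    """Score after n consecutive failures, starting from a perfect history.

    Assumed model: alpha and beta both decay by lam on each audit, and a
    failed audit adds a weight of 1 to beta only.
    """
    alpha, beta = 1.0 / (1.0 - lam), 0.0  # steady state after many successes
    for _ in range(n):
        alpha = lam * alpha
        beta = lam * beta + 1.0
    return alpha / (alpha + beta)


# The two entries for satellite 1wFT..., copied from the excerpts above.
BEFORE = (
    '2023-08-30T13:32:07-04:00 INFO reputation:service node scores updated '
    '{"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", '
    '"Total Audits": 953023, "Successful Audits": 927696, '
    '"Audit Score": 0.9999999999999925, "Online Score": 0.9989558320629611, '
    '"Suspension Score": 1, "Audit Score Delta": 0.0000000000000008881784197001252, '
    '"Online Score Delta": 0, "Suspension Score Delta": 0}'
)
AFTER = (
    '2023-08-30T19:07:53-04:00 INFO reputation:service node scores updated '
    '{"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", '
    '"Total Audits": 953073, "Successful Audits": 927705, '
    '"Audit Score": 0.959809440525067, "Online Score": 0.9989558320629611, '
    '"Suspension Score": 1, "Audit Score Delta": 0, '
    '"Online Score Delta": 0, "Suspension Score Delta": 0}'
)

deltas = audit_deltas(parse_scores(BEFORE), parse_scores(AFTER))
for sat, (total, failed, score) in deltas.items():
    print(f"{sat[:6]}...: {failed} failed of {total} audits, score now {score:.6f}")
    print(f"  modelled score after {failed} straight failures: "
          f"{score_after_failures(failed):.6f}")
```

For this satellite the diff comes out to 41 failed audits out of 50. Under the assumed values, the modelled score stays above 0.96 after 40 consecutive failures and first drops below it on the 41st, landing on the same ~0.9598 that the log shows, which would fit the guess that the failures happened in a row.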