A few days ago I found out the drive I use for Storj had shut itself down (my node is a Raspberry Pi 4 with a USB drive).
The node had been up for almost 3 weeks without problems (I think).
So, a few days ago I tried to list the contents of the drive; it woke up but took ages to answer, so I cleanly restarted the whole Raspberry and it seemed to work again.
I found the drive in the same state the next day, so I wanted to migrate from the dedicated drive (4TB, 3.2 allowed for the node) to the one I use as a "NAS" (8TB, at least 5 free) with an "rsync" command (to a dedicated folder).
I updated the node configuration and everything seemed OK (yesterday).
This morning (I'm in Europe) I found out the node had 2.6TB of space left (it said 1TB 3 days ago), and now I have the message "Your node has been disqualified on 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE . If you have any questions regarding this please check our Node Operators thread on Storj forum."
Is there anything I can do? I guess something went wrong with the "rsync" command, but I don't know what, or how to correct it.
I still have the previous drive, but maybe it is too late.
Should I just create a new identity and clear the files the current node holds?
Is there any way to migrate cleanly, or to shut the node down correctly (I think the node holds mainly test data from "saltlake") and avoid a "repair" operation on the data still present on my node's drive?
My guess for the whole thing: initially the node had 3TB available, so it constantly had data coming in to fill it up, and never had enough idle time to go into deep sleep mode.
Once there was only 1TB free, there was less incoming data and the drive went into deep sleep, and the combination of the drive (Seagate Barracuda) and the case (Icy Box USB3, specific model unidentified) makes it hard to wake up.
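If deep sleep really is the trigger, one workaround worth trying before blaming the enclosure (a sketch assuming a Debian-based OS with the hdparm package installed; note that many USB enclosures ignore these settings, and the device path is an assumption) is to disable the drive's standby timer in /etc/hdparm.conf:

```
# /etc/hdparm.conf -- the device path is an assumption, check yours with lsblk
/dev/sda {
	# 0 disables the standby (spindown) timer entirely
	spindown_time = 0
}
```

The setting takes effect at boot (or when hdparm is re-run); if the enclosure ignores it, a tool that keeps the disk minimally busy is the usual fallback.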
Well, not really, as I just found this post after my message here.
In fact, I have all the Storj scripts on the SD card of the Pi, so no identity or such to migrate (I guess).
My drive was mounted in /media/storj and this is what I gave to Docker as a parameter.
I did an "rsync" from /media/storj/ to /media/other_drive/storj/ (a newly created folder on the other drive) and changed the path in the parameter given to Docker.
After the rsync I did an "ls -al" on both the old and the new location and it seemed OK (as far as I remember, the same folders and files existed in both).
Reading the "migrate-my-node" FAQ, I guess I should have rsynced only the "storage" folder in /media/storj/ and skipped the "config.yaml", "lost+found", "revocation.db" and "trust-cache.json" that were in the same root folder :s
Edit: by the way, I used "rsync -arv" instead of "-aP".
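Reading that FAQ back, the safe sequence can be sketched roughly like this (a sketch using the paths from this thread; the repeated passes and the final `--delete` run after stopping the node are the important parts):

```shell
# Migration sketch -- SRC/DST are the paths from this thread,
# adjust them to your own setup.
SRC="${SRC:-/media/storj/storage}"
DST="${DST:-/media/other_drive/storj/storage}"

if [ -d "$SRC" ]; then
    # First pass(es): copy while the node is still running.
    # Repeat until very little changes between runs.
    rsync -a "$SRC/" "$DST/"

    # ...stop the node here, e.g.: docker stop -t 300 storagenode ...

    # Final pass with the node stopped: --delete makes the destination
    # match the source exactly, removing files deleted on the source.
    rsync -a --delete "$SRC/" "$DST/"
fi
```

The repeated `-a` passes while the node runs keep downtime short; the final `--delete` pass after shutdown is what guarantees both sides match, and skipping it is exactly the kind of miss that leads to audit failures.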
Your way works too; you just need to remove the config.yaml so the container can recreate it.
Perhaps you didn't run rsync one last time after the node shutdown, and some pieces were left on the old drive.
Also, a network-connected drive is much slower than a locally connected one, and you could lose pieces during transfer with any loss of network packets.
SMB and NFS are not compatible with storagenode; the only compatible network protocol is iSCSI.
So there is nothing you can do right now, because the time is gone. You can still run that node with the other satellites until they disqualify it too, or you can start from scratch.
In fact, the Pi with my 8TB drive is my NAS; I used a 4TB drive dedicated to Storj, which disconnected for an unknown reason.
I did 2 rsyncs, the first with the node up and the second after the node was shut down, so I don't understand why it lost data from saltlake (the other satellites have not disqualified the node... yet?).
Is there a way to cleanly remove the node from the network?
I mean, not just shut it down, but forbid any incoming data and make sure the data already on the node has been transferred to another one, thus avoiding the need to "repair" that data?
Yesterday evening I found out my main drive (the 8TB one) was offline too. That had never happened before: it worked perfectly as my NAS with an uptime of more than 60 days, and about 24h after I used it for the Storj node it went offline.
I found out my 4TB drive is "SMR".
I'm almost sure the 8TB one is too. It is a Seagate desktop drive so I don't have the internal reference, but some people who opened it (in Amazon comments) found a reference, and I think it is one listed in Seagate 'submarines' SMR into 3 Barracuda drives and a Desktop HDD – Blocks and Files.
That would explain why I could hear it working without any activity from the Raspberry (according to the iotop command).
Any idea why a drive would disconnect?
The "docker ps -a" command listed the node as stopped, for 3 minutes, when I found out I couldn't access the drive.
I don't remember if disconnecting/reconnecting the USB was enough, or if I had to reboot the Raspberry to get everything back to normal.
Why is disqualification definitive? I mean, in the current situation: if @fry were to find the root cause and solve it, the node could stay online for all satellites but the one it's been disqualified on. Considering this, there could be a way to "re-apply" to the satellite that disqualified the node for a fresh start.
@Alexey: In the current state of things, I don't see why any SNO would want to keep a node online when it's disqualified on at least one satellite.
Why didn't fry get an e-mail notification telling them that something was starting to fail on some satellite, so they could look into it before discovering one morning that it's too late? I mean, that's a level of frustration StorjLabs may want to avoid as much as possible if they want to keep their SNOs aboard...
I'm still not getting why disqualification is such a punitive thing, honestly. It makes sense to disqualify a node if it fails to provide the service. But in this situation, it would make sense to me to have a simple way to start afresh.
Maybe it’s just me
That's my 2 cents.
I’m still kinda happy with Storj for now, but let’s say I think there is plenty of room for improvement
A node will be disqualified for consistently returning bad data during audits - this could happen if data they are supposed to be storing is lost or corrupted. These issues are serious and this is why disqualification is definitive.
For less serious issues that cause errors a node operator could easily fix (e.g. configuration issues like not being able to read from a DB because of permissions), we "suspend" instead of disqualifying (see @alexey's link above). A node will only be disqualified out of suspension mode if the operator does not fix the issue causing these errors within a week.
Only for unknown audit failures. Missing or corrupt files would still rightfully lead to disqualification.
There is no way for the satellite to know a problem is actually fixed. Someone could just be “trying it again”. The harsh punishment is there for a good reason. That said, you can work around it, but it requires some effort.
Simply do the following:
Start a second node on the same machine.
Wait until the new node is vetted
Reduce the allocated size of the old node to 0 to ensure new data goes to the new node
Either keep the old node running for egress or gracefully exit to get your held back amount back. This may need to be a phased approach if your node isn’t old enough yet for graceful exit.
This would transition you to a fully working new node on all satellites without significant loss of income. There is still some loss of income and the new node will start keeping amounts held back again, but that’s the price you pay for being disqualified in the first place.
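For reference, the two-node setup above can be sketched with the usual docker run parameters (a placeholder sketch only: the container name, port, paths and sizes here are examples, not the exact command, and a real run needs the full set of options from the official setup instructions):

```
# New node: its own (new) identity, its own storage folder, a different
# host port. All source paths below are placeholders.
docker run -d --name storagenode2 \
  -p 28968:28967 \
  -e STORAGE="3TB" \
  --mount type=bind,source=/path/to/identity2,destination=/app/identity \
  --mount type=bind,source=/path/to/storage2,destination=/app/config \
  storjlabs/storagenode:latest

# Old node: re-created with -e STORAGE="0" so it accepts no new data
# but keeps serving (and earning egress on) the pieces it already holds.
```

The key point is that the two nodes must not share an identity or a storage folder; only the allocation of the old one changes.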
I started the node 3 weeks ago and the dashboard estimates the payout at less than $1. I think I will try the graceful exit, not for the income but for the stability of the network, even if I'm not sure losing the 600GB still on my node would change anything :D
I'm OK with doing what you explained; I will just wait until I have a non-SMR drive available to start a new node.
Before the DQ from saltlake, the dashboard indicated there was 1TB free; it is now 2.59 (3.2TB allowed in the startup script). Did the DQ correctly remove the data sent by the satellite, or is there some cleanup to do?
Graceful exit won't work unless your node is at least 6 months old. I would say just keep this node running until you have a non-SMR drive and then still follow the plan I outlined above. The SMR drive is going to have issues with the amount of traffic from Saltlake right now anyway.
I don't believe the node cleans up data after disqualification, but I don't know what the best practice is there. It's usually a bad idea to remove data yourself: one mistake and you've messed up other satellites as well. There is a blobs folder per satellite, but it does not have a human-readable name and you really don't want to remove the wrong one.
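To answer the cleanup question without risking the other satellites' data, it's safer to just look at how much space each satellite's folder occupies rather than deleting anything. A minimal sketch, assuming the storage location used in this thread (the BLOBS path is an assumption, adjust it to your node's actual storage folder):

```shell
# Path is an assumption based on the migration target from this thread.
BLOBS="${BLOBS:-/media/other_drive/storj/storage/blobs}"

# One subfolder per satellite; the names are encoded satellite IDs,
# not human readable. This only reads sizes, it deletes nothing.
du -sh "$BLOBS"/* 2>/dev/null | sort -h
```

If the disqualified satellite's folder is ever to be removed, the mapping from folder name to satellite ID should be double-checked on the forum first; deleting the wrong one destroys data the other satellites will still audit.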