Hope this isn’t the end. I have a 6TB node that will be down for at least a week while I wait for parts to arrive for my injured server. Am I going to be able to recover? Or should I start using the HDDs as clay pigeons to relieve stress? Is there any way to “pause” (with a penalty or something), but not lose my 3-year reputation?
As long as your data is still intact, don’t go smashing those drives into clay pigeons just yet! You can definitely recover after being offline for a week. Your 3-year rep won’t vanish that easily. Good luck!
I was down for two weeks last year. My node fully recovered a long time ago and is happily hosting data.
After 288 hours (12 days) offline the node will be suspended on each satellite, and after 30 days offline it will be disqualified. A suspended node can be recovered if it stays online for 30 days… but it won’t have any ingress during that time.
Best explanation:
Well, after 11-ish days, I was able to get everything moved over to the new server (wow, moving these Storj files is SLOW AF…got to find a faster way to move millions of tiny files).
Had some DB corruption, so I blew out the whole DB and let it rebuild. Somehow accidentally put an old identity in the Docker folder, but thankfully I had the OG one still on the data disk (that scared me for a while).
BUT I’m back up and running…only showing 1TB used so far, but I’m sure once Luke Filewalker gets their jobs completed, that will grow again. All the satellites are at 100% for suspension and audit, but (of course) the online score is sitting between 63% and 66%. I will watch that number grow back to 100% with glee!
Word to the wise: don’t plug an old HBA card into an x4 (x16-size) slot when it needs an x8! Apparently, it cooks the HBA card and corrupts the Intel ME chip on the mobo!! I will be trying to flash that back to life down the road, as it’s not a cheap Dell server!
Thanks for the replies to calm my panic!
`zfs send`/`zfs receive` is the answer, with optional `mbuffer` in between (that absolves you from needing nc/ssh).
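A rough sketch of that pipeline, with hypothetical dataset and host names (adjust to your setup):

```sh
# mbuffer port 9090 is an arbitrary choice.
# 1. Take a snapshot of the source dataset:
zfs snapshot tank/storagenode@migrate

# 2. On the receiving host, listen with mbuffer and feed zfs receive:
mbuffer -I 9090 -s 128k -m 1G | zfs receive -F tank/storagenode

# 3. On the sending host, stream the snapshot through mbuffer over the LAN:
zfs send tank/storagenode@migrate | mbuffer -O receiver.local:9090 -s 128k -m 1G
```

mbuffer smooths out the bursty send stream so the disks on both ends stay busy, which is why it tends to beat a plain ssh pipe for bulk transfers.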
Some, especially old, enterprise cards (and also some consumer cards) are very particular about PCIe lane count, for reasons that evade my comprehension. I have seen plenty of network adapters that want an exact count of lanes and won’t boot if you plug them into a port with fewer lanes wired.
This is bad design and I would avoid such vendors, especially on Intel systems where PCIe lanes are a scarce resource.
But corrupting the firmware and/or hardware is the next level of recklessness. Please name and shame the vendor.
…LSI…sadly…in their defense, it’s a really old card (SAS9200-8e)!
I wish it just didn’t boot, but no…it did more!
I definitely need to figure out snapshots. This was a BTRFS pool.
Also `dd` piped to `gzip` and then to `ssh`; however, it could be corrupted in between and/or interrupted, so `rsync` and/or `rclone sync`, or `zfs send`/`zfs receive` as suggested by @arrogantrabbit, but of course you need ZFS for that case.
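For illustration, a minimal sketch of that `dd` pipeline, with a hypothetical source device and backup host:

```sh
# Hypothetical device and host names; replace with your own.
# Stream a whole block device, compressed, to a remote image over ssh:
dd if=/dev/sdb bs=1M status=progress | gzip -c | ssh user@backup-host 'cat > /backups/node-disk.img.gz'

# Restoring overwrites everything on the target device:
ssh user@backup-host 'cat /backups/node-disk.img.gz' | gunzip -c | dd of=/dev/sdb bs=1M
```

As noted above, a single corrupted byte in the gzip stream can ruin the whole image, and an interrupted transfer has to start from scratch, which is why the file-level or snapshot-level tools are the safer choice.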
Oops. I do not think it’s possible to send and receive a BTRFS snapshot.
It’s also slow for file-by-file transfer.
Perhaps `rclone sync` could do better.
It’s the same mechanism, almost verbatim: `btrfs send`/`btrfs receive`.
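For reference, a sketch of how that looks on BTRFS, with hypothetical subvolume paths:

```sh
# btrfs send requires a read-only snapshot:
btrfs subvolume snapshot -r /mnt/pool/storagenode /mnt/pool/storagenode@migrate

# Full send to the new server over ssh:
btrfs send /mnt/pool/storagenode@migrate | ssh user@new-server 'btrfs receive /mnt/newpool/'

# Later, send only the changes relative to the previous snapshot (-p = parent):
btrfs subvolume snapshot -r /mnt/pool/storagenode /mnt/pool/storagenode@migrate2
btrfs send -p /mnt/pool/storagenode@migrate /mnt/pool/storagenode@migrate2 | ssh user@new-server 'btrfs receive /mnt/newpool/'
```

The incremental send is what makes this so much faster than file-by-file tools for millions of small files: it streams changed data rather than walking the whole directory tree.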
I clearly need to learn more than I currently know! Although, before 5 years ago, I never touched a Linux OS at all; now I have x2 24/7 NAS(ish) servers running, one in production for a friend’s company (I broke my personal one…otherwise I would have just used company money and been back up in 48 hours max). I’m slowly learning!
I want to get to the place where I can have a data-center setup and sell space on it. Clearly, I’m not ready yet, but every day I learn something new!
Way off track from the OP and Storj, but I need to figure out how to use this for my off-site server backups! Running backup software for each file atm (with time-differential backups, but each file has to move, not a whole FS partition). It is tedious and prone to failure…something tells me I can have differential backups set up with this, etc.!
You may also use specialized backup programs like restic or duplicacy; see other backup software. They are much better than just sync and usually faster; they also use compression and snapshots, even if the underlying FS doesn’t have them.
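As an illustration with restic (repository location and paths are hypothetical):

```sh
# Initialize a repository on a remote host over sftp (one time):
restic -r sftp:user@backup-host:/backups/repo init

# Each run stores a deduplicated snapshot; after the first run,
# only changed data is uploaded:
restic -r sftp:user@backup-host:/backups/repo backup /mnt/data

# Inspect and restore:
restic -r sftp:user@backup-host:/backups/repo snapshots
restic -r sftp:user@backup-host:/backups/repo restore latest --target /mnt/restore
```

That gives you differential behavior at the repository level, independent of whether the filesystem underneath supports snapshots.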
Dammit…my EU node was disqualified (audit 95.97%)…must have had some files that didn’t go over properly (I’m guessing there was corruption from the whole server eating it).
I’m guessing there is no coming back from that?!?!
Should I be worried that my other sats are going to give up on me too (audit = 98% on x2 of them)?
With the EU sat no longer giving/getting data, are my other sats going to be able to handle more and make up for leaving the EU?
I feel like 1776 USA right now.
Unfortunately the disqualification is permanent and not reversible. There is a non-zero chance that all other satellites will follow if you have file corruption.
You need to stop the node ASAP and check the logs; it could be just a permissions issue, which could be fixed easily and would save the node from DQ on the other satellites.
If the node starts to pass audits again, the audit score for the other satellites can recover.
If your bandwidth was saturated before, it is possible. But just having fewer EU1 customers doesn’t mean an increased load from the customers of other satellites.
The customers’ behavior is not predictable. But yes, you may free up the space which has been used by the customers of the EU1 satellite; you may use these instructions for the EU1 satellite: How To Forget Untrusted Satellites.
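Roughly, per those instructions (a sketch only; verify the exact command and flags against the linked guide, and note the container name and satellite ID below are examples):

```sh
# "storagenode" is the container name; the ID is EU1's satellite ID
# as listed in the linked docs; double-check it before running.
docker exec -it storagenode /app/storagenode forget-satellite \
    --force 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs \
    --config-dir config --identity-dir identity
```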
If you have more disks, you may start a second node; it would take the full possible traffic from the customers of EU1 and would share the possible traffic from the customers of the other satellites with the first node. Just generate a new identity (do not copy the existing one, otherwise it will be disqualified almost instantly), sign it with a new authorization token, and run the second node using different external ports for the node and for the dashboard (if you use docker, that should be enough); for binary setups and with `--network host` in docker you also need to use unique ports for a few more addresses, see How to add an additional drive? - Storj Docs.
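A sketch of what the second node’s `docker run` might look like (every name, path, port, and address here is an example, not a prescription):

```sh
# Host port 28968 maps to the container's 28967, so node #1 keeps 28967;
# the dashboard gets 14003 instead of 14002.
docker run -d --name storagenode2 \
    -p 28968:28967/tcp \
    -p 28968:28967/udp \
    -p 127.0.0.1:14003:14002 \
    -e ADDRESS="your.ddns.example:28968" \
    -e EMAIL="you@example.com" \
    -e WALLET="0x0000000000000000000000000000000000000000" \
    -e STORAGE="6TB" \
    --mount type=bind,source=/mnt/disk2/identity/storagenode,destination=/app/identity \
    --mount type=bind,source=/mnt/disk2/storagenode,destination=/app/config \
    storjlabs/storagenode:latest
```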
It’s wrong. If you use `--user $(id -u):$(id -g)`, then the owner must be `$(id -u):$(id -g)`, and you should not use `sudo` to run the container. Or do not use the `--user` option and run the container with `sudo`; however, in that case the owner must be `root:root`.
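A quick way to check and fix the ownership, assuming a hypothetical data path of `/mnt/disk1/storagenode`:

```sh
# See who currently owns the node's data:
stat -c '%U:%G' /mnt/disk1/storagenode

# Option 1: you run the container with --user $(id -u):$(id -g)
sudo chown -R "$(id -u):$(id -g)" /mnt/disk1/storagenode

# Option 2: you run the container with sudo and no --user option
sudo chown -R root:root /mnt/disk1/storagenode
```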
I shall make that change…I just checked my other nodes too and realized my mistake. thx
You are welcome, and sorry that the node has been DQ on at least one satellite because of this mistake.
Please check the logs to be sure you do not have any issues with `GET_AUDIT` and `GET_REPAIR`.
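One way to do that check, assuming the container is named `storagenode`:

```sh
# Show recent failed audit/repair requests in the node's logs:
docker logs storagenode 2>&1 | grep -E 'GET_AUDIT|GET_REPAIR' | grep -i failed | tail -n 50
```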
So after reviewing the logs, it wasn’t because of that mistake…there were no problems uploading/downloading the files to/from the node. It was due to corruption of the files when the server crashed.
The permissions were set up properly; I had just panicked.
Just to close this off…lost my node…too much corruption on the pool, I guess.
Oh well, start over again I guess…ffs.
Sorry to hear. Was it a RAID of some kind?