Disqualification

hello,

A few days ago I found out that the drive I use for Storj had shut down (my node is a Raspberry Pi 4 with a USB drive).
The node had been up for almost 3 weeks without problems (I think).
So, a few days ago I tried to list the contents of the drive; it woke up but took ages to answer, so I cleanly restarted the whole Raspberry and it seemed to work again.
I found the drive in the same state the next day, so I wanted to migrate from the dedicated drive (4TB, 3.2 allocated to the node) to the one I use as a "NAS" (8TB, at least 5 free) with an "rsync" command (to a dedicated folder).
I updated the node configuration and everything seemed OK (yesterday).
This morning (I'm in Europe) I found out the node had 2.6TB of space left, while it said 1TB 3 days ago, and now I have the message: "Your node has been disqualified on 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE . If you have any questions regarding this please check our Node Operators thread on Storj forum."

Is there anything I can do? I guess something went wrong with the "rsync" command, but I don't know what, or how to correct it.
I still have the previous drive, but maybe it is too late.
Should I just create a new identity and clear the files the current node holds?
Is there any way to migrate cleanly, or to shut the node down correctly (I think the node holds mainly test data from "saltlake") and avoid a "repair" operation on the data still present on my node's drive?

My guess for the whole thing: initially the node had 3TB available, so data was constantly coming in to fill it up, and the drive never sat idle long enough to go into deep sleep mode.
Once there was only 1TB free, less data was coming in and the drive went into deep sleep, and the combination of the drive (Seagate Barracuda) and the enclosure (Icy Box USB3, specific model unidentified) makes it hard to wake up :frowning:

Thanks in advance for any hints.

Welcome to the forum @fry!

https://documentation.storj.io/resources/faq/migrate-my-node

Did you follow the steps from the link above?

thank you :slight_smile:

Well, not really, as I only found that post after my message here :confused:
In fact, I have all the Storj scripts on the SD card of the Pi, so no identity or such to migrate (I guess).
My drive was mounted at /media/storj and this is the path I gave to Docker as a parameter.
I did an "rsync" from /media/storj/ to /media/other_drive/storj/ (a newly created folder on the other drive) and changed the path in the parameter given to Docker.

After the rsync I did an "ls -al" on both the old and the new location and it seemed OK (as far as I remember, the same folders and files existed in both).

Reading the "migrate-my-node" FAQ, I guess I should have "rsynced" only the "storage" folder in /media/storj/ and skipped the "config.yaml", "lost+found", "revocation.db" and "trust-cache.json" that were in the same root folder :s

Edit: by the way, I used "rsync -arv" instead of "-aP".

Your way works too; you just need to remove the config.yaml so the container can recreate it.
Perhaps you didn't run rsync one last time after the node shutdown, and some pieces were left behind on the old drive.
Also, a network-connected drive is much slower than a locally connected drive, and you could lose pieces during transfer with any loss of network packets.
SMB and NFS are not compatible with the storagenode; the only compatible network protocol is iSCSI.
So there is nothing you can do right now, because the time is gone. You can still run the node for the other satellites until they disqualify it too, or you can start from scratch.

hello,

In fact, the Pi and my 8TB drive are my NAS; I used a 4TB drive dedicated to Storj, which disconnected for an unknown reason.
I did 2 rsyncs, the first with the node up and the second after the node was shut down, so I don't understand why it lost data from saltlake (the other satellites have not disqualified the node… yet?).

Is there a way to cleanly remove the node from the network?
I mean, not just shut it down, but forbid any incoming data and make sure the data already on the node has been transferred to another one, thus avoiding the need to "repair" the data?

Yesterday evening I found out my main drive (the 8TB one) was offline too. That never happened before: it worked perfectly as my NAS with an uptime of more than 60 days, and about 24h after I started using it for the Storj node it went offline :frowning:
I found out my 4TB drive is "SMR".
I'm almost sure the 8TB one is too. It is a Seagate desktop drive so I don't have the internal reference, but some people who opened one (in Amazon comments) found a reference, and I think it is one of those listed in https://blocksandfiles.com/2020/04/15/seagate-2-4-and-8tb-barracuda-and-desktop-hdd-smr/
That would explain why I heard it working even without any activity from the Raspberry (checked with the iotop command).

Any idea why a drive would disconnect?
The "docker ps -a" command listed the node as stopped, for 3 minutes, when I found out I couldn't access the drive.
I don't remember whether disconnecting/reconnecting the USB was enough, or if I had to reboot the Raspberry to get everything back to normal.

I’m wondering though:

  • Why is disqualification definitive? I mean, in the current situation: if @fry were to find the root cause and solve it, the node could stay online for all satellites but the one it's been disqualified on. Considering this, there could be a way to "re-apply" to the satellite that disqualified the node for a fresh start.
  • @Alexey: In the current state of things, I don't see why any SNO would want to keep a node online when it is disqualified on at least one satellite.
  • Why didn't fry get an e-mail notification telling them that something was starting to fail on some satellite, so they could look into it before discovering one morning that it's too late? That's a level of frustration StorjLabs may want to avoid as much as possible if they want to keep their SNOs aboard…

I'm still not getting why disqualification is such a punitive thing, honestly. It makes sense to disqualify a node if it fails at providing the service, but in this situation it would make sense to me to have a simple way to start afresh.

Maybe it’s just me :confused:

That’s my 2 cts.
I’m still kinda happy with Storj for now, but let’s say I think there is plenty of room for improvement :slight_smile:

All satellites pay independently, so the other satellites will still pay for usage.
The only reason to shut down the node completely is if all satellites have disqualified it.

This is the exact reason for disqualification: the node is online, but the data is unavailable => it did not provide the service => disqualification.

You can add your idea or vote for existing here: https://ideas.storj.io
FYI - we are designing the suspend mode, which should be applied before the actual disqualification:


A node will be disqualified for consistently returning bad data during audits - this could happen if data they are supposed to be storing is lost or corrupted. These issues are serious and this is why disqualification is definitive.

For less serious issues that cause errors that could easily be fixed by a node operator (e.g. configuration issues like not being able to read from a DB because of permissions), we “suspend” instead of disqualifying (see @alexey’s link above). A node will only be disqualified from suspension mode if they do not fix the issue causing these errors within a week.


Only for unknown audit failures. Missing or corrupt files would still rightfully lead to disqualification.

There is no way for the satellite to know a problem is actually fixed. Someone could just be “trying it again”. The harsh punishment is there for a good reason. That said, you can work around it, but it requires some effort.

Simply do the following:

  1. Start a second node on the same machine.
  2. Wait until the new node is vetted
  3. Reduce the allocated size of the old node to 0 to ensure new data goes to the new node
  4. Either keep the old node running for egress or gracefully exit to get your held back amount back. This may need to be a phased approach if your node isn’t old enough yet for graceful exit.

This would transition you to a fully working new node on all satellites without significant loss of income. There is still some loss of income and the new node will start keeping amounts held back again, but that’s the price you pay for being disqualified in the first place.
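Step 1 of the plan above roughly amounts to running a second container alongside the first, with its own identity, storage path, and external port. As a hypothetical sketch (every name, path, port, and address below is a placeholder, not taken from the thread):

```shell
# Hypothetical second node next to an existing one; note the different
# container name, host port, identity folder, and storage folder.
docker run -d --restart unless-stopped \
    --name storagenode2 \
    -p 28968:28967 \
    -e WALLET="0xYourWalletAddress" \
    -e EMAIL="you@example.com" \
    -e ADDRESS="your.ddns.example:28968" \
    -e STORAGE="4TB" \
    --mount type=bind,source=/path/to/identity2,destination=/app/identity \
    --mount type=bind,source=/media/new_drive/storj,destination=/app/config \
    storjlabs/storagenode:latest
```

Step 3 would then be a matter of restarting the old container with `-e STORAGE="0"` (or the equivalent setting in its config.yaml) so it stops accepting new data while continuing to serve what it already holds.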


Hello,
I started the node 3 weeks ago; the dashboard estimates the payout at less than $1. I think I will try the graceful exit, not for the income but for the stability of the network, even if I'm not sure losing the 600GB still on my node would change anything :D.

I'm OK with doing what you explained; I will just wait until I have a non-SMR drive available to start a new node.

Before the DQ from saltlake, the dashboard indicated there was 1TB free; it is now 2.59 (with 3.2TB allowed in the startup script). Did the DQ correctly remove the data sent by that satellite, or is there some cleanup to do?

Graceful exit won’t work unless your node is 6 months old. I would say just keep this node running until you have a non-SMR drive and then still follow the plan I outlined above. The SMR drive is going to have issues with the amount of traffic from Saltlake anyway right now.

I don’t believe the node cleans up data after disqualification. But I don’t know what the best practice is there. It’s usually a bad idea to remove data yourself. One mistake and you’ve messed up other satellites as well. There is a blobs folder per satellite, but it does not have a human readable name and you really don’t want to remove the wrong one.

Well, it seems I will soon be disqualified from the other satellites too: my 8TB drive disconnected almost an hour ago (and then the node shut itself down, or crashed, I'm not sure which).

Extracts from the logs:
Apr 22 12:52:53 nextcloudpi rngd[352]: stats: Time spent starving for entropy: (min=0; avg=0.000; max=0)us
Apr 22 12:54:40 nextcloudpi kernel: [133315.691563] BTRFS warning (device sda1): csum failed root 5 ino 62543 off 36122624 csum 0x265dc12a expected csum 0x32d8fee9 mirror 1
Apr 22 12:55:01 nextcloudpi CRON[6938]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 22 13:00:01 nextcloudpi CRON[6966]: (www-data) CMD (php -f /var/www/nextcloud/cron.php)
Apr 22 13:05:01 nextcloudpi CRON[6994]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 22 13:08:28 nextcloudpi kernel: [134143.450121] sd 0:0:0:0: [sda] tag#8 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
Apr 22 13:08:28 nextcloudpi kernel: [134143.450140] sd 0:0:0:0: [sda] tag#8 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00

Apr 22 13:08:47 nextcloudpi kernel: [134163.058081] sd 0:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 21 inflight: CMD OUT
Apr 22 13:08:47 nextcloudpi kernel: [134163.058094] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 01 55 de 9d a0 00 00 00 40 00 00
Apr 22 13:08:47 nextcloudpi kernel: [134163.160436] scsi host0: uas_eh_device_reset_handler start
Apr 22 13:08:52 nextcloudpi kernel: [134168.230781] usb 2-2: Disable of device-initiated U1 failed.
Apr 22 13:08:57 nextcloudpi kernel: [134173.270852] usb 2-2: Disable of device-initiated U2 failed.
Apr 22 13:08:58 nextcloudpi kernel: [134173.421576] usb 2-2: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
Apr 22 13:08:58 nextcloudpi kernel: [134173.457795] scsi host0: uas_eh_device_reset_handler success
Apr 22 13:09:01 nextcloudpi CRON[7022]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Apr 22 13:09:03 nextcloudpi systemd[1]: Starting Clean php session files…
Apr 22 13:09:03 nextcloudpi systemd[1]: phpsessionclean.service: Succeeded.
Apr 22 13:09:03 nextcloudpi systemd[1]: Started Clean php session files.
Apr 22 13:09:05 nextcloudpi kernel: [134180.405357] xhci_hcd 0000:01:00.0: WARNING: Host System Error
Apr 22 13:09:10 nextcloudpi kernel: [134185.430779] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
Apr 22 13:09:10 nextcloudpi kernel: [134185.430825] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
Apr 22 13:09:10 nextcloudpi kernel: [134185.431609] xhci_hcd 0000:01:00.0: HC died; cleaning up
Apr 22 13:09:10 nextcloudpi kernel: [134185.431937] usb 1-1: USB disconnect, device number 2
Apr 22 13:09:10 nextcloudpi kernel: [134185.432833] usb 2-2: USB disconnect, device number 2
Apr 22 13:09:10 nextcloudpi kernel: [134185.433290] sd 0:0:0:0: [sda] tag#9 uas_zap_pending 0 uas-tag 2 inflight: CMD
Apr 22 13:09:10 nextcloudpi kernel: [134185.433308] sd 0:0:0:0: [sda] tag#9 CDB: opcode=0x8a 8a 00 00 00 00 01 55 de 9e 20 00 00 00 20 00 00
Apr 22 13:09:10 nextcloudpi kernel: [134185.433337] sd 0:0:0:0: [sda] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Apr 22 13:09:10 nextcloudpi kernel: [134185.433351] sd 0:0:0:0: [sda] tag#9 CDB: opcode=0x8a 8a 00 00 00 00 01 55 de 9e 20 00 00 00 20 00 00
Apr 22 13:09:10 nextcloudpi kernel: [134185.433364] print_req_error: I/O error, dev sda, sector 5735620128
Apr 22 13:09:10 nextcloudpi kernel: [134185.433383] BTRFS error (device sda1): bdev /dev/sda1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0

Apr 22 13:09:10 nextcloudpi kernel: [134185.434979] BTRFS warning (device sda1): chunk 13631488 missing 1 devices, max tolerance is 0 for writeable mount
Apr 22 13:09:10 nextcloudpi kernel: [134185.434996] BTRFS: error (device sda1) in write_all_supers:3716: errno=-5 IO failure (errors while submitting device barriers.)
Apr 22 13:09:10 nextcloudpi kernel: [134185.435040] BTRFS info (device sda1): forced readonly
Apr 22 13:09:10 nextcloudpi kernel: [134185.435057] BTRFS: error (device sda1) in btrfs_sync_log:3187: errno=-5 IO failure
Apr 22 13:09:10 nextcloudpi kernel: [134185.441705] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Apr 22 13:09:10 nextcloudpi udisksd[374]: Cleaning up mount point /media/8to_main (device 8:1 no longer exists)
Apr 22 13:09:10 nextcloudpi systemd[1]: media-8to_main.mount: Succeeded.
Apr 22 13:09:10 nextcloudpi systemd[1]: Stopping Clean the /media/8to_main mount point…
Apr 22 13:09:10 nextcloudpi systemd[1]: clean-mount-point@media-8to_main.service: Succeeded.
Apr 22 13:09:10 nextcloudpi nc-automount-links-mon[369]: 8to_main DELETE,ISDIR
Apr 22 13:09:10 nextcloudpi systemd[1]: Stopped Clean the /media/8to_main mount point.
Apr 22 13:09:10 nextcloudpi kernel: [134185.581093] BTRFS warning (device sda1): Skipping commit of aborted transaction.
Apr 22 13:09:10 nextcloudpi kernel: [134185.583910] BTRFS info (device sda1): delayed_refs has NO entry
Apr 22 13:09:10 nextcloudpi kernel: [134186.070801] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=0x07 driverbyte=0x00
Apr 22 13:09:10 nextcloudpi kernel: [134186.171378] xhci_hcd 0000:01:00.0: WARN Can’t disable streams for endpoint 0x81, streams are being disabled already
Apr 22 13:09:15 nextcloudpi kernel: [134190.471128] btrfs_dev_stat_print_on_error: 105 callbacks suppressed

and then
Apr 22 13:09:21 nextcloudpi kernel: [134196.952619] docker0: port 2(vetha371623) entered disabled state

I guess I should just shut the node down for now and bring everything back online with a non-SMR drive, without changing anything else.
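For what it's worth, the `uas_eh_abort_handler` and `xHCI host controller not responding` lines in the log above are a pattern often seen with USB-SATA bridges that misbehave under the UAS driver on the Raspberry Pi. A commonly suggested workaround (a sketch, not a confirmed fix for this particular enclosure) is to force the older usb-storage driver for that device via a kernel quirk:

```shell
# Find the enclosure's vendor:product ID (the "152d:0578" below is a
# placeholder - use whatever lsusb reports for your bridge).
lsusb

# Check which driver currently handles the device (uas vs usb-storage).
lsusb -t
```

Then prepend `usb-storage.quirks=152d:0578:u` (with your own IDs) to the single line in `/boot/cmdline.txt` and reboot. Throughput drops somewhat without UAS, but the resets and disconnects often stop.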

That should be double-checked, I think… Because it would be useless to keep the node up for the other satellites if the one that disqualified it occupied most of the space, which is fry's case, I believe.

Good to know the suspension mode is being worked on. Hopefully a node will be able to go into that mode when a disk is unreachable, instead of being DQed.

@Alexey & @moby & @BrightSilence: Thx for your insights :slightly_smiling_face:

It's suspension mode. Containment mode is different :nerd_face:

I did GE on 2 old nodes, only on the Stefan Benten satellite, and all of its data did get deleted. Some of it took a week because it went to the trash folder, but it is gone now.

@kevink GE might do that, but it doesn't apply to disqualified nodes. What I'm wondering is whether the data from a satellite a node got disqualified on gets deleted.

@nerdatwork right, sorry, I'll fix my post ^^ Thx.


Ah yes, sorry. Guess I just read too quickly.
It would indeed be bad if the data stayed around after DQ.