New SNO, trying to understand why my node was disqualified on a single satellite

Hello,

I'm new here and new to the Storj game; I set up a storage node yesterday.

I have 100% uptime and good network/system/filesystem stats, but I got this message:

Your node has been disqualified on 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.

No problems on the other satellites.

Here are the status and logs (sorry, first message, I cannot format things better):

I have a

GET_AUDIT", “error”: “file does not exist”, “errorVerbose”: "file does not exist\n\tstorj.io/common/rpc/rpcstatus…

and some

piecedeleter delete failed
“error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore)…

Just to be sure, I scrubbed the storage (btrfs): no errors.
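(For reference, the scrub was roughly this; the mount point is the one from my setup below:)

btrfs scrub start -B /srv/dev-disk-by-label-storj    # -B waits for the scrub to finish and prints a summary
btrfs scrub status /srv/dev-disk-by-label-storj      # confirms no checksum or read errors were reported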

  1. Would someone please educate me on what happened :slight_smile:? (And how do I maintain/prevent this?)
  2. I was thinking of starting another node in a production VM, but should I try to “repair” this one?
  3. How do I gracefully stop this one?

Thank you!

I love the project and am glad to be part of it.

Best regards,
Reynald

Looks like it completely lost the drive with the files. How did you mount the drive, is it a static mount? Also, how is this drive attached to the system?

Thank you
Yes, it is a static mount:

#!/bin/bash

identity_dir="/srv/dev-disk-by-label-storj/storagenode/identity"
storage_dir="/srv/dev-disk-by-label-storj/storagenode/storage"

docker run -d --restart unless-stopped --stop-timeout 300 \
 -p 28967:28967 -p 14002:14002 \
 -e WALLET="mywallet" \
 -e EMAIL="mymail" \
 -e ADDRESS="mydomain:28967" \
 -e STORAGE="1.71TB" \
 --mount type=bind,source="$identity_dir",destination=/app/identity \
 --mount type=bind,source="$storage_dir",destination=/app/config \
 --name storagenode storjlabs/storagenode:latest
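The static mount itself is just an fstab entry along these lines (a sketch; the real entry on my system uses the actual UUID and OMV's default options):

# /etc/fstab (sketch)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /srv/dev-disk-by-label-storj  btrfs  defaults,nofail,noatime  0  0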

I had successful transfers after the failed audit.

I had no problems on the other satellites, in the same time frame, on the same node, with data on the same disk.

Anyway, I have these ‘delete failed’ errors on other satellites too:
2021-03-09T14:16:28.570Z ERROR piecedeleter delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "QTCD3XTHTWTCVZDVM3HPPLUZXRRRLKG5ESVWS5ROO6O6HTCIGCFQ", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore).Stat:99\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:239\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Delete:220\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:298\n\tstorj.io/storj/storagenode/pieces.(*Deleter).work:135\n\tstorj.io/storj/storagenode/pieces.(*Deleter).Run.func1:72\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

How are the drives attached?

USB passthrough to the VM.

Edit: only one drive, only one satellite failed, the others are working properly.

So I see an issue with this, because USB passthrough doesn't always come up fast enough when booting.

The reason for this is that the other satellites have less data. The more data a satellite has on your node, the greater the chance of getting DQed from missing files. Compare how much data is on all the satellites.

Thank you,

Here is what I have:

Domain Name Node ID Space Used
us2.tardigrade.io:7777 12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo 74.19 MB
saltlake.tardigrade.io:7777 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE 1.59 GB
asia-east-1.tardigrade.io:7777 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 40.99 MB
us-central-1.tardigrade.io:7777 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S 0.84 GB
europe-west-1.tardigrade.io:7777 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs 65.05 MB
europe-north-1.tardigrade.io:7777 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB 112.97 MB
The DQ’ed satellite is the one with the most data.

And here is my storage speed:

root@OpenMediaVault:/srv/dev-disk-by-label-storj/storagenode# hdparm -T /dev/sdj1

/dev/sdj1:
Timing cached reads: 11528 MB in 2.00 seconds = 5768.69 MB/sec
root@OpenMediaVault:/srv/dev-disk-by-label-storj/storagenode# dd if=/dev/zero of=test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.00147 s, 179 MB/s
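(A big sequential dd mostly measures throughput; the node does many small synced writes, so a rough small-block test would be something like this, just as an illustration:)

dd if=/dev/zero of=test2.img bs=4k count=1000 oflag=dsync    # 1000 small synchronous writes
rm test1.img test2.img                                       # clean up the test files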

So I'm not 100% sure, but is OpenMediaVault like an OS for a NAS?

Also, when you set up the static mount, did you set it up so it comes up at boot?

I see you're using two different paths; which one is the static mount: /srv/dev-disk-by-label-storj/storagenode or /dev/sdj1?

Yes, OMV is a NAS system, Debian-based.

This is on a server-grade system, always on, with all drives mounted at boot, and the mount is a prerequisite for docker to start.

/dev/sdj1 is the partition of the drive mounted (by its UUID) on /srv/dev-disk-by-label-storj/storagenode, then mounted in the docker container:
identity_dir="/srv/dev-disk-by-label-storj/storagenode/identity"
storage_dir="/srv/dev-disk-by-label-storj/storagenode/storage"
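(The "mount before docker" prerequisite is a systemd drop-in along these lines; a sketch, and the drop-in file name is just an example:)

# /etc/systemd/system/docker.service.d/storj-mount.conf
[Unit]
RequiresMountsFor=/srv/dev-disk-by-label-storj
# systemd will then not start docker until this mount point is mounted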

I’m quite sure it's not a mount problem, because other satellites had successful traffic around the time of the failed audit.

Yeah, like I said, the other satellites have less data. If you rebooted your server and somehow docker started before the USB drive was fully up (because of the USB passthrough), that would be why it failed audits. Is OMV on a VM as well?

OMV is on a VM, and the storj docker container is launched in OMV.

I have had 100% uptime since the node started, no reboots.

Yeah, that may be true, but something had to have happened for it to lose data. Data doesn't just disappear, not after running for one day. Either you have a bad drive or the drive disconnected at some point.

What you should do is run an Ubuntu VM and pass the USB drive through to that instead, so you can use ext4, then see if you run into the same issues.

Yes, thank you, but I guess it won't make a difference. I chose btrfs over ext4 because of its checksumming capability.

I have a satellite that is now at the same amount of data as the DQ’ed one. Let's see if it passes future audits.

To change the filesystem I will need to format the drive, and therefore kill the node. As it is too young to gracefully exit, how can I stop it safely with respect to the data? Do I just make a dirty stop and the others will repair, or what?

It's only 1 day old, so you can just delete it and start over again.
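For the removal itself it is just the usual stop and remove, roughly (container name as in your run command):

docker stop -t 300 storagenode   # allow up to 300 s for in-flight transfers to finish
docker rm storagenode            # remove the container; the old identity and data can then be deleted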

OK, thank you for all your help, I will do it like that.

Just started over with a fresh Debian VM, same disk as before (emptied), and a docker install of a storage node with a new identity and authorization token.

I would recommend using ext4 over btrfs:
https://forum.storj.io/tag/btrfs

A disconnected disk will not cause a DQ, unless you run SETUP=true more than once. If the disk is missing, the node will not start; if it was already running, it will crash.
But btrfs has some weird behaviors and can be incredibly slow in some cases. So slow that the node can fail to provide a piece 3 times within the 5-minute timeout, which counts as a failed audit. If the node is new, the number of total audits is low, so even one failed audit is enough to disqualify your node.
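If you want to keep an eye on the scores yourself, the web dashboard you already expose on port 14002 shows them per satellite, and the same data can be pulled from its API, roughly like this (the exact endpoint and field names may differ between versions):

curl -s http://127.0.0.1:14002/api/sno/satellites   # JSON backing the dashboard's per-satellite view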

Hello,

Thank you @Alexey

The new node is started and the logs are “almost” clean; the only failure is:
2021-03-09T22:54:36.850Z INFO failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
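(If this needs fixing later, the linked quic-go wiki suggests raising the kernel's UDP receive buffer limit, roughly:)

sysctl -w net.core.rmem_max=2500000   # temporary; put net.core.rmem_max=2500000 in /etc/sysctl.conf to persist it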

The btrfs filesystem is mounted with these options:
rw,nofail,noatime,ssd,discard,nodatacow,noautodefrag,compress=no,space_cache,inode_cache

I’ll let you know in a couple of days.
I have already downloaded more chunks than before starting over.
Maybe the earlier failure came from when I migrated the data from a virtual disk to the physical disk.

The error message is related to a new feature to use QUIC for data transfers. It's not finished yet, and thus the error is at the INFO level. You can ignore it at the moment, but you will probably need to fix it in the future, when we enable QUIC for customers.

USB isn't designed for 24/7 operation; it will in some cases disconnect and reconnect.
If I were to use USB, I would use a known-good USB controller / adapter card dedicated to the HDDs.

One of the issues with USB is when you start connecting and disconnecting things on the bus.
This can disrupt the connections; power management can also be an issue, or simply a controller that isn't perfectly stable and reboots the bus from time to time.

Generally, try to avoid USB for storage nodes if possible. That being said, since your node only has issues with one satellite, I doubt the issue was USB related.