Looks like it completely lost the drive with the files. How did you mount the drive? Is it a static mount? Also, how is this drive attached to the system?
I had no problems on other satellites in the same time frame, on the same node, with data on the same disk.
Anyway, I have these “delete failed” errors on other satellites as well:
2021-03-09T14:16:28.570Z ERROR piecedeleter delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "QTCD3XTHTWTCVZDVM3HPPLUZXRRRLKG5ESVWS5ROO6O6HTCIGCFQ", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore).Stat:99\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:239\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Delete:220\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:298\n\tstorj.io/storj/storagenode/pieces.(*Deleter).work:135\n\tstorj.io/storj/storagenode/pieces.(*Deleter).Run.func1:72\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
So I see an issue with this, because USB passthrough doesn't always come up fast enough during boot.
The reason is that the other satellites have less data. The more data a satellite holds, the greater the chance of getting DQed for missing files. Compare how much data is on all the satellites.
Domain Name                         Node ID                                               Space Used
us2.tardigrade.io:7777              12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo    74.19 MB
saltlake.tardigrade.io:7777         1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE    1.59 GB
asia-east-1.tardigrade.io:7777      121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6   40.99 MB
us-central-1.tardigrade.io:7777     12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S   0.84 GB
europe-west-1.tardigrade.io:7777    12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs   65.05 MB
europe-north-1.tardigrade.io:7777   12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB   112.97 MB
The DQ’ed satellite is the one with the most data.
This is on a server-grade system, always on, with all drives mounted at boot, and the mount is a prerequisite for Docker to start.
/dev/sdj1 is the partition of the drive, mounted (by its UUID) on /srv/dev-disk-by-label-storj/storagenode and then mounted into the docker container:
identity_dir="/srv/dev-disk-by-label-storj/storagenode/identity"
storage_dir="/srv/dev-disk-by-label-storj/storagenode/storage"
I'm quite sure it's not a mount problem, because the other satellites had successful traffic around the time of the failed audit.
Yeah, like I said, the other satellites have less data, so if you rebooted your server and Docker somehow started before the USB drive was fully up (because of USB passthrough), that would explain the failed audits. Is OMV in a VM as well?
Yeah, that may be true, but something has to have happened for it to lose data. Data doesn't just disappear, not after only one day of uptime. Either you have a bad drive or the drive disconnected at some point.
What you should do is run an Ubuntu VM and pass the USB through to it instead, so you can use ext4, then see if you run into the same issues.
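For anyone who does go that route, here is a rough sketch of moving the data onto an ext4 filesystem. The device name and the target mount point are placeholders, and the node should be stopped before copying:
# Stop the node, format a *spare* disk as ext4 (this destroys its contents), then copy
docker stop -t 300 storagenode
sudo mkfs.ext4 /dev/sdX1                      # placeholder device
sudo mkdir -p /mnt/storj-ext4                 # placeholder mount point
sudo mount /dev/sdX1 /mnt/storj-ext4
sudo rsync -aHAX --progress /srv/dev-disk-by-label-storj/storagenode/ /mnt/storj-ext4/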
Yes, thank you, but I guess it won't make a difference. I chose btrfs over ext4 because of its checksumming capability.
I have a satellite that is now at the same amount of data as the DQ'ed one. Let's see if it passes future audits.
To change the filesystem I would need to format the drive and therefore kill the node. As it is too young to gracefully exit, how can I stop it safely with regard to the data? Do I just do a dirty stop and let the others repair, or what?
A disconnected disk will not cause a DQ, unless you run with SETUP=true more than once (see the sketch below). If the disk is missing, the node will not start; if it was already running, it will crash.
But btrfs has some weird behaviors and can be incredibly slow in some cases. So slow that it can fail to provide a piece 3 times within the 5-minute timeout. If a node is new, the total number of audits is low, so even one failed audit is enough to disqualify your node.
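Regarding the SETUP=true remark above, a minimal sketch of the one-time setup step, assuming the official storjlabs/storagenode image and the paths from the earlier post (wallet/address/port flags omitted):
# Run ONCE per identity/disk: creates the node's config structure on the mounted drive
docker run --rm -e SETUP="true" \
  --mount type=bind,source="/srv/dev-disk-by-label-storj/storagenode/identity",destination=/app/identity \
  --mount type=bind,source="/srv/dev-disk-by-label-storj/storagenode/storage",destination=/app/config \
  --name storagenode storjlabs/storagenode:latest
# Every later start must omit SETUP=true; re-running setup against an empty or
# unmounted directory creates a "fresh" node that then fails audits for the old data.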
A new node is started and the logs are almost clean; the only complaint is: 2021-03-09T22:54:36.850Z INFO failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
The btrfs filesystem is mounted with these options: rw,nofail,noatime,ssd,discard,nodatacow,noautodefrag,compress=no,space_cache,inode_cache
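Purely for reference, those options as an /etc/fstab line would look roughly like this; the UUID is a placeholder and the mount point is taken from the earlier post:
# Placeholder UUID; options as listed above, nofail so boot continues if the disk is absent
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /srv/dev-disk-by-label-storj/storagenode btrfs rw,nofail,noatime,ssd,discard,nodatacow,noautodefrag,compress=no,space_cache,inode_cache 0 0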
I'll let you know in a couple of days.
I have already downloaded more chunks than before the restart after the crash.
Maybe something went wrong earlier, when I migrated the data from a virtual disk to a physical disk.
The error message is related to a new feature that uses QUIC for data transfers. It's not finished yet, which is why the message is at the INFO level. You can ignore it for now, but you will probably need to fix it in the future, when we enable QUIC for customers.
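When that time comes, the usual fix on Linux, as described on the linked quic-go wiki page, is to raise the kernel's maximum UDP receive buffer; a sketch:
# Raise the maximum UDP receive buffer so QUIC can get the size it asks for
sudo sysctl -w net.core.rmem_max=2500000
# Persist the setting across reboots (file name is an arbitrary choice)
echo 'net.core.rmem_max=2500000' | sudo tee /etc/sysctl.d/99-udp-rmem.conf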
USB isn't designed for 24/7 operation; it will, in some cases, disconnect and reconnect…
If I were to use USB, I would use a known-good USB controller / adapter card dedicated to the HDDs.
One of the issues with USB is when you start connecting and disconnecting things on the bus.
This can disrupt the connections; power management can also be an issue, or the controller simply isn't perfectly stable and resets the bus from time to time.
Generally, if possible, try to avoid USB for storage nodes. That being said, since your node only has issues with 1 satellite, I doubt the problem was USB related.