What is going on and how is this fixed?

Still trying to get a stable system going after the bandwidth.db corruption/failure

Every night Storj shuts down because of reported shard access and permission issues, yet I’ve been able to browse right to the files Storj complains about, which has me questioning the fabric of reality.

The network between the Storj VM and the iSCSI Synology is very stable: no changes in 2+ years, and the rest of the VMs passing through the same interfaces/connections show no data connectivity issues.

Disk write cache is now turned off, and I’ve confirmed that SYSTEM is the owner of the mounted drive.
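
For reference, a quick way to confirm the owner from PowerShell (just a sketch; E:\Storj is the data path from the error below):

```powershell
# Print the owner of the Storj data folder
# (expecting NT AUTHORITY\SYSTEM when the node runs as a Windows service)
(Get-Acl "E:\Storj").Owner
```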

I don’t mind running a command that re-applies permissions to all files and folders, but I’d need guidance on the syntax.

Error is here:

```
2023-03-16T20:53:17.720-0400	FATAL	Unrecoverable error	{"error": "FindNextFile E:\\Storj\\blobs\\6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa\\af: The system cannot find the file specified.; v0pieceinfodb: unable to open database file: The handle is invalid.; FindNextFile E:\\Storj\\blobs\\arej6usf33ki2kukzd5v6xgry2tdr56g45pp3aao6llsaaaaaaaa\\22: The system cannot find the file specified.; open E:\\Storj\\blobs\\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\\22: Access is denied.; open E:\\Storj\\blobs\\qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa\\24: Access is denied.; FindNextFile E:\\Storj\\blobs\\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\\2c: The system cannot find the file specified.; FindNextFile E:\\Storj\\blobs\\v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa\\27: The system cannot find the file specified.", "errorVerbose": "group:\n--- FindNextFile E:\\Storj\\blobs\\6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa\\af: The system cannot find the file specified.\n--- v0pieceinfodb: unable to open database file: The handle is invalid.\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).getAllPiecesOwnedBy:68\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).WalkSatelliteV0Pieces:97\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:528\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:681\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n--- FindNextFile E:\\Storj\\blobs\\arej6usf33ki2kukzd5v6xgry2tdr56g45pp3aao6llsaaaaaaaa\\22: The system cannot find the file specified.\n--- open E:\\Storj\\blobs\\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\\22: Access is denied.\n--- open E:\\Storj\\blobs\\qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa\\24: Access is denied.\n--- FindNextFile E:\\Storj\\blobs\\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\\2c: The system cannot find the file specified.\n--- FindNextFile E:\\Storj\\blobs\\v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa\\27: The system cannot find the file specified."}
2023-0
```

Perhaps it’s time to reconsider your solution: any network-attached storage is unreliable.

You need to check why iSCSI has started to have problems in your setup. It usually requires a separate network; did you do this?

Regarding permissions, it is usually enough to apply the user’s full permissions (recursively) to the data location.
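
On Windows, something like the following should do it (a sketch, not an official procedure; adjust E:\Storj and the account to match your setup):

```powershell
# Take ownership of the data location recursively (answer Yes to any prompts)
takeown /F "E:\Storj" /R /D Y

# Grant SYSTEM full control, inherited by files (OI) and subfolders (CI),
# applied recursively to the whole tree (/T)
icacls "E:\Storj" /grant "SYSTEM:(OI)(CI)F" /T
```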

The iSCSI connection is on a separate physical interface and VLAN on the network

The network is rock solid with no interface, buffer, or overrun errors across the router, switch, or NAS

Is there a flag I can set on the node to run a fetch/repair on all the shards that are causing trouble, instead of shutting down every night?

Is there any sort of “scrubbing” the node can run to validate shard integrity, catch this kind of issue before it becomes a major headache, and repair/retry to a different directory?

I can archive the Storj data and migrate to a new iSCSI slice but it will take a while to sync to/from the array…

The storagenode is not a replacement for OS tools; it does only what it was designed for: giving customers access to your storage.
Database repair you need to do yourself. Lost pieces cannot be recovered to the same node, only to a different one, and only when the repair job is triggered on one of the repair workers (they are part of the satellite) because the number of healthy pieces for the segment has dropped below the repair threshold.
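
If it is the SQLite databases that are damaged (as with bandwidth.db earlier), a rough sketch of the usual check-and-rebuild flow with the sqlite3 CLI, with the node stopped (the official docs describe the full procedure):

```sh
# Check the database for corruption; "ok" means the file is healthy
sqlite3 bandwidth.db "PRAGMA integrity_check;"

# If it is malformed: dump whatever is still readable, then rebuild
sqlite3 bandwidth.db ".dump" > dump.sql
sqlite3 bandwidth_new.db < dump.sql
# replace the old file with bandwidth_new.db once the rebuild succeeds
```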

You may also migrate with rsync while the node is running: How do I migrate my node to a new device? - Storj Node Operator Docs
To speed up the process you may reduce the allocation below the used space, and your node will stop accepting any new data.
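
A minimal sketch of that flow (placeholder paths; the linked doc is the authoritative version):

```sh
# Passes 1..n while the node is still running; repeat until the delta is small
rsync -aP /mnt/storj/ /mnt/new-location/

# Then stop the node and do one final pass with --delete for an exact copy
rsync -aP --delete /mnt/storj/ /mnt/new-location/
```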

A way for Storj to retry writing shards to a different location/directory, plus an early warning of legitimately failed writes in this unfortunate scenario, would be worth its weight in gold.

I did find the array running much slower than normal, and the cause was Windows defragmentation running in the background.

With defrag disabled, I hope stability continues its upward trend.
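
For anyone else hitting this: the background defrag is driven by the built-in ScheduledDefrag task, and a sketch of checking and disabling it from PowerShell looks like this:

```powershell
# Inspect the built-in nightly defrag task
Get-ScheduledTask -TaskPath "\Microsoft\Windows\Defrag\" -TaskName "ScheduledDefrag"

# Disable it so it stops competing with the node for disk I/O
Disable-ScheduledTask -TaskPath "\Microsoft\Windows\Defrag\" -TaskName "ScheduledDefrag"
```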

I am working on the migration path off iSCSI to dedicated storage; I just need Storj to behave for a little while longer.

As an update, I enabled jumbo frames on the network and on the servers

Performance and stability measurements are pending.
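
One quick way to verify jumbo frames end to end from Windows (a sketch; substitute the NAS address) is a don’t-fragment ping sized for a 9000-byte MTU, i.e. 8972 bytes of payload plus 28 bytes of IP/ICMP headers:

```
ping -f -l 8972 <NAS-IP>
```

If it comes back with “Packet needs to be fragmented but DF set”, something in the path is still at a smaller MTU.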