Docker restart loop after hard disk was unplugged

Hi,

My node has been offline for a few hours now; I accidentally unplugged my USB drive and the Docker container is not starting anymore.
Here are some logs. I don't know which error is fatal, nor how to fix them, if that's even possible.
Any help appreciated, thanks :slight_smile:

2020-09-29T09:29:48.777Z	ERROR	piecestore:cache	error getting current space used calculation: 	{"error": "pieces error: failed to enumerate satellites: readdirent: bad message", "errorVerbose": "pieces error: failed to enumerate satellites: readdirent: bad message\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:644\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:56\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.777Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "attempts": 1, "error": "ping satellite error: rpccompat: context canceled", "errorVerbose": "ping satellite error: rpccompat: context canceled\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.778Z	INFO	contact:service	context cancelled	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2020-09-29T09:29:48.778Z	INFO	contact:service	context cancelled	{"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2020-09-29T09:29:48.779Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 1, "error": "ping satellite error: rpccompat: context canceled", "errorVerbose": "ping satellite error: rpccompat: context canceled\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.779Z	INFO	contact:service	context cancelled	{"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2020-09-29T09:29:48.779Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "attempts": 1, "error": "ping satellite error: context canceled", "errorVerbose": "ping satellite error: context canceled\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:138\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.779Z	INFO	contact:service	context cancelled	{"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2020-09-29T09:29:48.780Z	ERROR	nodestats:cache	Get pricing-model/join date failed	{"error": "context canceled"}
2020-09-29T09:29:48.780Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite error: context canceled", "errorVerbose": "ping satellite error: context canceled\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:138\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.780Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 1, "error": "ping satellite error: rpccompat: context canceled", "errorVerbose": "ping satellite error: rpccompat: context canceled\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.780Z	INFO	contact:service	context cancelled	{"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2020-09-29T09:29:48.780Z	INFO	contact:service	context cancelled	{"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2020-09-29T09:29:48.797Z	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: filestore error: open config/storage/trash/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa: bad message", "errorVerbose": "pieces error: filestore error: open config/storage/trash/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa: bad message\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:150\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:359\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.820Z	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: filestore error: open config/storage/trash/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa: bad message", "errorVerbose": "pieces error: filestore error: open config/storage/trash/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa: bad message\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:150\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:359\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.842Z	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: filestore error: open config/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa: bad message", "errorVerbose": "pieces error: filestore error: open config/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa: bad message\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:150\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:359\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.864Z	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: filestore error: open config/storage/trash/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa: bad message", "errorVerbose": "pieces error: filestore error: open config/storage/trash/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa: bad message\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:150\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:359\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.886Z	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: filestore error: open config/storage/trash/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa: bad message", "errorVerbose": "pieces error: filestore error: open config/storage/trash/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa: bad message\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:150\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:359\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-09-29T09:29:48.908Z	ERROR	piecestore	upload failed	{"Piece ID": "CSOJEWUHGAZXQNFITBNZSLRJMA55HNZWY3MAOO23JLXWFORLF2BQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "pieces error: filestore error: open config/storage/temp/blob-944619741.partial: bad message", "errorVerbose": "pieces error: filestore error: open config/storage/temp/blob-944619741.partial: bad message\n\tstorj.io/storj/storage/filestore.(*blobStore).Create:166\n\tstorj.io/storj/storagenode/pieces.(*Store).Writer:210\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:290\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:996\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
Error: pieces error: failed to enumerate satellites: readdirent: bad message

BTW, I can fetch https://version.storj.io/, so my ISP is not blocking it, and port 28967 is open.
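For reference, roughly the checks I ran (a sketch from my setup; the container name "storagenode" and the port are whatever you used in your docker run command):

    # outbound connectivity: should return a JSON document with version info
    curl -s https://version.storj.io/

    # is the node port listening locally? (28967 is my configured port)
    ss -tlnp | grep 28967

    # is the container actually running, or restarting in a loop?
    docker ps -a | grep storagenode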

Just a few hints:

  • What is your node status (you can see it on the GUI or CLI dashboard)?
  • Are you sure your USB device is mounted on your machine (if you are on Linux, there's a good chance you need to manually re-mount your disk)?
  • Did you try to restart the Docker container?
  • Did you try to restart your whole server (the one hosting the Docker container) after plugging your HDD back in?

  • Node status shows offline on the web dashboard (:14002), and it is unreachable through the GUI (I need to wait for the Storj boot to complete).
  • It is definitely mounted; I'm running Ubuntu 20.04 and have already tried to umount/remount it several times (quick checks below).
  • I tried to restart the container and even the server; still the same issue.
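For the record, this is roughly how I double-checked the mount (a sketch; /mnt/storj is a stand-in for my actual mountpoint):

    # is the disk really mounted at the expected mountpoint?
    findmnt /mnt/storj

    # list block devices, their filesystems and where they are mounted
    lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT

    # any recent I/O errors reported by the kernel?
    dmesg | grep -iE "error|i/o" | tail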

Check your disk for errors using the fsck command.
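Something along these lines (a rough sketch; /dev/sdb1, /mnt/storj and the ext4 filesystem are placeholders for your own partition, mountpoint and filesystem; stop the node first and never run fsck on a mounted filesystem):

    # stop the node so nothing writes to the disk
    docker stop -t 300 storagenode

    # unmount the data disk (adjust the mountpoint to your setup)
    sudo umount /mnt/storj

    # check and repair an ext4 partition; -f forces a full check even if it looks clean
    sudo fsck.ext4 -f /dev/sdb1

    # remount (assumes an /etc/fstab entry exists) and start the node again
    sudo mount /mnt/storj
    docker start storagenode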

It's in progress; it seems the disk does have errors.
Fingers crossed! :slight_smile:

Please keep us posted :slight_smile:
If your disk has errors, it’s not a good sign :confused:

I hope your disk is still OK and you just have data corruption. In that case you will probably need to create a new identity, since your node may be DQed.
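If it comes to that, creating a fresh identity is roughly the following (a sketch assuming the identity binary from the official docs is installed; the email:token pair is the authorization token you request from Storj):

    # generate a brand-new node identity (this can take a long time)
    identity create storagenode

    # sign it with a fresh authorization token
    identity authorize storagenode your@email.com:AUTH_TOKEN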

fsck fixed some errors and my node is up and running again!
Thank you very much :slight_smile:

I hope I won’t be disqualified in the next few days then.

2 Likes

Just got a disqualification email :frowning:
Is there any chance I can host data on this node again in the next few days?
Or do I need to start a new node from scratch with a new identity?
It's pretty harsh, since I had been online for 5 months without a second of downtime before this unfortunate event :confused:

Disqualification is permanent and you have to start from scratch now. Are you disqualified on all satellites?

Only one for now (Europe North, the main one for me).

Could you check whether the node ID in your DQ email matches the one shown on the dashboard? (Just out of interest.)

You shouldn't get DQed just for accidentally unplugging your HDD…
Are you sure that's why it got DQed?

You can get DQed from that, actually. For example:

  • the automount option (GUI) may simply mount the disk at a different, random mountpoint;
  • mounting via device name (e.g. /dev/sda1) instead of by UUID;
  • no mount entry in /etc/fstab at all;

Using the root of the drive for storage combined with one or more of the above means the data ends up stored in the mountpoint directory (i.e. on your system drive) and is then hidden when the actual mount happens (common on Unraid, because it mounts disks into userspace after the full load, when Docker has already started).
And of course, a hardware problem: the disk was unplugged abruptly, so data can be corrupted. Depending on the filesystem it could be heavily corrupted (exFAT, for example).
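A rough sketch of the safer variant (the UUID and mountpoint are examples; take the UUID from the blkid output for your own partition):

    # find the UUID of your data partition
    sudo blkid /dev/sda1

    # /etc/fstab: mount by UUID to a fixed mountpoint instead of /dev/sdX
    UUID=b14382af-cff4-4d80-b1e8-1f36b546240d  /mnt/storj  ext4  defaults  0  2

    # create the mountpoint and apply fstab without rebooting
    sudo mkdir -p /mnt/storj
    sudo mount -a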

1 Like

I'm DQed on all satellites now. Yes, my data was corrupted and fsck changed a lot of things; that's why, I think.
I will start another node today and will be more careful with my power plugs…
Thanks anyway! It's great to know the community is here to help :slight_smile:

3 Likes

Though I'm on Linux, I haven't really tried fsck,
but I know that on Windows I always avoid running chkdsk like the plague…
I've often seen it totally … over a filesystem. I'm sure it does something useful in some cases, since it can at times fix a disk that won't boot and such, but to be fair I've seen it mess with data and directories it shouldn't more often than it has helped me.

Maybe you can fix the data if you haven't deleted it already… modifications to a filesystem could very well be reversible or something… but I dunno, just saying it may be an option.

Of course, if you are DQed, recovering the data won't help much…
Really, don't let tools run on your drives if they don't have to…

It would be interesting to know whether it was the disk or the repair tool that broke the data…
because if it was the repair tool, then that's a danger for most SNOs.

The data is already corrupted. Do you want to hit unknown bugs in the future, wonder why they happened, spend money on hardware, and only then figure out that the filesystem was simply corrupted and you could have saved a lot of time and money by running fsck or chkdsk in time?
It's better to find that out right now than at some point in the future. Fixing past errors is far more costly than fixing them in time.

1 Like

I just know that I've had more cases where chkdsk ruined my data than fixed it… afterwards it's all weird file and folder names… and everything worked fine before chkdsk ran.
So I basically stopped using it a long, long time ago and never really looked back… never really had problems doing without it. I mean, disks go bad at times… that's just how it is… chkdsk starts automatically and then in 30 seconds it has changed the names of 50,000 files.
That's just not optimal for my usage; I never did get around to figuring out why it would do that…

But I've seen it happen across many different computers and over vast spans of time, so I just always turned it off when I got my hands on a Windows machine… it never seemed to give me any grief doing without it on close to maybe a hundred machines over a decade or so…

If I don't have any important data on a drive, it can sometimes fix the filesystem so Windows will boot… :smiley: but that's about all I dare use it for…

chkdsk sucks IMHO, but at least chkdsk isn't my headache anymore… and neither is fsck :smiley: since I'm now on ZFS and it has its own tools for that… which seem to do a great job so far.

That is why I set up the following configuration for my HDDs (I'm using Linux):

  • each disk is mounted by UUID instead of device path (e.g. “UUID=b14382af-cff4-4d80-b1e8-1f36b546240d” instead of “/dev/sda1”);
  • each disk is mounted on a dedicated directory under /media (for example /media/storj1);
  • on each disk there are two directories: storj1_data and storj1_identity.

This way, if the Storj container restarts without the disk attached (or with it attached at the wrong place, or with the wrong HDD attached to that directory), the container stops without even trying to connect.
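In the docker run command I then bind-mount those subdirectories with --mount, which (unlike -v) refuses to start the container when the source path does not exist, e.g. when the disk is not mounted. A trimmed sketch using the layout above (ports, wallet/email/address and storage-size flags omitted):

    docker run -d --restart unless-stopped --stop-timeout 300 \
        --mount type=bind,source=/media/storj1/storj1_identity,destination=/app/identity \
        --mount type=bind,source=/media/storj1/storj1_data,destination=/app/config \
        --name storagenode storjlabs/storagenode:latest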

2 Likes