Storagenode keeps writing data after the mounted drive is lost

Two cases - same issue.

Situation 1: Looks like there is something wrong with the external HDD or its enclosure (haven’t sorted that out yet). The problem is that once every 1 or 2 months the RPi loses the mounted folder. The only thing that helps is restarting the RPi; after a restart the mount comes back OK.

Situation 2: RAID6, lost 1 drive (out of 10), and also lost the mounted folder. It should not be like this - RAID6 can lose 2 drives and keep working. I have no idea why it happened, but that’s not the point. Ubuntu runs on mirrored SSDs, so the system keeps working fine, just without the mounted folder :expressionless: After a restart, same story as with the RPi: the disk came back, the mount point as well, and everything works fine.

BUT….

In both scenarios docker/storagenode keeps running. But instead of writing data to the external drive / RAID, it starts writing from zero to the RPi’s SD card and to the RAID1 where Ubuntu is installed. The mounted folder (when the drive is not mounted) is always empty, but after these two situations I see new DBs with new files there. Zabbix also shows the data growing from 0 again.

Instead of detecting the failure, storagenode keeps writing data to the wrong folder (well, it’s the same folder, but not the same drive). This will lead to a problem: once the satellite tries to retrieve data that was written to the SD card / RAID1 instead of the mounted external drive / RAID6, it will not find that data.

In my opinion storagenode should always check for this and stop working once the mounted drive is lost. Because instead of 2 DBs I now have 4 DBs :expressionless:

How to deal with it?
Will it be disqualified for sure in the future, once it cannot find the data written to the SD card / RAID1? Should I kill these nodes myself and start again, or is there any way to merge the data?

Try creating a folder inside the mounted drive and telling the node to use that folder for its data.

Do this for example:

mount /dev/sda1 /mnt
mkdir /mnt/node
docker run -d --restart unless-stopped ... \
--mount type=bind,source="/mnt/node",destination=/app/config

Now when the mounted drive goes away, the folder will go away too and the node will encounter errors because it tries to write to a non-existent folder.
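
You could also check that the drive is really mounted before starting the container. A minimal sketch, assuming the drive is mounted at /mnt and the container already exists and is named storagenode (both are just examples):

# Refuse to start the node if /mnt is not a real mountpoint
# (when the drive is gone, /mnt is just an empty folder on the SD card)
mountpoint -q /mnt || { echo "/mnt is not mounted, not starting the node"; exit 1; }
docker start storagenode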

HA HA HA :slight_smile: such a simple workaround :smiley:

What about the data that was written to the SD / RAID1?
Will it cause me trouble in the future, once that data is not found?

Yes. I do not know how you can merge the two databases, but if you do not then you will start failing audits.

How much data are we talking here? A few MB, you might get away with it. Hundreds of MB and you will likely be disqualified. Personally I would try running it (with your original data/db) and see what happens. I think there is still some leeway prior to production release.

and when it encounters errors because the folder doesn’t exist, does it shut down the node safely?

I’ve thought about adding a check to periodically see if the path is available. If the path fails, a script would run to shut down the node until it can be restored.
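
Something minimal like this is what I have in mind - just a sketch; the container name storagenode and the file /mnt/storj/config.yaml are examples, and the file checked has to be something that only exists on the real drive:

# Stop the node as soon as the storage path disappears
while true; do
    if [ ! -f /mnt/storj/config.yaml ]; then
        docker stop storagenode
    fi
    sleep 60
done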

I have never tried it. It should, at least, return an error to the satellite because it is unable to write or read a file, which means that you would be failing audits while the drive was not mounted.

I do not really want to test this on my node, or create a new node just for testing this.

It is not that old a node. Currently it has ±40 GB, while with the missing drive only 2-3 GB were created on the SD card :expressionless: But the node is running fine with the drive reconnected. Let’s wait and see :slight_smile:

I would not want to continue with a node that lost 2-3 GB out of 40 GB total. Pretty sure it won’t survive that.

Better to start again with a new identity than get DQed in a month. With 40 GB your node is probably not older than 1-2 months?

You can merge the blobs folder from the mountpoint to the drive (see the example commands after the list):

  1. Stop the node
  2. Unmount the drive
  3. Mount the drive to another, temporary path
  4. Move the blobs to the drive
  5. Unmount the temporary path
  6. Mount the drive back to its original place
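
In commands, the steps above could look roughly like this - a sketch only; /dev/sda1, /mnt/storj, /mnt/temp and the container name storagenode are examples, adjust them to your setup:

docker stop -t 300 storagenode
umount /mnt/storj                 # the folder now shows the data written to the SD card / RAID1
mkdir -p /mnt/temp
mount /dev/sda1 /mnt/temp         # mount the external drive on a temporary path
# copy the new pieces without overwriting existing ones; remove them from the SD card once verified
cp -an /mnt/storj/storage/blobs/. /mnt/temp/storage/blobs/
umount /mnt/temp
mount /dev/sda1 /mnt/storj        # mount the drive back to its original place
docker start storagenode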

You can also merge the databases - unload the data from the new database and load it into the old one.
How to unload and load data is described here: https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed-

Make a backup of the databases before proceeding.
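
As an alternative to the dump and load from that article, the merge can also be sketched with sqlite3 directly. This is only an illustration - bandwidth.db, the table bandwidth_usage and the path to the new database are example names, and it would have to be repeated for every table in every database:

# make a copy of the old database first
cp bandwidth.db bandwidth.db.bak
# pull the rows from the new database (the one created on the SD card) into the old one
sqlite3 bandwidth.db "ATTACH DATABASE '/mnt/temp/bandwidth.db' AS newdb;
INSERT OR IGNORE INTO bandwidth_usage SELECT * FROM newdb.bandwidth_usage;"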

I created a script for iSCSI users, or users with network storage mounted under /mnt. The script checks every 60 seconds to see if the folder is available. If the folder is no longer there, the storagenode container is shut down until the folder returns; once it does, the storagenode is started back up.

If you plan on doing maintenance, run crontab -e and comment out the script to stop it temporarily. You may also want to make watchtower stop the script if it’s taking the node down for maintenance.
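
For reference, a stripped-down sketch of this kind of check (not the actual script - the container name storagenode, the check file /mnt/storagenode/config.yaml and the script path are examples):

#!/bin/bash
# /usr/local/bin/storj-check.sh
# stop the node while the storage path is gone, start it again once the path comes back
if [ -f /mnt/storagenode/config.yaml ]; then
    if [ "$(docker inspect -f '{{.State.Running}}' storagenode 2>/dev/null)" != "true" ]; then
        docker start storagenode
    fi
else
    docker stop storagenode
fi

# crontab -e entry - run the check every minute
* * * * * /usr/local/bin/storj-check.sh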

Thank you KernelPanick. Scripts are good, but I’m trying to keep the configuration as simple as possible. If there is a way to avoid scripts by moving the data into a subfolder inside the mounted drive, that will be easier. The more scripts, the more things to manage and administer :slight_smile: unless the script has some functionality that is much better than just moving the data to a subfolder.

The problem with moving it to the subfolder that you are monitoring is that the script will stop working if that drive goes down, so your node will keep running while your data is missing, leading you to DQ. The check must run in an environment isolated from the components it is monitoring (the host vs. the storj container & iSCSI drive). If you have a better idea I’ll consider moving it. I’ve been running this and tested it during a NAS upgrade and a host restart, and it seems to perform well. It may start the node too early, in which case the node will keep restarting until the full filesystem is available, but that doesn’t cause any DQ risk. I need to find a better file to monitor; currently it’s looking for the existence of the YAML file.