Disk Full But Dashboard Says It's Not

Hi everyone,

I just noticed that my disk is now completely full, but the Node Dashboard says it still has space. My sqlite3 database was corrupted and fixed before, so maybe it didn’t count all the files on my disk (i.e. some files exist on disk but are not discovered by STORJ)?

I’m using Podman to run the Docker image. My disk is 16TB (14.6TiB), and if I set STORAGE = 16TB it automatically shows a total disk space of 15.6TB, so I have set STORAGE = 15.5TB to give it 0.1TB of wiggle room. However, even with that, my disk is still completely full (literally, 0B left).

The whole partition is for STORJ so there should be no unrelated files.

Any ideas?
Stephen

Here is the result of df -T.

The recommendation is to leave 10% free space. This might be too much with a 16TB hard disk. I would suggest assigning 15TB to your node.

Edit: To get some space back you can set the reserved space to 0% with tune2fs. By default it’s 5% and only root has access to it.
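A rough sketch of what that looks like, assuming the Storj partition is /dev/sdX1 (substitute your actual device, and run as root):

# show the current reserved block count (defaults to 5% of the filesystem)
tune2fs -l /dev/sdX1 | grep -i "reserved block count"
# set the reserved space to 0% so the node can use the whole partition
tune2fs -m 0 /dev/sdX1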

I see, considering 5% of reserved space, it should only have 16 * 0.95 = 15.2TB space available. I guess that explains why Used + Trash = 15.35TB and the drive is full.

I have set it to 15TB and I’ll continue to monitor the drive space.
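For reference, the allocation is passed to the container as the STORAGE environment variable. A minimal sketch of the run command with placeholder wallet, address and paths (everything else stays as in your existing setup):

podman run -d --restart unless-stopped --stop-timeout 300 \
  -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
  -e WALLET="0xXXXXXXXX" -e EMAIL="user@example.com" -e ADDRESS="mynode.example.com:28967" \
  -e STORAGE="15TB" \
  --mount type=bind,source=/path/to/identity,destination=/app/identity \
  --mount type=bind,source=/path/to/storage,destination=/app/config \
  --name storagenode storjlabs/storagenode:latest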

Thanks for your insight!

But just out of curiosity: is it possible that some files are not in the database, and hence their sizes are not counted by STORJ? And is there any way to have STORJ validate all the files on disk against its database?

There are currently some known issues with the used-space figures in the dashboard that are being fixed (the one I know of: with 1.109, blobs moved to trash do not update the databases), so the real and dashboard values can differ, and may deviate further as time goes on.

If you want to “resync” these values for the time being, you can enable the startup filewalker (if you had disabled it) and restart the node. This triggers the node to rescan all files stored on the hard drive and update the dashboard values. The process can take a long time depending on the number of blobs stored.

Keep in mind that these discrepancies between used space in the dashboard and actual used space on the disk only affect the dashboard itself, not the calculation of payouts.

Incorrect. If the node thinks that it is full, then it reports to the satellite that it is full (=no new data coming in).

I think that is somewhat true, since I haven’t had a single byte of ingress for 7 days already…

I have never purposely disabled it, as I had not heard of it before.

Do you know which environment variable I need to set for a Docker container?
Here are some of the envs I have that are not identity related:

VERSION_SERVER_URL=https://version.storj.io
AUTO_UPDATE=true
SETUP=false
GOARCH=amd64
LOG_LEVEL=
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
container=podman
SUPERVISOR_SERVER=unix

I also would probably allocate 15TB out of your 16TB, because I have had disks fill up as well.

Interesting, the Storj node is SUPPOSED to abort a write if the disk has less than 5GB of space. Did your node experience any special kind of crash? Any exoticness in the disk setup? (It looks like plain ext4.) It would be cool if we could track down why the 5GB safety buffer fails sometimes.

And yeah, running and finishing a used-space filewalker (which could take weeks) on that node will get you a more accurate usage figure.

Also, if your disk has 0B free, then you need to free up space. As an emergency measure I would go into the node’s storage/trash/ folder, pick the oldest date-named directory available and delete some files in it, or the whole directory. Yes, it could lower your audit score, but it’s also data that is purged after 7 days anyway.
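A hedged sketch of that emergency cleanup, assuming the storage directory is mounted at /path/to/storage and trash is organised into per-satellite folders with date-named subfolders (adjust the paths to your setup and double-check before deleting anything):

# list the date-named trash folders for one satellite, oldest first
ls /path/to/storage/trash/<satellite-id>/ | sort | head
# delete the oldest date folder (example date; it would be purged after 7 days anyway)
rm -r /path/to/storage/trash/<satellite-id>/2024-07-01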

My bad, that’s the only caveat. If your node incorrectly believes it’s full and therefore stops data ingress, it will miss out on new data and therefore earn a lower payout.

On the other hand, it is still true that the data displayed in the dashboard under the space used is not directly related to the calculation of payouts.

If you haven’t disabled it, there is no need to change any environment variables.
In case you would like to disable the startup scan (it is enabled by default), you can do it by editing this line in the config.yaml file. There might be an environment variable to achieve the same result, but I cannot recall it.
# storage2.piece-scan-on-startup: true
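For example, to turn the startup scan off you would uncomment that line and set it to false, then restart the node for it to take effect:

storage2.piece-scan-on-startup: false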

If you restart your node you should be able to see this process running, indicating that a scan is in progress: /app/storagenode used-space-filewalker. This is how it appears on my system; it may look a bit different on yours, but looking for “used-space-filewalker” should be good enough. How long this takes depends on a lot of factors. For me it usually takes a few days, but it may be faster or slower on your system.
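One way to check for it from the host (assuming container processes are visible in the host process list, which is the case for Podman on Linux; the grep pattern is the important part):

ps aux | grep used-space-filewalker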

The behaviour my nodes experience is that while nodes aim to stop ingress when free disk space is within 5GB, for some reason it sometimes takes a bit longer until the node stops receiving data, having gone as much as a couple hundred gigs into overused storage.

I’m not 100% sure, but I also believe this 5GB threshold only applies to the amount of data the databases report as used. If the reported used space is lower than the actual used space, the node may go on to use more space than was allocated.

STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=true or as a command line argument after the image name --storage2.piece-scan-on-startup=true.
However, it is enabled by default, so you don’t need to specify either the variable or the command line argument.
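As a sketch (keeping all your other options unchanged), the environment-variable form adds this before the image name:

  -e STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=true

while the command-line form appends the flag after the image name:

  storjlabs/storagenode:latest --storage2.piece-scan-on-startup=true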


The node didn’t crash before. My disk setup is a 4-drive SSD RAID10 and yes it’s plain ext4.
What I did was migrate from my HDD RAID to this new SSD RAID. The procedure was as follows:

  1. I kept the node running
  2. I used rsync -a --inplace --ignore-existing HDD_RAID SSD_RAID to copy the files to my new drive and it took around 6 days
  3. After it finished, I ran the same command again, and this time it took less than 1 day.
  4. I shut down the node
  5. I ran the same command once again to copy all the remaining files to the new drive and it took around 3 hours.
  6. I re-mounted the new drive to my docker image and started the node.
  7. After a couple of days I noticed that my drive was full

Here’s my suspicion (and the reason why I wanted to see if there’s any way to verify the files against the database): while I was copying the files (over 6 days), some files were deleted by the node, but they had already been copied to the new drive. After I did the last rsync, the newest database got copied to the new drive, so it thinks those files were deleted, but in reality they still exist on the drive.

Is this kind of situation possible?

Yep, for the last rsync you run you need to use the --delete option, to delete files on the destination that shouldn’t be there. It’s even on the official documentation page: How do I migrate my node to a new device? - Storj Docs
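For reference, a sketch of that final pass, run while the node is stopped (HDD_RAID and SSD_RAID are the same placeholder paths as above; keep the same trailing-slash form you used in the earlier passes so the destination layout matches):

# final sync: copy anything new and remove destination files that no longer exist on the source
rsync -a --delete HDD_RAID/ SSD_RAID/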

So… it’s been a couple of days, so I wouldn’t try to go back and rsync anything. Bloom filters and garbage collection will gradually, over the course of several days, delete the files that “shouldn’t be there”.


It may also now disqualify the node, because it has likely received new files since then, and this rsync command would delete them as well.

@m1nicrusher I would recommend checking your databases as they may have been corrupted due to an incorrect last rsync.

I see. That makes more sense now. It was on me that I did not notice the --delete option. Luckily it looks like it’s been running without a problem for a few days now.


Thanks for the suggestion. I’ve run an integrity check on all the db files and all I got was “Page 25 is never used” in storage_usage.db, with no other errors. Looks like it’s been running well for a couple of days now.
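For anyone wanting to run the same check, it can be done per database file with the sqlite3 command-line tool (the path is an example; point it at your node’s storage directory, ideally while the node is stopped):

# prints "ok" if the file is healthy; otherwise it lists problems such as "Page N is never used"
sqlite3 /path/to/storage/storage_usage.db "PRAGMA integrity_check;"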

I would suggest fixing that database, otherwise your usage on the dashboard will likely be incorrect. You may also re-create it; the used-space-filewalker should update it with correct values in the end, but the usage will be off until it finishes.
If you would like to recreate this database, you need to follow this guide: