Bug: SN gross space limit overstep

I’ve been “struggling” for half of June: 5 TB allocated, but I ended up at -150 GB free around the 15th.
Storj is on a 13 TB drive, so nothing bad is happening, but I would have preferred it to work correctly. The trash and garbage folders add up to about 20 GB.
Wondering about allocating a bit more space now…


Finally, someone admitting to having the same problem!

Hi @Storgeez.

Thank you for reporting this issue. It is a known one. We are currently working on fixing it, so hopefully the fix will be merged in the near future.

Have a nice day.


Oh finally an official response, great! Thanks!

Any updates on this?


How about now?

Still using about 170 GB more than advertised.

It will not be reduced until customers remove their data. The fix is supposed to prevent this issue from happening again.
And since there have been no new cases in the last month, we can assume it is fixed.

But your node will keep the overusage until customers delete their data.

I don’t understand. I’m not asking customers to remove their data; I’m asking for the SN to account for the used storage properly. I’m fine with the extra used space (I have room for 200 GB extra), but I want the dashboard to reflect it properly and not hide it or report 200 GB less than I’m actually using.

What is the filesystem on your drive?
What is reported by the web dashboard now?
What is reported by the CLI dashboard now?
How much is allocated in the config (binary version) or the STORAGE option (docker version)?
What is the actual usage in blobs? (in SI units)
What is the actual usage in trash? (in SI units)

du -hcd 1 --si /mnt/y/storagenode/storage

What the size of all databases? (in SI units)

du -hc --si /mnt/y/storagenode/storage/*.db

Filesystem is ZFS.
Web dashboard is reporting 7.15 TB total, 6.98 TB used, 164.80 GB free, 1.35 GB trash.
CLI dashboard is reporting 6.98 TB used, 164.80 GB free.
Allocated space is 6.5 TiB.
“du -hcd 1 --si” for blobs folder returns 7.1T.
“du -hcd 1 --si” for trash folder returns 1.6G.
“du -hc --si” for *.db returns 964M.

7.1 TB + 1.6 GB + 964 MB = 7.102564 TB
7.15 TB - 7.102564 TB = 0.047436 TB, so 47.436 GB is actually free in your allocation.
So your local database is missing 164.80 GB - 47.436 GB = 117.364 GB of space that is used on your disk.

How much space is used by the Stefan satellite?

du -hs --si /mnt/y/storagenode/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa

Well, a bit more than that due to the low-precision output from the command but, yes, approximately.

94 MB. Why didn’t that get deleted if the satellite is no more?

We can try to fix the issue with the local database, but I’m not sure whether it will help.
We currently do not have any automatic fixing tools, since the database corruption percentage is low.

  1. Stop the storagenode
  2. Create a backup of the piece_spaced_used.db database
  3. Remove the piece_spaced_used.db database
  4. Open the database either with a local sqlite3 (make sure the version is not older than v3.25.2) or with a docker version (see https://support.storj.io/hc/en-us/articles/360029309111 for reference); specify the correct path to piece_spaced_used.db:
sqlite3 piece_spaced_used.db
  5. When you see a sqlite> prompt, execute this script:
CREATE TABLE versions (version int, commited_at text);
CREATE TABLE piece_space_used (
    total INTEGER NOT NULL DEFAULT 0,
    content_size INTEGER NOT NULL,
    satellite_id BLOB
);
CREATE UNIQUE INDEX idx_piece_space_used_satellite_id ON piece_space_used(satellite_id);
insert into versions values(29, datetime('now', 'utc'));
insert into versions values(30, datetime('now', 'utc'));
insert into versions values(31, datetime('now', 'utc'));
.exit
  6. Start the storagenode
  7. Let it work with the disk (it could take a few hours for a full tree traversal)
  8. Check the usage on the dashboards
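If it helps, the database steps can be run as one non-interactive shell session; this is a sketch, assuming the storage path used earlier in this thread (adjust it to your node), with the same schema script fed to sqlite3 via a here-document:

```shell
# Recreate an empty piece_spaced_used.db (sqlite3 >= v3.25.2).
# Stop the storagenode first. Path is an example from this thread.
cd /mnt/y/storagenode/storage

# Back up, then remove, the space-usage database.
cp piece_spaced_used.db piece_spaced_used.db.bak
rm piece_spaced_used.db

# Recreate the empty schema non-interactively.
sqlite3 piece_spaced_used.db <<'EOF'
CREATE TABLE versions (version int, commited_at text);
CREATE TABLE piece_space_used (
    total INTEGER NOT NULL DEFAULT 0,
    content_size INTEGER NOT NULL,
    satellite_id BLOB
);
CREATE UNIQUE INDEX idx_piece_space_used_satellite_id ON piece_space_used(satellite_id);
insert into versions values(29, datetime('now', 'utc'));
insert into versions values(30, datetime('now', 'utc'));
insert into versions values(31, datetime('now', 'utc'));
EOF
```

Then start the storagenode again and let it rescan the disk.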

I thought the shutdown of the satellite meant all data from that satellite would get deleted on the nodes, not that each piece would be removed individually.

But regarding piece_spaced_used.db: does it hold the sum of the space used by all the pieces, computed from the individual files on the drive rather than from the piece database?

The last corruption I had was a year or so ago; it was fixed, and I haven’t had any ungraceful shutdowns since then. Unclear what happened to the database. I ran the commands as you requested, and the rebuild is in progress now.
It seems excruciatingly slow at the moment, running at about 1 TiB/month; hopefully it speeds up. I can’t stop uploads the normal way, and I don’t want to break something.

I will report back when done.

Thanks.

For ZFS this is normal. It traverses all the disks in the RAID.
If you mean the load from the network, then I have no predictions to offer :slight_smile:

Yes, I mean the database summing. This setup should be faster than a single drive, so something’s not quite right; the disks aren’t even fully loaded, they’re just above 50%. ETA 7 months. Does this process resume normally after restarting the node?

It is holding at around 50 IOPS, which is what I normally see after restarting the SN. Perhaps some startup task is still running.

And why can’t it just pull the data out of the main database that holds metadata?

It should check the actual space usage. When it used the information from your local database with piece info, you saw a discrepancy.
So let it run to check all the pieces. Please do not restart the storagenode until it finishes.
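Once the rescan finishes, the rebuilt per-satellite totals can be read back with a simple query. This is a sketch using the storage path from earlier in this thread; stop the node (or query a copy of the file) to avoid lock contention:

```shell
# Inspect the per-satellite totals recorded in piece_spaced_used.db.
# hex() renders the satellite_id BLOB readably; total and content_size
# are in bytes.
sqlite3 /mnt/y/storagenode/storage/piece_spaced_used.db \
  "SELECT hex(satellite_id), total, content_size FROM piece_space_used;"
```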

Oh wow, it finished already, okay. The SNOBoard might have been showing the space used by data I received during the check.

I think there’s still an issue; I’m running du now to check how much the actual used space changed. SNOBoard reports 7.01 TB used.

Okay, du finished. Total usage of the storage folder is 7.1655 TB, while the SNOBoard says 7.01 TB.

In which folder is the excess 0.15 TB?