Failed deletions? Do files use space forever?

Are failed deletions handled in any way? I have several failed deletions due to “database is locked”.
The log doesn’t show that the deletion was retried:

2020-05-02T09:51:56.084Z        INFO    piecestore      upload started  {"Piece ID": "IW52PEPNDIL3SZLXSBMGKC5EUOAXXGGX4OZGMD3247QL5RLFODGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 433621812560}
2020-05-02T09:51:57.396Z        INFO    piecestore      upload canceled {"Piece ID": "IW52PEPNDIL3SZLXSBMGKC5EUOAXXGGX4OZGMD3247QL5RLFODGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-05-02T09:53:35.429Z        ERROR   piecestore      delete failed   {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "IW52PEPNDIL3SZLXSBMGKC5EUOAXXGGX4OZGMD3247QL5RLFODGQ", "error": "pieces error: v0pieceinfodb error: database is locked", "errorVerbose": "pieces error: v0pieceinfodb error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Delete:163\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:286\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).DeletePieces:190\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func4:1012\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:66\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

How is your HDD connected?

They will be caught by garbage collection, though I believe you won’t see individual log lines for that. So yes, they will be cleaned up.

The error does suggest an IO bottleneck, though, so it would be good to look into what could be causing that. It could be a USB 2.0 connection, remote storage, or an SMR drive, for example.
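If you want to check both of those things yourself, a couple of quick looks can help. This is only a sketch: it assumes a Docker container named storagenode and that the sysstat package (for iostat) is installed, and the grep pattern is just a guess at how the garbage-collection (“retain”) lines are labelled in the log:

# see whether garbage collection / retain activity shows up in the log
docker logs storagenode 2>&1 | grep -i retain | tail

# watch per-disk utilization and wait times; consistently high %util or
# long await values point to an IO bottleneck
iostat -dx 5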


It’s 4 HDDs (to my understanding non-SMR, Seagate ST10000NM0156) in RAID-5, connected via SATA.

that’s a “database is locked” error…
usually it means something with user rights on the data, or that it was briefly inaccessible…

maybe your files are read-only? i think that might be a good guess… if not for you, then for the user that runs the storagenode, if you ain’t running the default.

“database is locked” - I think this happens when there is high IO and disk access is slow.

It also happens when the databases have not been “vacuumed” often and the computing platform has a low-spec processor and/or not much RAM.

I would suggest stopping the node, vacuuming the databases, and restarting the node.
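If you want to see first whether a vacuum would actually reclaim anything, SQLite can report how many free pages a database is carrying. A quick check, assuming the sqlite3 CLI is installed, the node is stopped, and orders.db is just one example:

# number of unused pages vs. total pages; a large freelist means
# VACUUM has space to reclaim
sqlite3 ./storage/orders.db "PRAGMA freelist_count; PRAGMA page_count;"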


it could be, certainly some sort of issue you need to deal with… i would start by verifying the things that are easy to deal with, like user rights, and deletion permissions on the files…

making your node faster / higher io is a bit more of a troublesome issue to deal with…

high io seems unlikely imo, i mean then you should get the same on uploads and downloads… deletions should be pretty low demand compared to download and upload, ofc they might have different priority…

also i would bet that deletions are not live, they go to the disk cache and then are done when it has time… which i suppose would sort of suggest you could be right… but on 4 drives in raid 5… then people with 1 drive would be ****ed.

you should have a good deal more IO than they have, and then there is the whole raid; depending on what you are using for the raid, some part of that setup would most likely have a cache that would just confirm a deletion and deal with it later…

you should have like 800 iops write or something along those lines before your array would even start to give you high latency… and i’ve run with nearly seconds of backlog on my drives at times…
without any issues… i’m basically on a raid 5 with 5 drives, just using ZFS
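for anyone on ZFS like this, zpool iostat gives a quick view of whether the pool is actually backing up under load (just a spot check; the pool name tank is an assumption):

# per-vdev bandwidth and operation counts, refreshed every 5 seconds
zpool iostat -v tank 5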

The server has 32 GB RAM and Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz (8 cores).

How is the database “vacuumed”? Is it a manual process?

sqlite3 orders.db "VACUUM;"


EDIT:

All databases should be vacuumed once per month or so… more often with more traffic.

If you are running Docker on GNU/Linux, this can be automated quite easily using cron and a simple Bash script. Here’s a simple script that vacuums all the databases.

# vacuum every SQLite database in the node's storage directory
# (run only while the node is stopped)
for db in ./storage/*.db
do
    sqlite3 "$db" "VACUUM;"
done
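To automate it with cron as mentioned above, a monthly entry could look roughly like this. It’s only a sketch: the container name storagenode, the script path /opt/storj/vacuum-dbs.sh, and the schedule are all assumptions to adapt to your setup:

# at 03:00 on the 1st of each month: stop the node, vacuum, start it again
0 3 1 * * docker stop -t 300 storagenode && /opt/storj/vacuum-dbs.sh && docker start storagenode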

Is that reflected / recommended anywhere in the Storj documentation?

The above script will corrupt the database, if I’m not mistaken?

The database is already open for writes by the Storj process inside Docker; writing to it at the same time from a separate sqlite3 process is something SQLite doesn’t support - it can’t end well?

As indicated in my prior post…

  • Stop the node
  • Then vacuum the databases
  • Then restart the node.

You can also add a "PRAGMA integrity_check;" command loop to ensure that the operation was successful and the databases are still intact.
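A minimal sketch of that check, assuming the node is stopped and the databases are under ./storage — each database should print “ok”:

for db in ./storage/*.db
do
    # "ok" means the database passed the integrity check
    echo -n "$db: "
    sqlite3 "$db" "PRAGMA integrity_check;"
done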

Vacuuming a database should not corrupt it… the operation rebuilds the db file and reclaims the old, unused pages that accumulate as a result of numerous row modification operations.

For those who may not understand what’s going on and why this VACUUM procedure may be necessary, you can read about it on sqlite’s documentation page.


i wouldn’t make any changes to my database without at least getting a “should be fine” from @Alexey
the database is pretty critical infrastructure for your node and should not be tinkered with lightly…

not that i’m saying it cannot be done, or that beast isn’t right… just saying i would take a bit of convincing before going in and manually changing my storagenode database…

but i’m sure @anon27637763 knows what he is talking about, otherwise i would hope he wouldn’t suggest the operation…

Feels to me like these housekeeping operations should be automated. Personally, I expect nodes to be “set it and forget it” software…

I don’t wanna have to look after them at all times, especially if this involves monitoring the logs! :confused:


I agree.

They should do something about the logs as well, integrate log rotation or something.
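Until that happens, Docker’s own json-file log driver can at least cap log growth; adding something like the following options to the usual docker run command works (the size and file count here are just example values):

# keep at most 5 rotated log files of 10 MB each for the container
--log-driver json-file --log-opt max-size=10m --log-opt max-file=5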

Email notifications should also be re-enabled and made accurate.

If our target node operator isn’t a sysadmin monitoring a datacenter of servers, we need to do as much as possible to help people know when there’s a problem and how to rectify it.


i was also annoyed about the log thing, so i made this…

kinda new to linux, but i couldn’t find anything else that let me keep the functionality of docker logs and still use the default scripts.

ended up being a bit of a long command, but essentially it’s two lines which i haven’t been able to merge into 1.

it’s all there if you expand the quote, i would love some feedback on it.

I am no Alexey, but vacuuming a db while the node is stopped is fine. I would only do it if the db is causing issues for you, though. Ideally I would like to see vacuum becoming a part of the database migration steps with each new node version; end users shouldn’t need to worry about it at all. I have seen no significant difference in size or performance after vacuuming, but then again, I’m testing on an SSD-accelerated array, which may smooth over any issues to begin with.


My orders.db typically shrinks significantly after vacuuming:

Using the script I posted in the other similar thread, along with an added ls command to check the size.

storagenode
-rw-r--r-- 1 root root 162738176 May  4 14:08 /opt/storj/storage/orders.db
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
-rw-r--r-- 1 root root 162443264 May  4 14:08 /opt/storj/storage/orders.db
storagenode

Difference:

162738176-162443264 = 294912

That’s 288 KB recovered after 6 hours of run time.

288 KB of a 162 MB file is the opposite of “significant”, especially factoring in a 6-hour run time during which your node is offline. It’s a reduction in the total file size of only about 0.2%! That doesn’t seem worth the trouble.


Database size increase is not a linear function of time running.

It’s a function of how one’s hardware works as well as how much the database is being used.

At the end of April, the orders.db was 320 MB.