Database is locked. What is the reason? What is the possible solution?

It could be an option. Depending on how the node is written, adding another DBMS could be easy (or not).
Redis would be a bit more difficult to add, but would probably perform faster.

Actually, I think this new change should fix the issue with the locked DB…
From what I can see it’s closely related to the deletion / cleanup process, which seems to eat all the concurrent connections the node will accept… which is why it looks exactly the same as when I run into my max concurrent limit during startup… or deletions.
So IMO this will fix that issue, as long as the query deletes slowly enough and doesn’t just spam hundreds or thousands of files a second.

Delete Queue
To speed up file deletion for the customer, the satellite sends the delete messages to the storage nodes as quickly as possible. The storage nodes acknowledge the delete messages as quickly as possible without executing them. They store the delete messages in a queue and delete the pieces afterward. The queue has a limit.
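A minimal sketch in Go (only an illustration, not the actual storagenode code) of what such a bounded queue could look like: deletes are acknowledged immediately, executed by a background worker, and rejected when the queue is full.

package main

import (
	"fmt"
	"os"
)

// DeleteQueue buffers delete requests so they can be acknowledged
// immediately and executed later.
type DeleteQueue struct {
	jobs chan string // paths of pieces waiting to be deleted
}

func NewDeleteQueue(limit int) *DeleteQueue {
	q := &DeleteQueue{jobs: make(chan string, limit)}
	go q.worker()
	return q
}

// Enqueue accepts a delete without executing it. It returns false when the
// queue is full ("the queue has a limit").
func (q *DeleteQueue) Enqueue(piecePath string) bool {
	select {
	case q.jobs <- piecePath:
		return true
	default:
		return false
	}
}

// worker performs the actual deletions in the background.
func (q *DeleteQueue) worker() {
	for path := range q.jobs {
		if err := os.Remove(path); err != nil {
			fmt.Fprintln(os.Stderr, "delete failed:", err)
		}
	}
}

func main() {
	q := NewDeleteQueue(10000)
	q.Enqueue("storage/blobs/example-piece")
}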

I know at least 3 ways to solve this issue without ZFS, without switching the database engine, and without SQLite tweaking :wink:
Another way is to relocate the DBs to an SSD, but we already did that.
If you are interested in a POC, please create a new topic in the right place and ask the community for help. At least one person will help you :slight_smile:

It would be a pretty neat trick to treat a symptom and have the cause vanish afterwards… I doubt the database is the issue in this case, though… I think it’s the deletions running immediately, or filling up the concurrent transfers, so that the database work times out (due to lack of bandwidth or IO). If so, nothing one does to the database will fix it, because the problem exists outside the database. One could raise how long the database is allowed to wait, but that is a crutch at best and most likely just causes other issues.

Short update from a meeting with the developers today about the db locking:

  1. The piece size is way too small and is causing an IOPS bottleneck that we need to fix. I am not sure about the details, but the long-term goal is to write bigger pieces to disk. That should improve performance.
  2. Why do we store serial numbers for 48h in a DB? We should keep them in memory and not write them to disk at all. That will solve that problem entirely (a rough sketch of the idea follows this list).
  3. We still want to write orders into a DB. If the node is restarting we can drop the serial list, but we can’t drop unsubmitted orders.
  4. Something is wrong with our sqlite implementation. Why do we have to call vacuum? That shouldn’t be needed. Autovacuum should also not be needed. It could be a problem with the Go sqlite driver. At that point I faded out. Too many details for me. I was sold on the in-memory solution already.
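For point 2, here is a rough sketch (nothing to do with the real implementation; the names are made up) of keeping used serial numbers purely in memory with their 48h expiration:

package main

import (
	"sync"
	"time"
)

type serialKey struct {
	satelliteID  string
	serialNumber string
}

// usedSerials keeps serial numbers only in memory; a restart simply drops
// the list, which (per point 3) is acceptable, unlike dropping unsubmitted
// orders.
type usedSerials struct {
	mu      sync.Mutex
	expires map[serialKey]time.Time
}

func newUsedSerials() *usedSerials {
	return &usedSerials{expires: make(map[serialKey]time.Time)}
}

// Add records a serial and reports whether it was unused before.
func (u *usedSerials) Add(satelliteID, serial string, expiration time.Time) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	key := serialKey{satelliteID, serial}
	if _, seen := u.expires[key]; seen {
		return false
	}
	u.expires[key] = expiration
	return true
}

// Cleanup drops serials whose window (e.g. 48h) has passed.
func (u *usedSerials) Cleanup(now time.Time) {
	u.mu.Lock()
	defer u.mu.Unlock()
	for key, exp := range u.expires {
		if now.After(exp) {
			delete(u.expires, key)
		}
	}
}

func main() {
	s := newUsedSerials()
	s.Add("satellite-1", "serial-1", time.Now().Add(48*time.Hour))
	s.Cleanup(time.Now())
}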
19 Likes

This feels like something a plain append-only file, cleared after every submission, would handle. Maybe even just a text file?
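A tiny sketch of that idea (purely illustrative; the file name is made up): append each unsubmitted order to a plain file and truncate it once the orders have been submitted.

package main

import (
	"log"
	"os"
)

const ordersFile = "unsent-orders.log" // hypothetical file name

// appendOrder writes one serialized order to the end of the file.
func appendOrder(serialized []byte) error {
	f, err := os.OpenFile(ordersFile, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(append(serialized, '\n'))
	return err
}

// clearOrders empties the file after a successful submission.
func clearOrders() error {
	return os.Truncate(ordersFile, 0)
}

func main() {
	if err := appendOrder([]byte(`{"order":"example"}`)); err != nil {
		log.Fatal(err)
	}
	if err := clearOrders(); err != nil {
		log.Fatal(err)
	}
}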

[this got cut and redirected to alexey last night, should have ended up here…]
I watched something quite interesting about optimization a while back… amazing stuff… I’ll see if I can find it, because you should really see it…
Mind-blowing.

Sounds good that you guys are looking into it, it’s a real issue… I’ve got 9 drives trying to keep up with 1 node, and performance drops if I look at it the wrong way… hold my byte… OH SO HEAVY

Well, doubling the piece size basically halves the IO… but that’s not always a good thing, as in our ZFS discussions… I’m no expert in the disk storage area, but it seems to me like something is utterly wrong when I cannot squeeze through more pieces than I currently am.

It must be something with how it’s ordered and written out to disk… to be honest it does feel a bit like there is no caching or structure to it… of course that’s not really what one starts thinking about when one starts to build stuff like this…
Maybe it just throws data around way too randomly… I mean, we don’t see a ton of egress, so that’s not really killing the system, and I’ve got a SLOG / write cache.
So why can’t I handle more than 5 MB a second of ingress?
I’m pretty sure I could saturate my 1 Gbit network connection with data transfers for weeks without it dropping much, even if I used the server for a ton of other stuff… which is what is so damn weird…

I’ve got 9 HDDs in 2 raidz1 vdevs, which load-balances between them. I’ve got a dedicated OS SSD / SLOG (write cache) and another dedicated 750 GB SSD for L2ARC, which takes care of any repeating jobs so they get 1-3 ms latency (600 GB allocated for L2ARC), plus 48 GB of DDR3 RAM and dual 4-core, 8-thread Xeon CPUs. I optimized my recordsizes, which granted seems to help a bit, and I tuned my ZFS as best I can.
I’m running Docker on bare metal, and my storage is on bare metal on the same system.
Granted, I’m new to Linux, but I’ve been working with related stuff for decades.
My VMs run really nicely now… never had a faster system lol, but I’ve been trying to optimize and get all of this working for two months.

And don’t get me wrong, it works pretty well… but if this is what it takes to just barely keep up… xD
In my opinion it has to be how the data is handled somewhere… something that makes HDDs slow down, which is basically random IO… not sure what I can do at Q1T1, but it’s not much above 10-20 MB sustained, if that… so really that’s what I would assume it is: not enough sequential write operations, so the program lets the host system figure out how to write and read stuff, or something like that… I dunno… my guess…

How difficult can it be to download 5 MB/s and write it sequentially… or maybe the database refuses to live in memory… but that should just move to my L2ARC with enough time… then again, the system did run fairly smoothly the last time I got up to 3 days of runtime… but I’ve had a lot of trouble lately… I think that’s over now… finally… maybe I can get some sleep then lol

Your turn to have sleepless nights, I guess lol

Tried to find the lecture on optimization, with no luck…

Basically it was this guy who wanted to do some programming and ended up optimizing decompression for the Rust manuals, I think, because they took forever on his laptop.
He ended up taking all the code apart and eventually optimized it to run 8000% faster, just because it made some bad moves with how it did its disk IO and such…

Kinda puts into perspective how much the method of operation matters.

1 Like

Just failed an audit because of the locked db. “Fun”.

Couldn’t the node just retry accessing the database if it is locked?
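For what it’s worth, a retry could look something like this (just a sketch, not how the node actually behaves; the go-sqlite3 driver also has a _busy_timeout connection option that achieves something similar):

package main

import (
	"strings"
	"time"
)

// withRetry retries an operation with a short backoff while SQLite keeps
// reporting "database is locked".
func withRetry(attempts int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !strings.Contains(err.Error(), "database is locked") {
			return err
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
	}
	return err
}

func main() {
	_ = withRetry(5, func() error {
		// a db.Exec(...) against the used-serials table would go here
		return nil
	})
}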

Also, will the logs be fixed to indicate the real upload success/cancels instead of randomly reporting a successful upload as canceled?

2020-05-13T00:45:11.532Z        INFO    piecestore      upload started  {"Piece ID": "M4OYMIMD53ZHHQ22B55AASTH5FENGHWPGDH3P7B5SD7ZZEUVDOQQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Available Space": 869332106624}
2020-05-13T00:45:14.717Z        INFO    piecestore      upload canceled {"Piece ID": "M4OYMIMD53ZHHQ22B55AASTH5FENGHWPGDH3P7B5SD7ZZEUVDOQQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}
2020-05-18T04:43:08.670Z        INFO    piecestore      download started        {"Piece ID": "M4OYMIMD53ZHHQ22B55AASTH5FENGHWPGDH3P7B5SD7ZZEUVDOQQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-05-18T04:43:18.683Z        ERROR   piecestore      download failed {"Piece ID": "M4OYMIMD53ZHHQ22B55AASTH5FENGHWPGDH3P7B5SD7ZZEUVDOQQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT", "error": "usedserialsdb error: database is locked", "errorVerbose": "usedserialsdb error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSerialsDB).Add:35\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload:523\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Download:471\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:995\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:66\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Upload canceled, but my node most likely has that piece.

I’m surprised that happened right now during low traffic on a system I’m pretty sure is highly IO optimized.
I wouldn’t worry about it though, you’re pretty unlikely to see more than one and it’ll take a whole lot more of them to get suspended. This problem is being worked on as well, so hopefully they’ll find a better solution soon.

Totally agree on the upload cancelled comment. I think the problem is that sometimes it actually is cancelled, but many times the piece does end up on your node and you make money on it anyway. I don’t think it’s exclusively a logging problem; it depends on when that cancellation takes place.

1 Like

It is a logging problem. There are two reasons for “upload canceled” messages:

  1. My node lost the race, it did not get to keep the piece.
  2. My node did not lose the race, but the uplink disconnected prematurely. My node gets to keep the piece and earns money for it.

However, the log has the same entry for both.

Let’s say I set up a new node and kept all its logs. By iterating over uploaded/deleted entries, I should be able to build a list of pieces my node should have (uploaded and not deleted). However, using the current logs I cannot do that, because there is no way to find out if “upload canceled” means my node lost the race or not.
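As a rough sketch of that bookkeeping (assuming log lines that contain “uploaded” / “deleted” in the JSON-ish format shown earlier in the thread, and acknowledging that “upload canceled” lines cannot be classified either way):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
	"strings"
)

var pieceID = regexp.MustCompile(`"Piece ID": "([A-Z0-9]+)"`)

func main() {
	f, err := os.Open("node.log") // path is just an example
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	have := map[string]bool{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		m := pieceID.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		switch {
		case strings.Contains(line, "uploaded"):
			have[m[1]] = true // piece confirmed stored
		case strings.Contains(line, "deleted"):
			delete(have, m[1]) // piece removed again
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("pieces the node should still have:", len(have))
}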

2 Likes

These describe the same scenario. The uplink disconnects all connections once it has enough pieces uploaded. Losing the race would also look like a premature disconnect to your node; the difference is that it happens when your node hasn’t finished its job yet.

I agree that it should be fixed, but I’m almost certain it’s not just a logging problem.

About cancellations: please look into the issues on GitHub; @littleskunk already reported it.

1 Like

No, they do not. Either my node gets to keep the piece or it doesn’t. The nodes that lose the race do not get to keep the pieces.
However, because this piece was audited after the upload was canceled, it means that the uplink and satellite consider it “success”.

It should work like this:

if (disconnected_prematurely()) {
	if (piece_complete()) {
		// the whole piece arrived: count it as a normal upload
		log("uploaded");
	} else {
		// the piece really is incomplete: log the cancel and clean up
		log("upload canceled");
		delete(half_uploaded_piece);
	}
}

I’m pretty sure the delete still happens so the nodes are not filled with partially uploaded pieces nobody is going to pay for.

1 Like

Right, we mean the same thing. It’s not just a log line change; logic needs to be built to differentiate the two messages based on context and write different log lines.

The RPC stream returns a cancelled error in both scenarios though. And that is simply what’s reported in the logs. So yeah a change of logic is needed in addition to changing the logs themselves.

I researched this problem a few months ago and got this result: while the storage node is working we have a lot of writes to the database, these writes increase fragmentation of the database files (extremely so), and we get performance degradation (a lot of IOPS for a simple query). Calling vacuum is not a silver bullet, but it really helps, because during this operation the whole file is rewritten sequentially (like defragmentation on disk); we get the same result by just moving a database to /tmp and back to its original place (rewriting the whole file sequentially). A small sketch of running it manually is at the end of this post.
So, the root cause is file fragmentation on the HDD (not database structure fragmentation).
I’m very happy that this release gives us the possibility to move the databases to an SSD; for me it’s the best solution to keep the databases always “defragmented” (we just won’t have to care about defragmentation). I will put them on my mirrored system SSD and monitor how it works. I expect a lot of benefits at the same time:

  1. It solves the issue with “locks”.
  2. No database performance degradation over time (one month with active ingress is enough for a heavy performance impact).
  3. We offload the main storage for pieces and improve IOPS.
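The manual-vacuum sketch mentioned above (only run this while the node is stopped; it assumes the github.com/mattn/go-sqlite3 driver is available, and the file path is just an example):

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "/storage/bandwidth.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// VACUUM rewrites the whole database file sequentially, which also
	// reduces fragmentation of the file itself on disk.
	if _, err := db.Exec("VACUUM;"); err != nil {
		log.Fatal(err)
	}
}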
4 Likes

Sorry, I may have missed the news. When will it be possible to move the database to a separate disk (e.g. an SSD) from the storj data? In 1.4.2? If so, I would be interested in reading a step-by-step guide on how to do so, if it helps to solve the db lock issue.
Thanks!

1 Like

I found it in the GitHub changelog and asked here


1 Like

Yeah, it will be interesting to see what the ability to move the DB does. However, my money is still on it being a software thing… I get the “database is locked” thing during cleanup only, which seems to indicate that the storagenode software spams the deletes at a rate where they take up all the available concurrent connections, and thus uploads and downloads get a “database is locked” because of a timeout, or run into some such intricacies I am unaware of.

Separating the deletions, or limiting them / making them secondary and/or buffered to give priority to upload and download requests, would, I believe, remove the issue…
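Purely as an illustration of that suggestion (not actual storagenode logic), capping concurrent deletions could look like this:

package main

import (
	"os"
	"sync"
)

// deletePieces removes pieces with at most maxConcurrent deletions running
// at once, leaving the rest of the concurrency budget for uploads/downloads.
func deletePieces(paths []string, maxConcurrent int) {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup
	for _, p := range paths {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent deletions are running
		go func(path string) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = os.Remove(path) // errors ignored in this sketch
		}(p)
	}
	wg.Wait()
}

func main() {
	deletePieces([]string{"blobs/piece-a", "blobs/piece-b"}, 2)
}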

but we will know soon enough now…

Here’s my guess on the “Moving DB” feature: there will be plenty of posts in the forum asking why someone’s node just got DQ-ed… and the reason will be that the DBs were moved to someone’s OS drive, the OS was updating automatically, and the result was a “database is locked” audit failure.

I’m all for configurability with plenty of options, but:

“With great power options comes great responsibility”

6 Likes

Well, I think the baseline advice should be not to use it unless you really know what you’re doing. It’s good that the option is there to help more advanced users in scenarios where having the DBs on the data disk wouldn’t work. But if your node is working fine, you really shouldn’t be touching this option.

4 Likes