Failed to add bandwidth usage

Hello, I’m starting to get a lot of these errors and I’m not sure why or how to correct them, so I’m after any help please. I have an RPi4 with a 4TB HDD that has been running since December.

2022-03-09T12:22:06.684Z ERROR piecestore failed to add bandwidth usage {error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52}
2022-03-09T12:22:08.858Z ERROR piecestore failed to add bandwidth usage {error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:348\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52}
2022-03-09T12:22:12.149Z ERROR piecestore failed to add bandwidth usage {error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52}
2022-03-09T12:22:16.694Z ERROR piecestore failed to add bandwidth usage {error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52}
2022-03-09T12:22:22.159Z ERROR piecestore failed to add bandwidth usage {error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52}
2022-03-09T12:22:27.360Z ERROR piecestore failed to add bandwidth usage {error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52}

@tre4orbragg did you search for similar reports on the forum? One that looks very similar is Database bandwidthdb is locked - #2 by SGC or Weird node behaviour - #16 by YourHelper1

Yeah, but those didn’t have much clear direction on what to do. I’ve checked the database and have now updated config.yaml to set the max connections to 20; I’ll see how that goes.

As I said, I am facing a similar problem. The main reason, I think, is that I am using an SMR disk (4TB like you, btw) which can’t handle the load. You can check for yourself:

  1. whether this happens when you accept a lot of concurrent requests.
  2. how big the problem is, i.e. the percentage of errors you are getting. If those were all the errors for the day, then your node is more than fine. If the problem appears once in 1,000 or more log lines, then you should just forget it. Analyzing the data from the logs is nice, but overthinking every error will probably do nothing more than raise your blood pressure.
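To get a feel for point 2, a rough sketch like this can report the share of locked-database errors in a log file. The path, log format, and the `error_rate` helper here are just illustrative assumptions; adapt them to wherever your node writes its log (e.g. capture `docker logs storagenode 2>&1` into a file first):

```shell
#!/bin/sh
# Report what fraction of log lines are "database is locked" errors.
error_rate() {
    log="$1"
    total=$(( $(wc -l < "$log") ))                  # arithmetic strips wc's padding
    locked=$(grep -c "database is locked" "$log")
    echo "$locked of $total lines are locked-db errors ($(( locked * 100 / total ))%)"
}

# Demo on a tiny made-up sample:
sample=$(mktemp)
printf 'INFO piecestore uploaded\nERROR piecestore ... database is locked\nINFO piecestore uploaded\nINFO piecestore downloaded\n' > "$sample"
error_rate "$sample"   # 1 of 4 lines are locked-db errors (25%)
rm -f "$sample"
```

If the percentage stays in the low single digits, the node is almost certainly fine.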

Using an SMR disk seems like a classic rookie mistake (which I made too, of course), as everyone buys the cheapest disk to provide more storage but forgets about bandwidth, which is what matters most here. An SMR disk can maybe bring more benefits in the long run, though: by the time you have filled it (or almost filled it), you will be serving a lot of reads and hopefully not so many writes, so the disk will perform better, and of course it gives more space for its price compared to a more expensive CMR.

What you can do:

I currently have a limit of 7 concurrent requests (you can specify that in the config.yaml file). This of course means lower payouts, but this is how much my disk can handle; otherwise RAM gets filled up for no reason and then I get those errors you mentioned…

Hope this helped, but you also have to thank @michaln for tagging the right person in the right problem :wink:


Thanks, there had been a lot more errors, but it looks like a bit fewer after reducing the concurrent connections, so I’m playing with that to find a good balance between load and still actually getting traffic!


I’m using a Toshiba 6TB HDD and encountered the same error.
After adding this setting:

# Maximum number of simultaneous transfers
storage2.max-concurrent-requests: 20

Then it worked!
No more “Failed to add bandwidth usage” errors, and the node is no longer suspended.

Also see: Your Node is Suspended nothing obviously wrong? - #3 by StoreMe2

This is a bad option: it cancels customers’ transfers with a “node is overloaded” error (that is what they see). It’s not a recommended solution; it’s a workaround for weak devices and SMR disks. The better solution is to move the databases to an SSD:

But not everyone has the money to upgrade to a 6TB SSD…


Only the databases. They are (very) small.

One other option is to add a configuration option for the storagenode client to not create them in the first place, or to create them in RAM if some are needed to store transient data.

I would not mind losing statistics I don’t look at anyway if that means not having to buy even a small SSD and not sacrificing performance.

In fact, nothing prevents node operators from placing the databases on a ramdisk. Problem solved.

How? I see the storage dir contains many files, including many blobs; which ones should I move?

Only the *.db ones. However, I wouldn’t recommend using RAM to store them, or you would need some scripting to copy them back to the disk before shutdown and back to RAM before the node starts, which seems complicated to me.
If you do not have an SSD, that’s fine too; it was just a suggestion in case you have one.
SMR disks are unfortunately not good, and there is no good solution, except running several nodes, each on its own disk, to spread the load, or using this limiting option. Unfortunately it affects both ingress and egress, so it will reduce your earnings.


@Alexey, is this just to preserve local history for the node, or is there another reason? I was under the impression that if the databases don’t exist they will be recreated on launch. Is that not the case?

OK, I’ve moved all the *.db files onto the SSD and set storage2.max-concurrent-requests: 0. Now all the errors are gone!

Note: remember to leave the storage-dir-verification file in your storage dir; otherwise the node cannot start.
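For anyone following along, the corresponding config.yaml fragment might look like this. The database path is an example mount point, and this assumes a node version recent enough to support the storage2.database-dir option, which relocates only the databases:

```yaml
# config.yaml -- example only; adjust paths to your own mounts
storage2.database-dir: /mnt/ssd/storagenode-dbs   # *.db files live here now
storage2.max-concurrent-requests: 0               # 0 = no artificial limit
```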


Yes, if the databases do not exist, they will be recreated. However:

  • you will lose your previous and current stats;
  • you cannot disable the filewalker, because the database is empty, so it must be enabled; on an SMR disk it could take days to finish.

That’s not an issue.

That I did not know, thank you. In hindsight, it’s obvious!

One solution here is to persist only the chunks database on disk, while all the other, “unimportant” stats-related ones stay in RAM. This would reduce IO pressure on the drive, and may be enough.

On a separate note:

SMRs are horrific at writes, but reads should not be affected.

On the other hand, having filewalker access each chunk at the start effectively pre-warms the in-memory filesystem metadata cache, thereby improving subsequent time-to-first-byte on actual requests, ultimately contributing to better payouts by winning more races. So I would keep filewalker enabled if only to facilitate that.

But of course, if the hardware is so shoddy that even that is too much (i.e. a low-RAM microcontroller-based board with a software-driven USB-connected mass storage device), then I guess a cheap SSD to hold the databases, plus avoiding the filewalker, would definitely be the superior approach.

and more points of failure.

Unfortunately, not all SMRs are built equally; some perform their optimizations while you read, so reads become slow too.


But so does moving all the databases off the main drive.

Or do you mean keeping track of what is persisted where? That can be done outside of the storage node, just via symlinks. But ultimately, yes: while it’s extra things to keep track of, if the hardware is inadequate, some compromises must be made somewhere, either adding fragility or paying with performance.

I dislike the whole idea of running nodes on Odroids and other Raspberry Pis in the first place, but if an SNO has already gone that route, a few more points of failure won’t matter, especially if it helps avoid buying extra hardware (an SSD or a better drive).

Oh wow…

Yes to both :slight_smile: Also, in addition to the point of failure you add by moving the databases, you add another by moving them only partially. It even sounds complicated.
By the way, I’m not sure symlinks will work well for databases.


A post was split to a new topic: Moving the databases on Windows