Eleos
April 29, 2024, 5:48pm
1
A few days ago I noticed this starting to happen but didn't think much of it. Today I realized that the following is happening:
Data is not being retained on the node
The node shows as being very active (150GB+ ingress per day, and the disk constantly sounds busy, as if data is being written)
Stored data goes from, say, 5.97TB to 6.10TB, then randomly drops back down to 5.97TB with no additional trash (trash stays at 0.79TB)
Each time this happens, the node has restarted (uptime back to just a few minutes)
Suspension and audit scores are not affected (all 99.something%)
Is there something wrong with my node or drive?
Please let me know if any of this info is sensitive, but here are the errors I see in the log when running: docker logs storagenode | grep "ERROR"
2024-04-29T18:13:38Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "SDFOKDP4QXV342QFGKMMO5HJQOWSIXL6APHYXALNIKGDXPLZW4HA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "216.66.40.82:8544", "Size": 62464, "error": "pieceexpirationdb: database is locked", "errorVerbose": "pieceexpirationdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).SetExpiration:67\n\tstorj.io/storj/storagenode/pieces.(*Store).SetExpiration:608\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:487\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:520\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:167\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:109\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:157\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-04-29T18:13:22Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "CLGDBINULTNR3P5N2O6FNZHLV4LTS34VUUTOQY24F6HLANLLWSHA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.226.101:55460", "Size": 131072, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:229"}
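In case it helps narrow things down, a rough way to see which error message dominates is to group and count the error field from the log (same container name as in the command above):

# count how often each distinct error appears
docker logs storagenode 2>&1 | grep ERROR | grep -o '"error": "[^"]*"' | sort | uniq -c | sort -rn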
I am using an RPi 4 8GB, an external SSD via USB 3 for the OS + identity, an external USB 3 HDD for the node data, Ethernet, Raspberry Pi OS 64-bit Lite (CLI, headless), Docker, and No-IP. Screenshots are provided below.
bre
April 29, 2024, 7:14pm
2
Hi @Eleos ,
Any customer usage pattern is normal; we cannot predict future patterns based on prior usage.
However, it's been suggested that you look into database lockup problems.
1 Like
Eleos
April 29, 2024, 7:26pm
3
I know customer usage can change. I first thought it was just everyone backing up their data at the end of the month, but it seems I have some issues that I noticed around the same time this increase in usage started.
How exactly do I do that? I have no knowledge of database files, what they should look like, reading errors, etc.
bre
April 29, 2024, 7:41pm
4
These links cover the most common database issues encountered by SNOs.
If none of these are applicable, please let us know and we can look further:
https://support.storj.io/hc/en-us/sections/360004515252-Databases-Issues
I got it to run for the first one, piece_expiration.db, now.
I was struggling with the docker command, which I don't need at all with sqlite3 installed, and with all the /storage/ paths, which all have to be adjusted.
Hope it works soon.
Update:
Perfect, it worked for piece_expiration.db and for storage_usage.db, and the node is running fine now. The guide should still have more hints that the commands need to be adjusted and not just copied over.
This seems to have helped solve a similar issue for another user in the past: Database locked,node crash - #4 by zip
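For reference, the checks in those articles come down to running an SQLite integrity check against each database file while the node is stopped. A minimal sketch, assuming sqlite3 is installed on the host and the databases live under /mnt/storagenode/storage (a placeholder path, adjust to your own setup):

# stop the node first so the databases are not in use
docker stop -t 300 storagenode

# check one database; the expected output is "ok"
sqlite3 /mnt/storagenode/storage/piece_expiration.db "PRAGMA integrity_check;"

# start the node again afterwards
docker start storagenode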
1 Like
Eleos
April 29, 2024, 7:53pm
6
Thank you. I am currently looking into this so I will come back after I have more info on if it solved my issues.
2 Likes
Alexey
April 30, 2024, 4:27am
7
Seems it's time to move the databases either to an SSD or a flash card/stick:
How to move DB's to SSD on Docker
Before you begin, please make sure that your SSD has good endurance (MLC is preferred); I personally recommend using an SSD mirror.
Look into the official documentation and make sure that you are using the --mount type=bind parameter in your docker run string.
Prepare a folder on the mounted SSD outside of <storage-dir> from the official documentation (that is your folder with pieces).
Add a new mount option to your docker run string:
Now we have:
docker run -d --restar…
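A rough sketch of what that could look like, with placeholder paths; the /app/dbs destination and the storage2.database-dir setting are assumptions about what the linked guide uses, so double-check against the guide itself:

# hypothetical SSD folder for the databases, created outside of <storage-dir>
mkdir -p /mnt/ssd/storagenode-dbs

# extra option added to your existing docker run string
#   --mount type=bind,source=/mnt/ssd/storagenode-dbs,destination=/app/dbs

# and in config.yaml, point the node at the new database location
#   storage2.database-dir: dbs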
Please also check your logs for errors related to a retain process. Also, is lazy mode enabled? Did you disable the used-space-filewalker on start or not?
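For example, something along these lines will surface them (same container name as used earlier in the thread):

# show retain / filewalker related errors from the node log
docker logs storagenode 2>&1 | grep -i error | grep -Ei "retain|filewalker"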
Eleos
April 30, 2024, 4:44am
8
I was looking into that earlier and moved them over to the external USB 3 SSD that I have the OS on, which has a 300TBW rating. Things have been running well for about 8 hours now, and I will continue to monitor it over the next few days. I am guessing that the HDD could not keep up with the databases at a 160GB-per-day ingress rate but could at 80GB a day.
Running docker logs storagenode | grep "ERROR"
still shows a lot of the following error:
2024-04-29T21:27:40Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "SZLQBO6XHDTMTIJVJ7KGN6BP3Z2WR2HLDBGK3CX6OEXEM4345BEA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.226.98:52044", "Size": 196608, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:229"}
I am not sure about the lazy filewalker, but this is what I have for it in the config file. I am guessing it is disabled since it's commented out:
# run garbage collection and used-space calculation filewalkers as a separate subprocess with lower IO priority
# pieces.enable-lazy-filewalker: true
1 Like
Alexey
April 30, 2024, 4:55am
9
Yes, your node was too slow for this customer, so they canceled the upload.
Since it's commented out, that means it's unchanged. The default value is true, so lazy mode is enabled.
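If you ever want to set it explicitly rather than rely on the default, you would uncomment the line in config.yaml and restart the container, roughly like this:

# in config.yaml
pieces.enable-lazy-filewalker: true

# then restart so the change takes effect
docker restart -t 300 storagenode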
Eleos
April 30, 2024, 5:19am
10
For this error, is there anything I can do to make the node faster for this customer, so that some of the data does not get canceled?
Like I mentioned above, I am using an RPi 4 8GB, an external SSD via USB 3 for the OS + identity (and the databases now), an external USB 3 HDD for the node data, Ethernet, Raspberry Pi OS 64-bit Lite (CLI, headless), Docker, and No-IP.
I think not. You cannot win all the races and be close to every customer in the world.