I see high loads every time I restart my nodes, because I don’t use the lazy filewalker. As such, a full scan of the node is done on every reboot, which generates a lot of IOPS.
I wonder if it’s something benign, like: now that the badger cache is available, the startup operations finish much faster and therefore push the database updates much harder…
I’m not saying it’s an issue or unintended, but on 3 of the 8 nodes on this particular machine the piece_expiration database is 10-15 GB, while for the remaining nodes it’s much smaller. Those are the only large files the nodes keep on the SSD, so I suspect the traffic is aimed at those files.
The files:
-rw-r--r-- 1 root root 12G Aug 21 18:01 sn251/piece_expiration.db
-rw-r--r-- 1 root root 2.0G Aug 21 17:44 sn252/piece_expiration.db
-rw-r--r-- 1 root root 2.4G Aug 21 17:56 sn253/piece_expiration.db
-rw-r--r-- 1 root root 2.4G Aug 21 17:24 sn254/piece_expiration.db
-rw-r--r-- 1 root root 13G Aug 21 17:47 sn255/piece_expiration.db
-rw-r--r-- 1 root root 12G Aug 21 18:09 sn256/piece_expiration.db
-rw-r--r-- 1 root root 2.3G Aug 21 18:14 sn257/piece_expiration.db
-rw-r--r-- 1 root root 2.5G Aug 21 18:07 sn258/piece_expiration.db
I’m sure the piece_expiration databases got this large from previous issues / slowness on the nodes / cancelled or incomplete uploads, etc.
So my expectation is that the database is “as expected”, though it probably has a lot of obsolete entries.
For reference - I have checked consistency and vacuumed them - all is good.
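For anyone wanting to do the same, the consistency check and vacuum are just standard SQLite housekeeping, roughly this (run against each piece_expiration.db with the node stopped):
sqlite> PRAGMA integrity_check;
ok
sqlite> VACUUM;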
These are fairly old nodes (2-3 years), so many things could have happened to these databases in the past.
What I am concluding is:
Since v1.110 the load on these expiration databases has been very high. I’m sure the nodes are actually doing “something” - I just do not know what it is or how long it’s going to take.
I guess it’s running through all the expiration entries in the DBs and trying to delete the files, but it’s a VERY long list of files that have probably already been deleted by other functions.
In addition, this behaviour is different from previous versions (the DBs have been this large for several months already).
I am wondering if it would make sense to delete these piece_expiration databases from the 3 nodes and let GC / BF perform the cleanup instead.
Anything else I should be aware of, if I decide to go down this road?
If I decide to delete the piece_expiration DB from these nodes, it will be recreated on start, and the only impact is that GC / BF will have to clean up the pieces that were previously tracked in the DB?
Update on this issue.
The load has continued since my first post, and I’ve now had a few hours to dig deeper into the cause of the issue.
Hardware
Intel CPU / 32GB RAM
Seagate X16 / X18 disks (1 disk = 1 node)
Samsung PM893 2TB SSD (shared for logs, DBs and other non-Storj-related workloads)
LSI MegaRAID controller (IT mode)
Ubuntu 22.04 LTS
Docker nodes
Findings
HUGE backlog of expired pieces in piece_expiration.db, for example:
sqlite> SELECT DATE(piece_expiration) AS expiration_date, COUNT(*) AS piece_count FROM piece_expirations GROUP BY DATE(piece_expiration);
2024-07-19|6034937
2024-07-20|6651785
2024-07-21|5668095
2024-07-22|7423387
... and so on ...
Total entries for this node:
sqlite> SELECT count(*) FROM piece_expirations;
70610388
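To see how much of that backlog is already overdue (expiration date in the past), a count along these lines can be used; note it has to scan the whole table, so it may take a while on a DB this size:
sqlite> SELECT COUNT(*) FROM piece_expirations WHERE DATE(piece_expiration) < DATE('now');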
Whenever the node selects the next batch of 1000 pieces to delete, it causes the SSD holding the DB file to read an excessive amount of data (avg. 200-400 MB/s for 20-30 seconds - typically most of the 10-15 GB database); see the query-plan check after these findings.
Tons of “file does not exist” WARN log messages:
2024-09-13T22:17:28Z WARN collector unable to delete piece {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "2KDVVEO7TLJDBL2UB3X7736HIITPVZ56KAVWUYCCVALTBHWW5T3Q", "error": "pieces error: filestore error: file does not exist", "errorVerbose": ...
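On the heavy reads mentioned above: I don’t know the exact query the node runs, but assuming it selects the oldest expired entries in batches, something like the following shows whether SQLite can answer it from an index or has to walk the whole table (a “SCAN piece_expirations” in the output would explain re-reading most of the 10-15 GB file for every batch):
sqlite> EXPLAIN QUERY PLAN SELECT * FROM piece_expirations WHERE piece_expiration < datetime('now') LIMIT 1000;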
My Thesis
During the test data “explosion”, the number of pieces to expire grew beyond what my nodes’ IOPS could keep up with
When the garbage collection bug was fixed in release v1.110 (Bloom filters changed to exclude expired pieces), the BFs removed this data
The resulting race between piece_expirations processing and BF/GC overloaded the disks’ IOPS capacity
A quick calculation indicates 5-6 months of 100% IOPS time to work through the backlog (expecting 150-200 IOPS on a Seagate X16)
My Fix
Delete the obsolete expired pieces from the database, as it would take months and months to process the backlog, and the vast majority of this data was already deleted by the BF change to exclude expired pieces.
How
Delete the presumably obsolete entries from the expiration table:
sqlite> DELETE FROM piece_expirations WHERE piece_expiration LIKE '2024-07%';
sqlite> DELETE FROM piece_expirations WHERE piece_expiration LIKE '2024-08%';
sqlite> DELETE FROM piece_expirations WHERE piece_expiration LIKE '2024-09-0%';
… and let the node process the remaining days from 2024-09-10 to the current date (anything before that should already have been handled by BF / GC)
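To verify what is left and to shrink the now much smaller database file (deleted rows normally don’t shrink the file on their own; a VACUUM reclaims the space), something like this should do, again with the node stopped:
sqlite> SELECT MIN(piece_expiration), COUNT(*) FROM piece_expirations;
sqlite> VACUUM;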
DISCLAIMER: This is NOT a complete guide, just my findings and how I resolved the load issues on these nodes. Please do not blindly perform anything from this post if you do not know the implications, i.e. when to stop the nodes for various tasks, etc.!
Just my 2 cents, in case someone else can benefit from it
Kind of expected. Every node has experienced this.
Since we have implemented a workaround for expired pieces not being deleted in time, we would hope that these mass deletions will not happen in the future and will be handled better.
Not necessarily, but it depends on the underlying disk subsystem, of course.
My nodes handled that pretty well (after enabling the badger cache for the biggest one). After a month, all nodes reported the correct used space, the correct used space in the trash, and everything looks good. The total amount is 9.28 TB.
They are on Windows, so NTFS. Two nodes run in Docker Desktop and one as a Windows service.
So, the worst-case scenario.
I didn’t do anything manually; they handled it themselves.
And I also didn’t watch them every second. Maybe that’s an important part.
I did some maintenance when I received a message that my nodes had gone offline (WSL2 is not as robust as I would expect); however, restarting the needed services helped.
True - it will vary, and I agree that my nodes are nowhere near worst-case hardware for this kind of setup:
Single disk per node (dedicated)
Dedicated LSI 2008 SATA controller
SSD for log + DB offloading
Badger enabled
Plenty of free CPU time
Which is probably why I’ve won many races, and maybe seen a higher intake than other, lower-performance nodes.
Each of these nodes had 50-70 million pieces, and I guess that contributed to escalating the impact of the race between BF/GC and expiration deletions.
Sure thing, overanalyzing metrics related to this kind of workload can have negative consequences. However, in this case I left it for a few weeks and then went back to the issue.
Anyway - it’s resolved for me, and the nodes and I are happy.
Again - I just shared this info to help someone else who might be in the same boat.