Debugging space usage discrepancies

elek · January 8, 2024, 3:51pm

I wrote this quick how-to to help debugging space discrepancies (When reported used space is different from the observer one). It might be too technical, and probably you don’t need it for normal Storagenode operations. But if you are interested what’s going on under the hood and keen to debug internals, here it is.

There are multiple sources of the used space:

Operating system
Storagenode
Satellite

Operating system

Let’s start with the operating system based space usage. blobs directory supposed to storage all the segment files, each has a small metadata header (512) and the raw data. Also, you may have overhead depending on the used file system.

In my case:

sudo du --max-depth=1 
4	./temp
26060	./trash
8612044	./blobs
4	./garbage
8639576	.

sudo du --max-depth=1 --apparent-size
4	./temp
26056	./trash
8599010	./blobs
4	./garbage
8626659	.

As you see, OS thinks that I used 8.6 GB (8.2 GiB)

blobs directory has a separated subdirectory for each Satellite. You can double check if unused Satellites has storage directories or not.

trash directory is separated and will eventually be purged. Having too big trash can also be a signal of a problem.

Storagenode

Storagenode calculates the space with walker, which checks all the blob files one by one. Due to the many small files it’s quite extensive operation, therefore it’s executed during the startup, and counters just updated in-memory during new reads/writes.

Executions are async, as the result of the previous execution is stored in a local cache database, it may not fully update…

Note: To avoid huge IO pressure for big nodes, there is a lazy walker implementation which executes the size calculation in a separated process with lower IO priority.

To check this value, you have 3 options:

A) check the current value (from the dashboard or from the JSON output)

This one is displayed on the Storagenode Dashboard, but you can also get it in the geeky way: curl -v 127.0.0.1:14001/api/sno/ | jq '.' (or just 127.0.0.1:14001/api/sno in your browser).

In my case:

{
  "nodeID": "...",
  "wallet": "0x0000000000000000000000000000000000000000",
  "satellites": [
    {
      "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
      "url": "us1.storj.io:7777",
      "disqualified": null,
      "suspended": null,
      "currentStorageUsed": 8776057856
    }
  ],
  "diskSpace": {
    "used": 8778217472,
    "available": 1000000000000,
    "trash": 0,
    "overused": 0
  },

As you see I connected only to one satellite (this is a special storagenode what I have access to).

Both satellite level and summarized numbers (diskSpace) are based on the results of the file walker.

(B) cached values

These numbers are also stored in a local database to make the next startup fast (we don’t need to wait until everything is calculated).

/opt/storagenode1/config/storage# sqlite3 piece_spaced_used.db
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite>select * from piece_space_used;
8778217472|8776057856|
0|0|trashtotal
8778217472|8776057856|O
                       Ll+z&'RMn

Here you can see the data both with and without the metadata header (512 bytes per blob file).

In my case: (8778217472 - 8776057856) / 512 = 4218, so I expect that number of blob files in my storage directory.

Which seems to be fine:

cd storage/blobs
find -type f | wc
   4218    4218  476634

(I have also one satellite, so I didn’t enter inside subdirectories)

Satellite

Thanks to the walker separation, the value also can be checked with the walker itself (for each satellite).

echo '{"satelliteID":"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}' | ./storagenode used-space-filewalker  --pieces ./config/storage --storage ./config/storage --info ./config/storage/pieceinfo.db --info2 ./config/storage/info.db 
2024-01-08T13:29:28Z	INFO	Database started	{"process": "storagenode"}
2024-01-08T13:29:28Z	INFO	used-space-filewalker started	{"process": "storagenode"}
2024-01-08T13:29:28Z	INFO	used-space-filewalker completed	{"process": "storagenode", "piecesTotal": 8778217472, "piecesContentSize": 8776057856}
{"piecesTotal":8778217472,"piecesContentSize":8776057856}

(3) Finaly the space which is calculated by the Satellite itself. This is the base of the payment (and supposed to be close to the value what storagenodes see).

This is calculated by Satellite 2-3 times per day. We call this component as ranged-loop as it iterates over the segment data in ranges, and does different calculation on the stored segment metadata. In our case it’s the sum of the sizes.

The values are retrieved by storagenode and stored in a local database:

cd /opt/storagenode1/config/storage
sqlite3 storage_usage.db "select timestamp,hex(satellite_id),at_rest_total,interval_end_time,julianday(interval_end_time) from storage_usage order by satellite_id,timestamp desc limit 6"
2024-01-08

My results:

00:00:00+00:00|A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000|37182796689.5248|2024-01-08 03:01:24.917097+00:00|2460317.62598284
2024-01-07 00:00:00+00:00|A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000|218215160132.345|2024-01-07 22:40:25.315725+00:00|2460317.44473745
2024-01-06 00:00:00+00:00|A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000|180876738958.695|2024-01-06 21:08:43.610494+00:00|2460316.3810603
2024-01-05 00:00:00+00:00|A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000|269904445870.378|2024-01-05 23:59:07.048166+00:00|2460315.49938713
2024-01-04 00:00:00+00:00|A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000|154081883010.896|2024-01-04 16:24:36.307736+00:00|2460314.18375356
2024-01-03 00:00:00+00:00|A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000|199409017263.589|2024-01-03 22:23:04.446944+00:00|2460313.43269036

(please note that you need to do this per satellite)

The trick here is that we don’t really have exactly daily values, because the space usage is update periodically. So what we need is to check the period between one line and the previous line. For example:

37182796689.5248 / (2460317.62598284 - 2460317.44473745) / 24
	~8547986767.53580693371934407086

Here, the first value is the at_rest_total in byte hours. 2460317.62598284 - 2460317.44473745 is the difference between two lines (julianday(interval_end_time)) (in days).

As you see here, we have a few hundred megabytes differences between the space calculated by Satellite and space calculated by Storagenode. Usually it’s because there were deleted segments which are not yet deleted on the Storagenodes.

I wouldn’t like to go into the details here, but Satellite does sg. like this:

Example:

last daily calculation finished at 22 pm at Monday
23 pm at Tuesday
21 pm at Wednesday

In this case the Tuesday report will report 25h (104%) of usage, Wednesday report will include 91% of the usage (22h). This doesn’t make any problem during billing, as usage is calculated based bytes-hour, but can be misleading (that’s why we calculated the used hours above, with using juliandate function)

Segments might be deleted in two different ways:

By Storagende itself, based on expiration time (if segments are uploaded with expiration time)
By Satellite, if related key/object is deleted.

The second is more complex, as this is an async process:

Satellite should iterate over all the available segments and group them by Storagenodes
Each group is compressed and archived to a Storj bucket (more technically: it’s not a compression, but a bloom-filter. It’s is way more smaller, but a few segments may be deleted only later time…)
And finally Satellite delivers all the filters to the Storagenodes (Retain call)
Storagenode will check all the bloom filters one by one and Trash the pieces which are not required any more
Trash content is checked every 24 hours and files which are deleted earlier than one week are permanently deleted

This is usually called once per week (it’s quite an expensive process as we need to check all the segments).

If you have Prometheus, you can check a few interesting metrics:

(TLDR; metrics with retain_ prefix are updated when Satellite sends the request to Storagenode, garbage_collection_ metrics are updated when jobs are done…):

retain_creation_date{field="count"} → This is incremented by one when the bloom filter execution is created. Usually during the weekends…
garbage_collection_pieces_skipped{field="recent"} → Number of pieces which couldn’t be read during the last GC process
garbage_collection_pieces_count{field="recent"} → Number of all piece blobs.
garbage_collection_pieces_to_delete_count{field="recent"} → pieces which supposed to be deleted
garbage_collection_pieces_deleted{field="recent" → pieces which actualy deleted (ideally, should be the same as the previous number)
garbage_collection_loop_duration{field="recent"} → GC execution time (seconds)

If you don’t have prometheus, you can check the debug endpoint manually (but without the historical data):

For example:

curl 127.0.0.1:8001/metrics | grep garbage_collection | grep recent

Assuming you set the STORJ_DEBUG_ADDR=127.0.0.1:8001 environment variable for the storagenode process.

Check the log!

This is how it supposed to work, but any of these components can be failed (but not typical). So if you have any strange disk space usage discrepancies:

check for any logs (especially eroor lines with reference to walker, gc/garbage_colector, bloom filter or trash)
check the size of the trash (only the pure blobs folder should be in sync with the usage calculate by Satellite)
check if you had GC proesses recently…

elek · January 9, 2024, 9:55am

Looks to be a bug, garbage_collection_pieces_count and garbage_collection_pieces_skipped are swapped when lazy file walker is used (default).

This should fix the problem:

https://review.dev.storj.io/c/storj/storj/+/12076

Until that, you can check the value of ..._skipped

Not: it’s still not very useful metrics, as the value is updated after any execution. If two satellites have totally different values, the first one may not be visible, only by max/min fields.

I am considering to add the satellite as tags to all of these metrics…

Ottetal · January 9, 2024, 9:57am

Very interesting!

I Look forward to try this at home. Thank you for a great post friend.

Ruskiem · January 9, 2024, 12:43pm

Thats all seems cool, but as You can see in discrepancy topic, ppl don’t understand or aren’t able to follow those complexity.

For me as a SNO, to debug i just need to know, when was the last time fileWalker finished its work on each sattelite.

What’s the simplest (thus fastest) way to find out when was the last time filewalker was able to finish that scan on startup?

If i had that in dashboard, and could look, like: “aha, on us1. it finished full scan 78h ago, on eu1. 24h ago, on ap1. 101h ago” then i would know theres noting to worry about, because:

1. deletes could occur any moment for e.g. 1-2TB, if customer wishes,
1. and scan can take days to catch up with the reality.
1. and node can gain 1-2TB ingress any moment if customer wishes, and again see 2)

Most important things is to know, how much the scan took on each satellite, and when it last time finished. Other things are just secondary. If my filewalker aren’t able to finish, or it takes 6-9 days for one satellite, then i need to know this in first place, in steps as easy like: opening a node dashboard, and looking up. THEN i can think about what to do, to fix that, looking into complex debugging like those here Elek …

more ...

, or just avoid all that, and make immediate changes to the hardware, that i know are 99% the reason, (like the HDD sATA controler can be too shitty(customers grade 1x PCI chinesium inventions with 4-8HDDs connected simultaneously - time for a real conntroler like H310 or LSI 92xx one, or HDD is some old fart and just time to replace for new.)
Looking into those complex debugging is like last thing anyone want. Maybe if You would pay me $2,5/TB stored, not $1,5 like now hah.

elek · January 9, 2024, 2:57pm

The easiest way is checking the logs.

There are two file walkers:

garbage collector file walker (started when satellites calls the SN with the bloom filter).
used space calculator (started during startup)

(1) Garbage collector start and end are logged

Example for starting GC (for one satellite):

2023-11-15T14:25:45Z	INFO	retain	Prepared to run a Retain request.	{"process": "storagenode", "Created Before": "2023-11-08T17:59:59Z", "Filter Size": 9778, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}

Example for finishing the the GC (for the same satellite):

2023-12-20T11:10:45Z	INFO	retain	Moved pieces to trash during retain	{"process": "storagenode", "num deleted": 1113, "Retain Status": "enabled"}

(2) Both of the file walkers are logged if they use the lazy file walker (default)

Space usage calculation example:

2024-01-09T11:50:03Z	INFO	lazyfilewalker.used-space-filewalker	starting subprocess	{"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-01-09T11:50:03Z	INFO	lazyfilewalker.used-space-filewalker	subprocess started	{"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-01-09T11:50:03Z	INFO	lazyfilewalker.used-space-filewalker.subprocess	Database started	{"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
2024-01-09T11:50:03Z	INFO	lazyfilewalker.used-space-filewalker.subprocess	used-space-filewalker started	{"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
2024-01-09T11:50:18Z	INFO	lazyfilewalker.used-space-filewalker.subprocess	used-space-filewalker completed	{"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode", "piecesTotal": 204536366848, "piecesContentSize": 204484288256}

GC file walker log example:

2023-10-11T11:39:47Z	INFO	lazyfilewalker.gc-filewalker	starting subprocess	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-10-11T11:39:47Z	INFO	lazyfilewalker.gc-filewalker	subprocess started	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-10-11T11:39:47Z	INFO	lazyfilewalker.gc-filewalker.subprocess	Database started	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
2023-10-11T11:39:47Z	INFO	lazyfilewalker.gc-filewalker.subprocess	gc-filewalker started	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "createdBefore": "2023-10-04T17:59:59Z", "bloomFilterSize": 7344}
2023-10-11T11:39:50Z	INFO	lazyfilewalker.gc-filewalker.subprocess	gc-filewalker completed	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "piecesCount": 68415, "piecesSkippedCount": 0}
2023-10-11T11:39:50Z	INFO	lazyfilewalker.gc-filewalker	subprocess finished successfully	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}

elek · January 9, 2024, 3:06pm

The most interesting fields (IMHO) are num deleted (the number of deleted pieces) and Created Before.

Bloom filter can decide if one piece should be deleted or not (it includes the essence of the satellite knowledge ).

But this knowledge is always out-of-date. What happens if bloom filter is generated at the morning, but received by the storagenode at the same evening? All new files (created since the morning) are looks like unknown, as Satellite didn’t have any chance to know about them.

Should we delete them? Certainly not.

For this reason the bloom filter includes the validate date. (the date of the newest segment at the time of the creation of the bloom filter). Storagenode should check only the older files. And keep all the new files.

But the clock of the Storagenodes are not very precious either. Even if NTP is recommended, some nodes are way behind the current time. Which makes the date based exclusion less reliable.

For this reason additional 3 days are substracted from the current date.

Logged Current Date = newest segment during bloom filter creation - 72 hours

Storagenode will delete only older segments. Bloom filter is generated from a backup database (to avoid any additional load on the production database). For this reason, even after a successful GC file walker run, you may have 3-7 days old segments on the local disks…

Ruskiem · January 9, 2024, 3:36pm

Great, now get me that info to main page on node dashboard please.
Because often i can’t even OPEN my storagenode.log (LOL)
they are too big! (not enough free RAM to load them)

i cut them in pieces every week or 2-3 times a week for smaller size, then tracing in which log file the filewalker started, and in which it ended becomes a headache even more… and times that by each sattelite, and times many nodes…

daki82 · January 9, 2024, 7:51pm

set the loglevel to FATAL(lazy one) or redirect the logs (bit more work).

zip · January 9, 2024, 9:26pm

I was wondering how storagenode determines what not to count during used space filewalker. This operation can take days while the node is receiving new data and eventually deleting some from trash.
I assume tallies for new uploads are updated on the fly in the database, but how do you prevent used space filewalker to count them the second time if these new data are stored in a file that hasn’t been counted yet? Are you checking the timestamp of the file to say this file was created after the filewalker has started, so do not count this file?

snorkel · January 9, 2024, 11:09pm

The problem with the logs is that all the useful services like FW, Lazy, GC, etc. are on INFO level and nobody without a SSD with keeps that on. The ability to see their status, start, end and fail, would be very useful if they are tagged as WARN or ERROR level, not INFO, in logs. I could switch my log level from FATAL to ERROR or WARN if that would be the case, just to monitor those services/walkers.

daki82 · January 9, 2024, 11:25pm

Would appreciate a loglevel between Fatal and Error (Like Important,Vital or Critical), for vital system events like this, who are not repetitive every second, and do not accumulate in over 100MB for 10years of running the node.

penfold · January 10, 2024, 1:07am

In this context, what constitutes a “big node”?

elek · January 10, 2024, 8:50am

1TB, for example. I tried to migrate 1TB files, and it was quite slow due to the many small files.

elek · January 10, 2024, 8:51am

I totally agree with the problem, but would prefer a slightly different solution. These are not warnings or errors. It would be better to have an option to silence / separated the normal access log (change them from info to debug).

elek · January 10, 2024, 8:58am

That’s a good question. space walker uses all the blobs, but it tries to adjust the report with the numbers, counted during the walk.

Before starting, it makes a snapshot from the actual counters, and at the end, it does the same. The difference is used to adjust the current calculation.

(Still there is a race condition, and for this reason (?) only half of the difference is used to adjust. Which can be inaccurate if space walker runs for days.)

github.com

storj/storj/blob/f5020de57c9200e61aaa842013263075de229333/storagenode/pieces/cache.go#L460-L477


      
          func estimate(newSpaceUsedTotal, totalAtIterationStart, totalAtIterationEnd int64) int64 {
          	if newSpaceUsedTotal == totalAtIterationEnd {
          		if newSpaceUsedTotal < 0 {
          			return 0
          		}
          		return newSpaceUsedTotal
          	}
          
          	// If we missed writes/deletes while iterating, we will assume that half of those missed occurred before
          	// the iteration and half occurred after. So here we add half of the delta to the result space used totals
          	// from the iteration to account for those missed.
          	spaceUsedDeltaDuringIteration := totalAtIterationEnd - totalAtIterationStart
          	estimatedTotal := newSpaceUsedTotal + (spaceUsedDeltaDuringIteration / 2)
          	if estimatedTotal < 0 {
          		return 0
          	}
          	return estimatedTotal
          }

Ottetal · January 10, 2024, 10:01am

I totally agree.

Many applications implement additional log levels ↓
debug, info, notice, warning, error, critical, alert, emergency

If StorJ were to do the same, info could be as it is right now with all the pieces and information about the filewalkers and other non-warning info moved to notice

daki82 · January 10, 2024, 10:46am

As most of the log is info and error, i would still log on critical. or let it be:
debug, info, error, notice, warning, critical, alert, emergency

then i would propably choose notice. writes to the system ssd (standard install) are no jokes to gamble with here.

snorkel · January 10, 2024, 3:01pm

It just needs to remove those milions of entries about uploaded and downloaded pieces, canceled, finalised, etc. The rest what remains is useful somehow, and dosen’t take to much space or disk time.

Alexey · January 11, 2024, 7:38am

You need to use CLI tools, like PowerShell:

sls "walker|retain" "C:\Program Files\Storj\Storage Node\storagenode.log" | sls "started|finished|failed|error"

or stop the node, rename the log, start the node back, a new logfile will be created and it will be pretty small.
If you prefer a dashboard, then you may setup a Prometheus storagenode exporter and configure Grafana as you like, include time when each of filewalkers was started/finished or failed:

elek · January 11, 2024, 4:08pm

Related issue:

github.com/storj/storj

{storagenode,satellite}/gc: bloom filters are ineffective with large storage nodes

opened 11:41AM - 11 Jan 24 UTC

egonelbre

Currently we have a maximum memory limit for bloom filters, however that has a s…ide-effect of them being completely filled with nodes with large number of pieces. With simulating the bloom filter effectiveness we can see different behaviours. https://github.com/storj/experiments/blob/main/simulate-bloom-filter/main.go Here's an approximate results for different piece counts and max bloom sizes: ``` satellite add/delete storage-node bloom-size ideal-bloom-size 1_000_000 50_000 1_005_000 583 KiB 583 KiB 2_000_000 50_000 2_005_000 1.1 MiB 1.1 MiB 4_000_000 50_000 4_010_000 2.0 MiB 2.3 MiB 8_000_000 50_000 8_060_000 2.0 MiB 4.6 MiB 10_000_000 50_000 10_015_000 2.0 MiB 5.7 MiB 14_000_000 50_000 14_450_000 2.0 MiB 8.0 MiB 16_000_000 50_000 16_500_000 2.0 MiB 9.1 MiB (almost unstable) 20_000_000 50_000 unstable 2.0 MiB 11.4 MiB 20_000_000 50_000 20_130_000 4.0 MiB 11.4 MiB 26_000_000 50_000 26_300_000 4.0 MiB 14.8 MiB (almost unstable) 26_000_000 50_000 26_140_000 5.0 MiB 14.8 MiB ``` So, if we use the currently calculated optimal size we are going to have a significantly smaller overhead than our false positive rate. Randomizing the seed significantly does clearly help. This however falls apart when the bloom filter is completely filled -- seems to happen around 16-20M pieces for 2MiB and 22-26M pieces for 4MiB. Currently our largest node has 26M pieces, so bumping the bloom size to 4MiB will probably help. We may also want to adjust our bloom filter size calculation to suggest 2x smaller (or 1.5x smaller) bloom size than the theoretical result suggests. Bumping it to 5MiB should solve it somewhat, however we need to be mindful of drpc packet limit, which may need to be changed -- alternatively, we need a new message type to send larger bloom filters. One interesting approach to try is to create bloom filters only for a subsection of piece ID-s, rather than all of them. This should allow to shrink the number of piece-ids put into the bloom filter, at the cost of longer tail to cleanup the thrash. If we split it into two, e.g. only pieces `<0x80...`, then our ideal sizes should be half as they are now. --- ```[tasklist] ### Draft action items - [ ] Adjust ideal bloom filter size calculation to be smaller. We should experiment a bit more, but it does seem like 1.5x smaller size should be safe. - [ ] Increase bloom filter size as much as drpc packet limit allows - [ ] Create new protobuf that allows to send large bloom filters (larger than 4MiB) - [ ] Try piece-id selection strategy. (e.g. what if bloom filter ignored half or quarter of the pieces) - [ ] Add a log/monkit warning when the fillrate of a bloom filter is above 0.95. ```

TLDR; with high number of segments on one node, the bloom filter is less effective as it should be. Fix is on the way.

The easiest way to check if you are affected is counting the blobs in the storage directory ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa or v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa folders (US or EU satellite).

If you have at least 14M blobs (in one satellite folder) and see big discrepancies, feel free to post your numbers as comment on the issue…

Example:

cd /storj/storj01/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
find -type f | wc -l
3665465

3 million. It should be fine.