New data not being stored

So after my previous issue got fixed by moving the DB files over to an SSD, everything was fine for a while, but now it seems that new data is not being stored. I have gotten 50-100GB of new ingress data today, but none of it is reflected in the dashboard; I have been sitting at 6.14TB used all day. Along with that, data is slowly being lost but not reflected in the trash. Do I just need to wait for a filewalker to run or something? I have never had this issue before, as the data has always updated on the dashboard.

If you are referring to the average space usage graph, then it’s probably fine and you just need to wait a bit.

It sounds to me like it’s one of the following:

  1. Your ingress equals the data that got deleted today.
  2. The satellites haven’t sent any new used-space updates since the initial ones received today (the average usage graph relies on data from the satellites and is not calculated locally).

If your graph still seems “stuck” after a few days, there might be another issue, but it’s probably one of the above.

Regarding the trash, there are two systems in place:

  1. TTL pieces: these are automatically deleted when they expire and never appear in the trash; they are simply removed from the drive.
  2. Standard deletions: when a customer deletes data, the node has to wait for a bloom filter before it can tell which pieces are trash and which are (probably*) not. These filters are sent out about once a week, so trash won’t increase right after a deletion; instead, it takes about a week before the deleted data is moved to the trash.
  • Bloom filters have some false positives, where pieces that are actually trash are treated as good data; there is plenty of discussion about this on the forum if you want to read more.

So the average space used graph updates just fine, and so do the ingress/egress graphs. The problem is that the used space and trash aren’t being updated.

So for trash, I have seen the % usage go down from 42.88% used (it stayed like this all day) to 42.87%. The trash value is still 697.72GB (0.7TB on the dashboard). Is this what you mean by TTL pieces, small pieces of data?

What I am most concerned about is the used data not going up despite the ingress value going up a lot. I have always seen this value update every minute (or every hour via the script I run to send myself statistics).

So for the usage % going slightly down, that does sound to me like some expired pieces (TTL pieces) being removed without going through the trash.

In addition, as you have probably seen in the past, trash is only cleared a week after it is initially collected (via the bloom filter), so it is normal for it to be static for a while.
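If you want to see when each batch of trash is due for cleanup, a rough sketch like the one below can help. It assumes the storage directory is at /mnt/storj/storagenode/storage (swap in your own path) and that your node version keeps per-day subfolders inside the trash directory; older versions may not.

    # List trash per satellite and per collection date; each dated batch
    # becomes eligible for permanent deletion roughly a week after that date.
    STORAGE=/mnt/storj/storagenode/storage   # assumption: adjust to your storage location
    for sat in "$STORAGE"/trash/*/; do
      echo "Satellite: $(basename "$sat")"
      du -sh "$sat"*/ 2>/dev/null            # one line per yyyy-mm-dd batch
    done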

For the used space graph (the pie chart), that is indeed strange. It could be that the removed TTL pieces coincidentally match the amount of ingress, but I doubt it. Again, some time should clarify whether ingress and deletions cancelled each other out precisely or whether there is an underlying issue.

Given that you mentioned that you moved your DBs recently, are you seeing any errors in the logs?

If you have plenty of space left, you should have no issues for the moment. If your drive is very close to full, it could lead to Storj trying to write data to a full drive; I’m not sure what the result of that would be, but it might not be good.

As I’m not familiar with the specific DBs, I’m afraid I can’t give you further pointers on what might be wrong beyond checking the logs for errors.
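For example, something along these lines pulls database-related errors out of the logs (a sketch assuming a docker setup with the default container name storagenode; adjust the name and filters to your setup):

    # Show recent errors that mention the databases.
    docker logs storagenode 2>&1 | grep "ERROR" | grep -i "database" | tail -n 20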

If there are any other relevant details that could be helpful, please share them; otherwise, hopefully someone else has a better idea for dealing with your issue.


This is the only type of error I am seeing in the logs, I believe:

2024-05-06T04:28:16Z    ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "7UG46IJLU3QFQP2FM47VFILTVOSCQUA7WJREXHHBR2YSLLNJIQUQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.226.102:49790", "Size": 196608, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:229"}

I did make a post here before I moved the DB files, but the "manager closed: unexpected EOF" error was said to be due to my node being too slow for the customer. I get a lot of these errors but still receive tons of data anyway (~60GB a day, ~2TB a month).

As for having plenty of space left, I have 6.14TB used of 14TB, so no issue there.

This error is indeed quite common and (as far as I know) is essentially just a lost race and nothing else. As they suggested in that thread, moving the DBs to an SSD can help reduce the number of lost races, but it is impossible to win 100% of the time.
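If you want a rough feel for how often your node wins or loses these races, something like this works (a sketch assuming the default docker container name storagenode and the usual log messages; the exact strings can vary between versions):

    # Count successful vs. failed/canceled uploads in the current log.
    docker logs storagenode 2>&1 | grep piecestore | grep -c "uploaded"          # wins
    docker logs storagenode 2>&1 | grep piecestore | grep -c "upload failed"     # errors / lost races
    docker logs storagenode 2>&1 | grep piecestore | grep -c "upload canceled"   # long-tail cancellations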

Could you check if the disk usage (of the blobs) reported by the OS matches the usage in the dashboard, or if it is slowly drifting apart? If it matches, then it might just have been a weird coincidence. While there are multiple reasons for the numbers not to match exactly, they should be increasing at similar rates, not one increasing while the other stays static. If the latter is the case, it sounds to me like the disk usage DB is not being updated correctly for some reason, or the software can’t access the DB (I doubt it, as it can access the other info just fine).
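A quick way to check this (a sketch assuming the storage directory is at /mnt/storj/storagenode/storage; swap in your own path, and note that walking a multi-TB blobs folder can take a while):

    # OS-reported size of the customer data, to compare against the dashboard's "used" figure.
    du -sh /mnt/storj/storagenode/storage/blobs
    # And the trash currently being held back for a week:
    du -sh /mnt/storj/storagenode/storage/trash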


A bit of new info since yesterday: the screenshot shows that the average disk usage has fallen to nearly nothing. Trash is also changing, since data was deleted today as well. The used data still does not go up despite ingress going up.
[screenshot]

This is normal. It only means that your node has not received any average disk usage data from some of the satellites for today. My nodes show the same thing. This should resolve itself during the day as the satellites send usage info. This graph has no effect on payouts, as those are handled by the satellite.

Were you able to check if the disk usage as reported by the OS has gone up for the storage folder while the disk space pie chart has stayed the same?


Sorry, I forgot to ask how to check that on the Linux CLI, as I was not sure what the blobs are. If you mean checking via something like the CLI dashboard:

If you mean via something like df -h (sdb1):

Used space seems fine to me. The dashboard reports 6.84TB used, while df reports 6.8TB (1 decimal place), which seems to match up fine.

As you have plenty of space available, I would wait a couple of days to see if the dashboard’s used disk data updates, or if the disk usage reported by the OS (df) starts to drift significantly from the dashboard (e.g. df says 7-7.5TB used while the dashboard still says 6.84TB).
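One simple way to watch for that drift is to log a daily snapshot and compare it with the dashboard each day, for example via cron (a sketch assuming the data disk is mounted at /mnt/storj; adjust the mount point):

    # Append one line per day with the OS-reported usage of the data disk.
    echo "$(date -Is) $(df -h /mnt/storj | tail -n 1)" >> "$HOME/storj-usage.log"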

Overall it looks to me like the deletions (TTL) have matched up with the ingress, making it appear as if the used space did not update. As long as the dashboard and df agree within a small margin (a couple of hundred GB), there does not seem to be any issue.

For reference, in the last two days I have had over 500GB of ingress, but the reported average used space has gone from 10.1TB to 10.06TB. I have not tracked the “used” pie chart, so I don’t know if that value changed or not.

If you keep seeing no changes to the used space metric and the df vs dashboard values start to drift, it would indicate that something is probably wrong, but with the info so far I can’t tell.

Maybe someone else has a different opinion :slight_smile:.


Thanks for taking the time to give some insight on it. Yeah, I’m not sure; I’ve just never seen the used space not go up, so it’s really weird to me. The disk itself has been a lot quieter since I moved the DBs over, but I think that’s just because it has fewer reads/writes to do.

I’ll give it a couple of days and report back here if anything changes or if anyone else has any info about it.


This is a problem with using interpolated plots: they not only present discrete data as continuous, but also interpolate between different types of data.

This drawing has four issues:

  • It is the wrong kind of plot. It should be a bar graph, with one stacked bar per day showing both ingress and egress, not a line with a filled interior.
  • It should be transparent enough to see the axes. What’s the point of axes if they are never visible?
  • All data points except the last mean “data transferred on that day”. The last data point means “we don’t yet know how much was transferred today” or “this much has been transferred so far”. It should not be interpolated with the rest of the data. It’s just dumb. Using a bar graph solves this too.
  • Timezones exist.

An example of how the daily data transfer visualization should look is darkstat:
[darkstat screenshot]


This sounds like your databases are either corrupted or there is a permissions issue.
Could you please check your logs for errors related to the databases?
I would also recommend checking them all for integrity:

Also check the permissions: is your docker user able to write to them?
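A minimal sketch for both checks, assuming the databases sit directly in the storage directory (the default layout; since you moved yours to an SSD, point this at that location instead) and that sqlite3 is installed on the host:

    cd /mnt/storj/storagenode/storage   # assumption: replace with your databases' location
    ls -l ./*.db                        # owner and permissions: the docker user must be able to write these
    for db in ./*.db; do                # run the integrity check on every database
      echo -n "$db: "
      sqlite3 "$db" "PRAGMA integrity_check;"
    done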

Regarding the databases not updating on trash deletion when lazy mode is enabled - this seems to be a bug:

This is the usual long-tail cancellation: your node was slower than the others. In this case the piece will be deleted, and the usage will remain the same or decrease when some pieces are deleted on TTL expiration or from the trash.

This info is reported by the satellites and sent to nodes with a delay of up to 12 hours; please wait until tomorrow.

Actually, we had a very similar visualization before we started to account for timestamps from the tally reports on the node; back then it looked like this:

When you hovered the mouse over the line, the data points appeared, so it was like a bar chart but with connected tops and smoothing.
Then, after multiple complaints, we changed it to show TB*m instead of GB*h, and the average instead of the real daily usage estimated for the month, and started to match timestamps to make the graph smooth and show usage decreasing or increasing according to the month’s average.

So firstly, I’d like to attach a screenshot to update on some details:
[screenshot]
It has dropped to 0B and will probably stay there. I have never seen this before.
More files were deleted and sent to the trash today, so it went from 0.7TB to 0.9TB, but still no additional stored data.

The database files and folders have the correct user permissions (not root), so all good there.

As far as I can see when grepping for “ERROR” in the logs, I haven’t seen any “malformed” error. Should I still follow the instructions from that link to verify?

As far as I understand, while the [satellite? client?] is busy [preparing the filter? cleaning trash?] for longer than a day, the plot assumes zero instead of skipping the value. The same happened recently, and I’m seeing the same now as well:

This is a bug where “no value” and “a value of zero” are conflated.

This is yet another way these graphics not only waste disk IOPS while providing superficial value and waste the time of SNOs who care enough to investigate, but also waste development time on prioritizing and fixing these artifacts.

We need a green circle when all is good and a red circle with logs when it’s not. That’s it. Nobody needs graphs and plots.


My nodes didn’t receive a tally report from US1, EU1 and Saltlake. AP1 sends them regularly.
I reported it to the team. Usually these gaps get closed once the node receives the next report, but I believe that if more than one report is missed, the gap will remain.

So everyone is seeing this gap for the average used graph?

I ran the sqlite3 ./bandwidth.db "PRAGMA integrity_check;" command on each of the 16 .db files and they all report ok. Is there anything to do at this point other than wait and see?

Edit: After I did that check and started the node back up, the used space is now at 6.14TB instead of 5.93TB, trash went back down to 0.7TB instead of 0.9TB, and the average disk used went from 4.65TB to 3.99TB.

Do nothing. Just keeping the node online and adding storage when it fills up is the extent of storage node operator duties, as far as I’m concerned. I only lift a finger if my external monitoring says the node has stopped reporting AllHealthy. Until then, what it does is Storj’s business.

This just updates the visuals in the web console; it does not affect anything else. I would also ignore it because it’s neither realtime nor accurate and, more importantly, it’s non-actionable. There is nothing there that you can base decisions on, so it’s 100% a gimmick.
