Which command did you use to find this port?
Please check your logs today; I believe it will start again (it should start every 1h by default).
netstat -l
then checked random ports from the output. I got lucky this time: it was the first port in the output.
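For reference, a slightly more targeted variant (assuming Linux with net-tools or iproute2 installed; filtering by the storagenode process name is just an example) lists listening TCP ports together with the owning process, so you don't have to probe ports one by one:

sudo netstat -tlnp | grep storagenode
# or, with iproute2:
sudo ss -tlnp | grep storagenode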
Okay, I'll do it after a full day or more has passed.
But isn't the default already 1 hour, as set by this parameter in the config?
# how frequently expired pieces are collected
# collector.interval: 1h0m0s
Or is it about some other timer/parameter?
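For example, if one wanted the collector to run more often while debugging, the commented default could be overridden in config.yaml (the 30-minute value below is purely illustrative, and the node has to be restarted for it to take effect):

# how frequently expired pieces are collected
collector.interval: 30m0s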
Oh, you are correct. 24h is for the trash filewalker; the TTL collector should run every hour by default.
However, I think it doesn't print anything to the logs if there is nothing to process. See:
2024-07-07T10:38:02Z INFO collector collect {"Process": "storagenode", "count": 1}
2024-07-07T13:39:47Z INFO collector collect {"Process": "storagenode", "count": 2992}
2024-07-07T14:40:42Z INFO collector collect {"Process": "storagenode", "count": 5973}
2024-07-07T16:38:02Z INFO collector collect {"Process": "storagenode", "count": 1}
2024-07-07T21:38:02Z INFO collector collect {"Process": "storagenode", "count": 1}
And maybe I even know the reason (or at least one of the reasons) why the TTL collector works so slowly. I just put my other node (where the collector is still running now) under full disk I/O monitoring (on Windows this can be done with Process Monitor; Linux has its own similar tools, of course) and found tens of thousands of attempts by the collector to delete files that no longer exist from \blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\
That in itself is expected: it matches the mass deletion of the SLC satellite test data after its expiration date. The only problem is the repeated attempts to delete what has already been deleted.
The problem may be that after the files are actually deleted from disk, the corresponding piece records are not removed from the database, so the next pass of the TTL collector tries to delete them again. These operations complete fairly quickly (no actual file deletion happens; the node just gets a "file does not exist" response from the file system), but if you try to delete hundreds of thousands of non-existent files on every pass…
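To illustrate what I mean, here is a minimal Go sketch. This is not the actual storagenode code, and the interface and method names are my own assumptions; it only shows how I would expect the collector to handle a piece that is already gone:

package collector

import (
	"errors"
	"io/fs"
)

// PieceStore and ExpirationDB are illustrative stand-ins for the real components.
type PieceStore interface {
	Delete(pieceID string) error // removes the blob file from disk
}

type ExpirationDB interface {
	DeleteExpiration(pieceID string) error // removes the TTL record for the piece
}

// deleteExpired removes an expired piece and, crucially, also removes its
// expiration record when the file is already gone (e.g. moved to trash by GC),
// so the next collector pass does not retry the same piece again.
func deleteExpired(store PieceStore, db ExpirationDB, pieceID string) error {
	if err := store.Delete(pieceID); err != nil && !errors.Is(err, fs.ErrNotExist) {
		// A real I/O error: keep the record so the piece can be retried later.
		return err
	}
	// Deleted now, or already missing: drop the record either way.
	return db.DeleteExpiration(pieceID)
}

If the current code only removes the database record after a successful delete, every piece that was already trashed or overwritten would be retried on every pass, which would match what I see in the I/O log.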
My own count was ~50k delete attempts on missing files in a log covering only ~20 min of work, before I turned off I/O monitoring because it eats a lot of resources.
Can someone check this on their node too and create a report on GitHub?
For today, I am too tired of tests, debugging and writing reports.
Besides, I think the devs will soon start ignoring my reports, because I have already created too many of them, if I don't stop.
"Time of Day","Process Name","PID","Operation","Path","Result","Duration","Detail"
"8:28:00,9277152","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\h5\iijtcvtjtyl6rgdimz3fql5ixy54wul7imaurzb2i2ujtf4r6q.sj1","NAME NOT FOUND","0.0000387","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,0317587","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\kb\xrcmvecbv664iajupaek4a5updfyjuimpbyrftxn3bs6qe2l3a.sj1","NAME NOT FOUND","0.0000611","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,0319557","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\kb\xrcmvecbv664iajupaek4a5updfyjuimpbyrftxn3bs6qe2l3a.sj1","NAME NOT FOUND","0.0000361","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,2721522","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\tj\b7spx5ctmdgyf3cue6ecbhybcweghpzosc76og5qz2f2mu7noq.sj1","NAME NOT FOUND","0.0000585","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,2723549","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\tj\b7spx5ctmdgyf3cue6ecbhybcweghpzosc76og5qz2f2mu7noq.sj1","NAME NOT FOUND","0.0000384","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,4125799","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\c5\lfgekq7uq5omzyoamdig65fuhwlohkpbnwdnrvnobpegun2bea.sj1","NAME NOT FOUND","0.0000665","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,4127970","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\c5\lfgekq7uq5omzyoamdig65fuhwlohkpbnwdnrvnobpegun2bea.sj1","NAME NOT FOUND","0.0000397","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,5278538","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\gx\r5hrmexivh7u6j67uer3awo5kjgsd2uahg2sitomktwge4kjeq.sj1","NAME NOT FOUND","0.0000637","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,5280790","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\gx\r5hrmexivh7u6j67uer3awo5kjgsd2uahg2sitomktwge4kjeq.sj1","NAME NOT FOUND","0.0000428","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,6408087","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\ot\4dyewgh3mel4utas2sr2s4exhwgkbw7vf2hjcip64lav7g2cfa.sj1","NAME NOT FOUND","0.0000828","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,6410667","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\ot\4dyewgh3mel4utas2sr2s4exhwgkbw7vf2hjcip64lav7g2cfa.sj1","NAME NOT FOUND","0.0000448","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,7199715","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\la\ynakbwcc7u7m3kudvxihr3wu25mozal7mkpdmwtqctopqxvs6a.sj1","NAME NOT FOUND","0.0000617","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,7201794","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\la\ynakbwcc7u7m3kudvxihr3wu25mozal7mkpdmwtqctopqxvs6a.sj1","NAME NOT FOUND","0.0000460","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,8857937","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\33\r34vc6pnrtoicb7xkd7zo4hgr4r73rvwaysjuznd24c3c2ravq.sj1","NAME NOT FOUND","0.0000595","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,8860006","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\33\r34vc6pnrtoicb7xkd7zo4hgr4r73rvwaysjuznd24c3c2ravq.sj1","NAME NOT FOUND","0.0000423","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,9854566","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\zh\xii7o3ij7tuog26gndkgbf5fufmwj4tukr2fyn7t7nkcor5ikq.sj1","NAME NOT FOUND","0.0000512","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:01,9856444","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\zh\xii7o3ij7tuog26gndkgbf5fufmwj4tukr2fyn7t7nkcor5ikq.sj1","NAME NOT FOUND","0.0000412","Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
"8:28:02,0630128","storagenode.exe","3676","CreateFile","F:\Storj_Data\blobs\pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa\2t\6ma4kkm4ac3nfea4aqzb4qbslk75bxcwuo5dpb5ub2rivkdblq.sj1","NAME NOT FOUND","0.0000592","Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a"
Please don't stop. Your observations and investigations are phenomenal and exactly what the code needs.
Yes, you are correct. We made a mistake when we uploaded the TTL data:
So, some data was replaced. A replace means deleting the object and uploading it back… so GC and the trash would be involved…
So, it's possible that the deleted TTL data was collected by the retain process and sent to the trash. Thus the TTL collector cannot find it now (because it's in the trash).
This is the most progress I have seen in weeks. Thank you so much; hopefully we're getting somewhere.
100% correct. I don't have the entire timeline in mind. We fixed that problem and at some point the warn messages should go away. Just 30 days after we fixed it.
@BrightSilence I forgot one very important question. The rollout of the new 107 version is ongoing. The 106 version has a bug with TTL deletes: it will still delete the files from the disk and free up the space, but it doesn't update the cache. So it will look like there is more and more unpaid space on disk, but in reality it is just a visual bug on the dashboard, and I believe your script is using the same data.
To make this even more fun, you can't just run the used-space filewalker to correct it, because of the high growth rate we saw 30 days ago. Let's say your node was growing by 1 TB per day. The consequence is that by the time the used-space filewalker has finished, there is already another 1 TB that was cleaned up by the TTL cleanup job without correcting the numbers on the dashboard. That would match your observation: even after running the used-space filewalker, it still looks like there are a few TB of unpaid space.
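As a rough rule of thumb (an illustration, not an exact model): the leftover discrepancy right after a scan is roughly the TTL deletion rate multiplied by the filewalker duration, so 1 TB per day times a one-day scan is about 1 TB, and a scan that takes several days on a large node leaves correspondingly more uncorrected space.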
The fix for this bug is already in the 107 version. Now I am sure other loud people will point out that the rollout takes so long, so before I run as far away from this thread as possible, I want to explain yet again why the rollout is slow by design. All it takes to destroy the entire network is a bug in the storage node software and a fast rollout. At that point we can close the company and all go home with no payout. That risk is too high; in fact, it is the highest risk we have. As a consequence, there is basically no bugfix that would justify taking the risk of a fast rollout.
Edit: @Alexey I believe we fixed the file overwrite issue on the 14th. So for 7 more days we should see warning messages in the logs, and then they should suddenly stop.
I understand that, though. But that first 5.5TB has consistently been there. I'm sure the other 2.5TB can be explained by running behind on TTL deletes and the effect of TTL deletes during filewalker runs. But that leaves a significant additional gap. I'm sure that will eventually be deleted by TTL as well, but why wait for that if GC can free up that much data much earlier?
I think it is too late for that now. If we move the pieces into trash now they still stay there for 7 days. The TTL cleanup should also catch up in the next 7 days.
Edit: I am sure there will be more bugs. I don't know if it makes sense to spend time on debugging them now, while we would have trouble adjusting our observations to the known issues. In 7 days from now we will have less noise to deal with and can ask what other bugs might be out there much more easily.
2nd edit: But if you wish, I can write down some steps that would allow us to get a better understanding of where the used-space difference is located. It might be outside the SLC satellite namespace.
Alright, I'll monitor. But I have a feeling this isn't just about the overwrites. It seems a lot of pieces end up on nodes even if the transfer is ultimately canceled. So chances are the issue will remain. And if that amount stays significant, bloom filters will still be required to avoid wasting space.
I understand. Since we have to store data in trash for 7 days anyway, I can wait it out and see if TTL and v1.107 indeed take care of it in the next week. Though my gut says it won't.
I understand the slow rollout, btw; that's not something you ever want to rush. I'm patient when I know things will be resolved. It just seemed this wasn't being resolved. But we'll keep in touch.
And this is the right thing to do: spend more time testing, don't roll out broken versions, or you will have even more trouble fixing what was broken by the new release. This is an obvious thing. @Mitsos is 100% right.
Does Storj have a local test satellite, and do you test updates on it before sending a new version to 25k nodes? If not, it is just a matter of time before the worst happens. Using a scheme with 3 production satellites plus 1 test satellite won't help, because every update affects every node on all 4 satellites. I hope everyone understands that.
Is there any reason why the trash filewalker is launched only once every 24 hours? The node stores trash for 7 days, and trash emptying is already a slow procedure. To make it even slower, the node may delay the start of emptying for another 24 hours in the worst-case scenario. I hope this will be changed to 1 hour (I created a bug report yesterday).
Please stop pinging me with complaints. I would like to spend my time on fixing things. Just continue venting without me, please.
The thing is, at the moment they just don't have that luxury, which is time. The crucial clients are triggered, excited, and waiting. So the pressure is, as I can imagine, crushing.
I just want to highlight that Storj DID address SNOs' problems with the discrepancy, and keeps addressing them; it's just that the slow rollout forces us all to wait weeks to see the changes on our nodes,
sometimes just to see whether that solved the problem for real or whether it needs more revision. So there you have it: the slow and careful process that you so advocate.
I am not the smart one here, but could the lack of bloom filters from SLC be because it's supposed to be all TTL data now? Even though maybe we really could benefit from continued bloom filters from it?
If a TTL piece isn't deleted (for any reason) according to its TTL value, then a bloom filter is needed to garbage-collect it into the trash (where it is then deleted according to the trash-cleanup schedule).
Yes, Storj has a QA satellite and tests updates before each release.