Sorry if this duplicates an existing topic, but I could not find a resolution for my situation and I do not have much time left.
About 1h after starting, my node stops responding on QUIC (it is still online and I can open the dashboard, but it shows no data), and QUIC is shown in red as “misconfigured”. The configuration should be correct, as I did not change anything and the node has been working for 1-2 years. I checked that both ports, 14002 and 28967, are open. After 1h, 28967 shows as closed.
The score has already dropped from 97% (yellow) on all servers to 95% (red). I do not think this is a port forwarding issue.
I suspect something is wrong with the disk, or some data is corrupted. The logs point to: “… This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filestore error: config/storage/trash/ukfu6bhbboxilvt7jrwlqk7”.
So should I stop the node and run some program on Linux to check file integrity? Should I scan for bad sectors, or remove some DB files and resync? That would take a lot of time, though. What can I do to fix this quickly so that I do not get permanently disqualified?
I believe it is probably just a corrupted file, not a bad disk, but how can I verify that?
P.S. The disk is fine, it is not SMR.
Please help.
I have cut out all sensitive data, I think.
Logs and errors:
Thank you @daki82, you give me hope. I checked the file system, and that HDD is formatted as NTFS. So that means I can take it to a Windows PC and just run error checking. That should not damage any stored files if the HDD is OK. Is that the correct approach, or is there a more appropriate program?
Maybe wait a month for the online score to recover.
If it does not run fine after that, consider copying the data to a safe place, reformatting the drive to ext4, copying the data back, and running the node with the right parameters for the data path.
NTFS and Linux is an odd combo, though I don't know if it matters for 2 TB.
Also think about leaving 10% free space on the disk.
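For the free-space suggestion, here is a minimal sketch of how one could check the headroom on the data disk. The function names and the 10% default are just illustrative; it only uses `shutil.disk_usage` from the Python standard library, and the path you pass in is whatever your storage mount point happens to be.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the free share (0.0 to 1.0) of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_headroom(path: str, minimum: float = 0.10) -> bool:
    """True if at least `minimum` (10% by default) of the filesystem is free."""
    return free_fraction(path) >= minimum
```

Calling `has_headroom("/path/to/storage")` with your own mount point (the path here is a placeholder) returns `False` once less than 10% of the disk is free.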
I tested the disk on Windows; there were some errors, which were fixed automatically after the scan. I put it back in the node, and for now it is working and I do not see those errors anymore. I will check again after 3-5h and let you know, but it looks like it works again.
If you use Windows 10 or newer, please disable deduplication and compression for the disk before processing it under Windows; otherwise it could become incompatible with your Linux, and it will show corruption every time you use it under Linux.
For example:
Hi there, all seems to be working correctly, thank you @daki82. The discussion can be closed. Hi @Alexey, thank you for the article. I need to check that, as all my Storj nodes have NTFS partitions for customer data. I did that at the beginning because I thought it would be easier to connect the HDD to Windows for a scan or any other maintenance. I do not know Linux and its disk-related command line very well, which is why I chose NTFS. Again, thank you both for the support/help.
Sorry to bother you again. Unfortunately it worked for a few days, and now the node is again showing offline 1h after I restart it.
It shows the error below:
Probably not as I still get the issue:
2024-02-04T18:32:58Z INFO http: Accept error: accept tcp [::]:28967: accept4: too many open files; retrying in 1s {“process”: “storagenode”}
Ensure that the gateway binary is in /usr/local/bin and is named gateway (or edit the ExecStart line below to reflect the name and location of your binary)
I tried to follow the “scripts/gateway.service” instructions, but I am doing something wrong, as I do not have an entry for the gateway at /usr/local/bin/.
Ensure that you’ve run setup and have edited the configuration appropriately prior to starting the service with this script, and adjust the --config-dir option in the ExecStart line if necessary
It looks like I do not have access to that file. I tried the command below. How should I edit it to get access?
I also tried manually editing the line “#DefaultLimitNOFILE=16777216”.
I think I’ve done that much, but unfortunately it looks like it is still not enough.
You do not need to install a gateway; it was just an example of a similar configuration.
The parameter --ulimit nofile=262144:262144 in your docker run command should work.
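To confirm that the limit actually reached the node process, something like the sketch below may help. It assumes a Linux host where `/proc/<pid>/limits` and `/proc/<pid>/fd` are readable for the storagenode process (you may need root), and that you already know its PID; the function names are made up for this example.

```python
import os
from typing import Optional, Tuple


def parse_open_files_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Extract (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings because the kernel may print 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', soft, hard, 'files']
            return fields[3], fields[4]
    return None


def open_files_status(pid: int) -> Tuple[Optional[Tuple[str, str]], int]:
    """Return the (soft, hard) limit and the number of descriptors currently open."""
    with open(f"/proc/{pid}/limits") as fh:
        limits = parse_open_files_limit(fh.read())
    open_count = len(os.listdir(f"/proc/{pid}/fd"))
    return limits, open_count
```

If `open_files_status(pid)` reports a soft limit of 262144, the flag was applied. If the open count keeps climbing toward the limit anyway, the "too many open files" error will probably come back, only later.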
That flag helped: I restarted the node at 21:35, and it went offline at 00:01 according to the ping timeout on xxx:28967.
It ran longer than before with that flag. Unfortunately, it looks like there is still some issue. Below are all the errors I could find in the logs:
You are welcome!
The online score will fully recover after 30 days online. Each downtime requires an additional 30 days to recover.
I’m more worried that the audit score is not 100%. Please check your logs for errors related to audits:
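As a rough sketch of one way to do that, the filter below keeps only log lines that mention an audit or repair download and also contain the word "failed". The marker strings `GET_AUDIT` and `GET_REPAIR` are an assumption about how your node labels these requests in its log, and the function name is hypothetical, so adjust both to what you actually see.

```python
from typing import Iterable, Iterator

# Assumed markers for audit and repair traffic in the node log.
AUDIT_MARKERS = ("GET_AUDIT", "GET_REPAIR")


def failed_audit_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines that mention an audit/repair request and the word 'failed'."""
    for line in lines:
        if any(marker in line for marker in AUDIT_MARKERS) and "failed" in line:
            yield line.rstrip("\n")
```

You could feed it the log by iterating over `sys.stdin` and piping the output of `docker logs` for your container into the script, or by opening the log file directly if you redirect logs to a file.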