A few minutes ago one of my nodes got a strange load:
I collected information from the debug port:
curl -s localhost:7777/mon/ps > /tmp/result.txt
My node hasn't crashed, though; the load is now dropping:
and mon/ps shows normal output:
curl -s localhost:7777/mon/ps
[4855391130671764778] storj.io/storj/pkg/process.root() (elapsed: 192h47m2.345983103s)
[3791477542259197051] storj.io/storj/storagenode.(*Peer).Run() (elapsed: 192h47m0.903888308s)
[4502346184408064397] storj.io/storj/pkg/server.(*Server).Run() (elapsed: 192h47m0.291775956s)
[8582672752331851760] storj.io/storj/private/version/checker.(*Service).Run() (elapsed: 192h47m0.290965621s)
[8936912386000918787] storj.io/storj/storagenode/bandwidth.(*Service).Run() (elapsed: 192h47m0.291707171s)
[7163324842844850360] storj.io/storj/storagenode/collector.(*Service).Run() (elapsed: 192h47m0.291668141s)
[5567454460225998770] storj.io/storj/storagenode/console/consoleserver.(*Server).Run() (elapsed: 192h47m0.290566934s)
[6453650888101349660] storj.io/storj/storagenode/contact.(*Chore).Run() (elapsed: 192h47m0.29174229s)
[6984412994902266878] storj.io/storj/storagenode/gracefulexit.(*Chore).Run() (elapsed: 192h47m0.292354227s)
[7694086949645767578] storj.io/storj/storagenode/monitor.(*Service).Run() (elapsed: 192h47m0.29232423s)
[5389737299688781933] storj.io/storj/storagenode/orders.(*Service).Run() (elapsed: 192h47m0.291652855s)
[244302455947060197] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 192h47m0.292358616s)
[826013095517286765] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 4.749977719s)
[2067643844467071328] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 4.74995633s)
[6149165099796225338] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 1.951438616s)
[7390795848746009901] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 1.951387085s)
I found the root cause of this load: the node was emptying the trash after the 7-day retention period:
used space before:
used space after:
~740GB was cleaned!
The same load spike happens with every batch of trash cleanup:
It is a really extreme load; I'm afraid not all nodes can survive it.
After 10min:
curl -s localhost:7777/mon/ps
shows a lot of sections like this:
[2123516809455615763] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 18m14.504998729s)
[6009436491625855895] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 18m14.505009388s)
[717481803337676253] storj.io/storj/storagenode/pieces.(*Writer).Commit() (elapsed: 18m14.403496472s)
[4603401485507916384] storj.io/storj/storage/filestore.(*blobWriter).Commit() (elapsed: 18m14.403463193s)
[8489321167678156516] storj.io/storj/storage/filestore.(*Dir).Commit() (elapsed: 18m14.403458266s)
curl -s localhost:7777/mon/ps |grep -c "Writer"
1178
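For anyone who wants to run the same check, here is the filter spelled out against a small saved sample (the sample lines are copied from the mon/ps output above; on a live node you would pipe in `curl -s localhost:7777/mon/ps` instead of using the here-doc):

```shell
# Save a few lines of a mon/ps dump (copied from the output above)
cat > /tmp/mon_ps_sample.txt <<'EOF'
[826013095517286765] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 4.749977719s)
[717481803337676253] storj.io/storj/storagenode/pieces.(*Writer).Commit() (elapsed: 18m14.403496472s)
[4603401485507916384] storj.io/storj/storage/filestore.(*blobWriter).Commit() (elapsed: 18m14.403463193s)
EOF
# Count goroutines stuck in filestore writers, same as the grep above
grep -c "Writer" /tmp/mon_ps_sample.txt
# Show only calls stuck for whole minutes: minutes directly follow
# "elapsed:" (the hour-long .Run() service loops start with "...h"
# and therefore don't match)
grep -E 'elapsed: [0-9]+m' /tmp/mon_ps_sample.txt
```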
It is really crazy.
Please pay attention!
It can make the node almost unresponsive when the garbage collector is at work on a low-powered node.
A desktop PC will handle this better. On the other hand, today the garbage collector deletes after 7 days; in the future it will delete files as the satellite tells it to, so there will not be so many at one time as there are now.
After 35 min:
curl -s localhost:7777/mon/ps |grep -c "Writer"
1890
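To put the two counts in perspective, a rough back-of-the-envelope calculation (assuming the samples above were taken about 25 minutes apart, per the 10 min and 35 min marks):

```shell
# 1178 stuck writers at ~10 min, 1890 at ~35 min (counts from above)
echo $(( (1890 - 1178) / 25 )) "new stuck writers per minute (roughly)"
```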
35 minutes of deleting files?
After 40 min:
New alerts:
- Pool tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Current alerts:
- Pool tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
It's really a disaster!
Yep, it's still running.
Storj wiping the whole node? Looks like it takes Docker and the drives with it, self-destruct…
After 1 h the cleaning finished and the load is dropping (the peak was 1200). The node survived, but I lost one disk!
What configuration do you have, RAID1?
Here is the configuration of that node: RAIDZ1 (8 HDD) + L2ARC + ZIL
I will try to reproduce it in my test environment; it's a really interesting problem…
Oh nice so you found a faulty disk!
Did you have a regular scrub set up so you find this kind of situation early?
I had a funny situation once that my drives would fault out whenever they got into a high load. Turned out that I hadn’t considered that my homemade SATA power adapter had high inductance and voltages were dropping by 20%+ during disk seek.
Had the same problem; changed to SATA cables with locks, plus the power cable. Working fine now.
Of course, I have a scrub configured on this storage, and alerting is also configured.
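For reference, a regular scrub like that is typically just a cron entry (a sketch, not my exact config; the pool name `tank` is taken from the alerts above, and the schedule is an arbitrary example):

```shell
# /etc/crontab entry: scrub the "tank" pool every Sunday at 03:00
0 3 * * 0 root /sbin/zpool scrub tank
```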
This disk was kicked from the RAID because its response time was too high; the disk itself doesn't have any issues, and SMART doesn't show any either. It had been working fine for more than half a year until this extreme load.
But I suspect the root cause is neither on the Storj side nor on the disk side.
Give me some time to finish my investigation; I will update this topic when I confirm my suspicions.