GEs failing for no apparent reason?

been running GEs on some satellites for my nodes and ran into this issue…

so i looked at the logs for this particular node, and this popped out at me…

2023-03-09T00:02:41.284Z DEBUG gracefulexit:chore.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777 finished {Process: storagenode}
2023-03-09T00:02:41.284Z ERROR gracefulexit:chore worker failed {Process: storagenode, error: graceful exit processing interrupted (node should reconnect and continue), errorVerbose: graceful exit processing interrupted (node should reconnect and continue)\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run:90\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing.func1:82\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}

doesn't that mean it finished the GE? or why are my nodes getting DQ'd because they are doing GE?

this has happened plenty of times already, and i can't seem to find any cause…
everything looks fine, and no satellite that i'm not running GE on has been DQ'd…


audits also look fine.

i got the last couple of years of logs for this node, so if there is an issue i can find it…
the GE for this node has been running for about 6 or 7 weeks thus far and still has some way to go…

Domain Name                        Node ID                                              Percent Complete  Successful  Completion Receipt
europe-north-1.tardigrade.io:7777  12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB  100.00%           Y           0a4730450221009842656c0330d744909bfa77aa14597c5c19feb93eba4dc48baf071f7574753702200aed37eeb15c49ba7ab9a9fa69744edc8e67448f2ecd0ccc9fd6e651054d1e7c1220f474535a19db00db4f8071a1be6c2551f4ded6a6e38f0818c68c68d0000000001a203379800c413968dcf8e31712dbbbb1184873e5b0fd0df5ba0111cc4000000000220c08d6efe79f0610c8bdc29801  
ap1.storj.io:7777                  121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6  26.32%            N           N/A                                                                                                                                                                                                                                                                                                                     
saltlake.tardigrade.io:7777        1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE   4.50%             N           0a4630440220634b2bccdc911d411c487c8473ddd63d06119f9d865827bacd19f40c84cbc5b70220246fa5908d5f59d70a361f954eed16455462fce4af2b239137001c88d430964410021a207b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd00000000022203379800c413968dcf8e31712dbbbb1184873e5b0fd0df5ba0111cc40000000002a0b08cfaba7a00610e1e3a233

I looked up your node from the node ID in the picture, and it looks like it failed Graceful Exit on saltlake because of a period where every piece failed to transfer. The satellite asks for pieces to be transferred to specific other nodes, and your node failed to transfer every one over a period of hours (maybe more, I didn’t load enough of the server logs to see how far back it went). Look in your logs for the message “failed to put piece”, and it should have more information about why each one failed.
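
For example, something like this should pull those out of a day's log and count them (the log file name is just an example):

# list the failed graceful-exit transfers, then count them
grep "failed to put piece" 2023-03-10-sn0003.log
grep -c "failed to put piece" 2023-03-10-sn0003.log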

If a node tells the satellite that it can’t transfer some large percentage of the pieces that we ask for, then Graceful Exit fails and a disqualification is issued.

This shouldn’t happen if your node was completely offline, because then it wouldn’t be able to receive or respond to the transfer requests. So I’m not sure what would have caused the errors.
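
One way to check for a window like that is to tally GE transfer successes and failures per hour; a rough sketch (log file name is an example, and the awk assumes the single-line log format pasted below):

grep -E "piece transferred to new storagenode|failed to put piece" 2023-03-10-sn0003.log |
  awk '{
    hour = substr($1, 1, 13)                                        # e.g. 2023-03-09T00
    kind = (/failed to put piece/ ? "FAILED" : "transferred")       # classify by message
    count[hour " " kind]++
  } END { for (k in count) print k, count[k] }' | sort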

this log piece contains one of those fails… i found 900 of them for yesterday.
but as you can see, the regular uploads and downloads function just fine…
even tho the GE transfers keep failing…
and the audit score is 100%

2023-03-09T00:08:14.243Z INFO piecestore upload started {Process: storagenode, Piece ID: 32K7VKIFH6USGZNGGG4QYFJ76LWVHZ5JBMRVKMNBLMZDQCGDVDIA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Available Space: 2727541174704, Remote Address: 63.250.57.147:64550}
2023-03-09T00:08:14.370Z INFO piecestore upload started {Process: storagenode, Piece ID: QLPYHBRIM4FNMLHFFMDEDCVNGMLMJMR4FSIMJWR6ZNHIRIHN4Q7Q, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 2727541174704, Remote Address: 63.250.57.147:58370}
2023-03-09T00:08:14.499Z INFO piecestore uploaded {Process: storagenode, Piece ID: QLPYHBRIM4FNMLHFFMDEDCVNGMLMJMR4FSIMJWR6ZNHIRIHN4Q7Q, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 7936, Remote Address: 63.250.57.147:58370}
2023-03-09T00:08:14.503Z INFO piecestore uploaded {Process: storagenode, Piece ID: QB3JHZ22TYJTJMHHK47JMHSH3DK4UKG7BWC3UGMSIIXJG4WHIPQQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 19456, Remote Address: 63.250.57.147:58030}
2023-03-09T00:08:15.374Z INFO piecetransfer piece transferred to new storagenode {Process: storagenode, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Piece ID: 3HS4EFD4FADLWOWECFSHUFQPWXYNQ6K36S6TY3XPHLL7C5PNP3CA, Storagenode ID: 1gRrYc8XPXfPkJFYFanNDFhPGTn3n2M3mUZ5j5zT5R12guK5Uf}
2023-03-09T00:08:16.056Z INFO piecestore upload started {Process: storagenode, Piece ID: AQNDWDLG4T754HK643HQ3O5LSGWIRSUZ4WMEMLX7UGWY4NERTDQA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Available Space: 2727541146288, Remote Address: 63.250.57.147:45034}
2023-03-09T00:08:16.151Z INFO piecestore uploaded {Process: storagenode, Piece ID: 4NRHYZWVUWQIEL5UIOAQXLVEQOLV6334VMSXRA2BQFW7GR27SYUA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Size: 2319360, Remote Address: 63.250.57.147:60972}
2023-03-09T00:08:16.299Z INFO piecetransfer piece transferred to new storagenode {Process: storagenode, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Piece ID: SKKFJ6NSTZ65Z7RVPGUTZBM2BBXYENQYN3JRCRSBVFG6P6LNTRYA, Storagenode ID: 12onKkBtpHy4ZhoucX5hGh1yDBVoSRieW7xN5p47912P3Tx2vJs}
2023-03-09T00:08:16.496Z INFO piecestore upload started {Process: storagenode, Piece ID: LSMH3UR5L6XQLGM4TYWY725H67YABTJVOETXCW62Z7X5O6N5PDMA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 2727538826416, Remote Address: 63.250.57.147:19364}
2023-03-09T00:08:16.596Z ERROR piecetransfer failed to put piece {Process: storagenode, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Piece ID: VOOKNFWCAREX7RMBOVKAMDK3C5ZCETOLWU4EFA7MFSRVW43RVLZA, Storagenode ID: 12i9ikCUT8GAHp4bV4CGwFfDzGUXBrpXdSRmeV1JwTNAxwBQh6U, error: ecclient: upload failed (node:12i9ikCUT8GAHp4bV4CGwFfDzGUXBrpXdSRmeV1JwTNAxwBQh6U, address:78.47.227.218:28967): protocol: expected piece hash; storage node overloaded, request limit: 32; EOF, errorVerbose: ecclient: upload failed (node:12i9ikCUT8GAHp4bV4CGwFfDzGUXBrpXdSRmeV1JwTNAxwBQh6U, address:78.47.227.218:28967): protocol: expected piece hash; storage node overloaded, request limit: 32; EOF\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-03-09T00:08:16.630Z INFO piecestore uploaded {Process: storagenode, Piece ID: LSMH3UR5L6XQLGM4TYWY725H67YABTJVOETXCW62Z7X5O6N5PDMA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 6912, Remote Address: 63.250.57.147:19364}
2023-03-09T00:08:16.764Z INFO piecetransfer piece transferred to new storagenode {Process: storagenode, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Piece ID: SBDE6MIJJG36A4JBBJB3VTSE4GW4QMULDEAI4J7WCMGZ2TGLYNLQ, Storagenode ID: 1jnjadKn5xPgtRr7VguuQa2MqzeQhyNXEAp4nmYrWzJ4gzehTL}
2023-03-09T00:08:16.940Z INFO piecestore uploaded {Process: storagenode, Piece ID: AQNDWDLG4T754HK643HQ3O5LSGWIRSUZ4WMEMLX7UGWY4NERTDQA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Size: 222976, Remote Address: 63.250.57.147:45034}
2023-03-09T00:08:17.347Z INFO piecestore upload started {Process: storagenode, Piece ID: WTUD7Y4SLHYFVBIF6XS6ZJJ2QHHRDE2JHF5AIYPPDX5MBDWR747Q, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 2727538595504, Remote Address: 63.250.57.147:10608}
2023-03-09T00:08:17.375Z INFO piecestore download started {Process: storagenode, Piece ID: CNRFJQVULN7D5CS4EAJHLZPAKXGTWJ5YOBM3VCABSZD2BUG66JCQ, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: GET, Offset: 57600, Size: 2116608, Remote Address: 63.250.57.147:61202}
2023-03-09T00:08:17.515Z INFO piecestore uploaded {Process: storagenode, Piece ID: WTUD7Y4SLHYFVBIF6XS6ZJJ2QHHRDE2JHF5AIYPPDX5MBDWR747Q, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 46336, Remote Address: 63.250.57.147:10608}
2023-03-09T00:08:17.525Z INFO piecetransfer piece transferred to new storagenode {Process: storagenode, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Piece ID: HUVXTHUHCVTXEJCBHMWKCXITXKP4SBK5X2DAKF6IWTKI66UTM5HQ, Storagenode ID: 12Hq4E5wWAPzrxZacP6gbFpKiBvbRXnPdSR4z9jRU443yCFqS2r}
2023-03-09T00:08:17.654Z INFO piecestore upload started {Process: storagenode, Piece ID: 6INBC24KKXLCVJL4NTJW45TKCPYNAVMGYYOZN7KYCMOMEN3TPE6A, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Available Space: 2727538548656, Remote Address: 63.250.57.147:50432}
2023-03-09T00:08:17.812Z INFO piecestore uploaded {Process: storagenode, Piece ID: 6INBC24KKXLCVJL4NTJW45TKCPYNAVMGYYOZN7KYCMOMEN3TPE6A, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Size: 37632, Remote Address: 63.250.57.147:50432}
2023-03-09T00:08:18.108Z INFO piecestore downloaded {Process: storagenode, Piece ID: CNRFJQVULN7D5CS4EAJHLZPAKXGTWJ5YOBM3VCABSZD2BUG66JCQ, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: GET, Offset: 57600, Size: 2174464, Remote Address: 63.250.57.147:61202}
2023-03-09T00:08:18.144Z INFO piecestore upload started {Process: storagenode, Piece ID: SEQQ55HUCTWGHDKQHPLHDNQEUOME3RHSTV4RSEO4R74FGXDR2PAQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 2727536190640, Remote Address: 63.250.57.147:21198}
2023-03-09T00:08:18.168Z INFO piecestore upload started {Process: storagenode, Piece ID: CCS4EFCJBYZSOEPGW77JUQGSEN5M3I2KSFL2JTOUG4XU3COTETAA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 2727536190640, Remote Address: 63.250.57.147:15122}
2023-03-09T00:08:18.194Z INFO piecestore uploaded {Process: storagenode, Piece ID: 32K7VKIFH6USGZNGGG4QYFJ76LWVHZ5JBMRVKMNBLMZDQCGDVDIA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Size: 2319360, Remote Address: 63.250.57.147:64550}
2023-03-09T00:08:18.307Z INFO piecestore uploaded {Process: storagenode, Piece ID: CCS4EFCJBYZSOEPGW77JUQGSEN5M3I2KSFL2JTOUG4XU3COTETAA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 12544, Remote Address: 63.250.57.147:15122}
2023-03-09T00:08:18.449Z INFO piecestore upload started {Process: storagenode, Piece ID: BZERXCF7U5VG6SNN66N2ZE7A7SMWLZPOYIUPLQTHVY3MGB2JASGQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 2727536177584, Remote Address: 63.250.57.147:35364}
2023-03-09T00:08:18.508Z INFO piecestore download started {Process: storagenode, Piece ID: 5ALPWZYTZRQWC7J53RFYCUURKCMJLFNSDOABCR4L5Q62YJS5MRQA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: GET, Offset: 1233664, Size: 940544, Remote Address: 63.250.57.147:61218}
2023-03-09T00:08:18.576Z INFO piecestore uploaded {Process: storagenode, Piece ID: SEQQ55HUCTWGHDKQHPLHDNQEUOME3RHSTV4RSEO4R74FGXDR2PAQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 24320, Remote Address: 63.250.57.147:21198}
2023-03-09T00:08:18.726Z INFO piecestore uploaded {Process: storagenode, Piece ID: BZERXCF7U5VG6SNN66N2ZE7A7SMWLZPOYIUPLQTHVY3MGB2JASGQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Size: 67328, Remote Address: 63.250.57.147:35364}
2023-03-09T00:08:18.853Z INFO piecedeleter delete piece sent to trash {Process: storagenode, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: 5RBXUWQB56I2LTDZCUW63TXN4TGG7BWJQDTI55W6RZL6XVGJHYBA}
2023-03-09T00:08:18.917Z INFO piecestore upload started {Process: storagenode, Piece ID: H4OTAHE3VH5TH437UHGYI23I2BYESW243SSXFXYLESZBZ55OV2PA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Available Space: 2727536084912, Remote Address: 63.250.57.147:56860}

this one is kind of interesting…
it seemed to try to upload the file a couple of times without success


2023-03-09T00:08:16.596Z	ERROR	piecetransfer	failed to put piece	{"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "VOOKNFWCAREX7RMBOVKAMDK3C5ZCETOLWU4EFA7MFSRVW43RVLZA", "Storagenode ID": "12i9ikCUT8GAHp4bV4CGwFfDzGUXBrpXdSRmeV1JwTNAxwBQh6U", "error": "ecclient: upload failed (node:12i9ikCUT8GAHp4bV4CGwFfDzGUXBrpXdSRmeV1JwTNAxwBQh6U, address:78.47.227.218:28967): protocol: expected piece hash; storage node overloaded, request limit: 32; EOF", "errorVerbose": "ecclient: upload failed (node:12i9ikCUT8GAHp4bV4CGwFfDzGUXBrpXdSRmeV1JwTNAxwBQh6U, address:78.47.227.218:28967): protocol: expected piece hash; storage node overloaded, request limit: 32; EOF\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}

Yeah, it’s expected that a few transfers will fail because of the remote nodes. This one seems pretty harmless. It’s the 900 errors from yesterday that seem to have caused the problem.

but everything is functioning normally on my end…
the node uploads and downloads pieces all the time… it's only the GE that seems to fail…
for whatever reason

What do the yesterday errors say?

If there is a case where large numbers of outbound transfers fail all in a row and it’s not a problem with the originating node, maybe we can learn to detect that case and avoid penalizing the node.
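
One quick way to summarize them might be to strip each failure down to its error string and count the distinct reasons; a sketch, assuming the quoted formatting of the raw log line pasted above:

# group the "failed to put piece" errors by their error text
grep "failed to put piece" 2023-03-10-sn0003.log \
  | sed 's/.*"error": "\([^"]*\)".*/\1/' \
  | sort | uniq -c | sort -rn

Since the error text embeds the remote node ID and address, trimming the "upload failed (node:…, address:…): " prefix first would group the lines by root cause rather than by target node.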

the success rates look pretty reasonable.
i can send you the entire log if you want…

./successrate.sh 2023-03-10-sn0003.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            1227
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                882
Fail Rate:             3.189%
Canceled:              5329
Cancel Rate:           19.270%
Successful:            21443
Success Rate:          77.540%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                359
Fail Rate:             0.693%
Canceled:              125
Cancel Rate:           0.241%
Successful:            51298
Success Rate:          99.065%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            3554
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            1713
Success Rate:          100.000%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            23893
Success Rate:          100.000%

Sure, the log might help.

the server is also running within spec…

and the failing GEs have been an ongoing issue for weeks now across the nodes i set to run it… but only the satellites being GE'd get DQ'd.

you got an email you want me to send the log to?

let me know if you want another date, because i got all of them

<thepaul at storj.io> should work. The most relevant period is probably from 2023-03-08 12:50:00 UTC - 2023-03-09 12:50:00 UTC.

okay logs for the 8th, 9th and 10th have been sent to your email.

initially i thought it might be bandwidth related, because i only have 1Gbit symmetric fiber internet.
but it does seem able to supply about 800mbit in each direction, and the LAN is faster than the internet.

and as this graph (in MB/s) shows, the network usage seems to be going down…
while the DQ of satellites being GE'd continues

i guess it could be bandwidth… but that would just be odd, i think…
from what i've tested, my internet should be faster than what these graphs show.
but maybe it isn't… still, it's way beyond the recommended 5mbit upload for each node.

Says right in the error the node is overloaded. Maybe I’m stating the obvious here but wouldn’t that explain why transfers would be failing? There’s got to be a bottleneck somewhere.

Clearly, based on your bandwidth, you must have multiple proxied nodes or you'd never have that much. Are these nodes all on separate drives? It also looks like the system might have gone down there, based on the missing data in the graph. Was that node, or any others sharing the drive, restarted, causing the filewalker process to run?

Also, what sort of router are you using? With that kind of sustained throughput you might have the CPU on your router pegged.

i had an outage around that time… wasn't for multiple days tho…
not sure why the graph shows it like that.

yeah i run multiple nodes; restarting them all at one time can be a bit rough, but it's over in 6 hours, each batch running the filewalker in a couple of hours.

i run 5 x 6-drive raidz1 vdevs with a 2x 1.6TB NVMe special vdev for metadata and small files (rough sketch of the pool below)

  • slog and L2ARC on separate sata SSDs + 224GB RAM in the server
    usually doing less than 10k IOPS, tho while the filewalkers are running it will do 500000 IOPS for 6 hours until it's done :smiley:
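
roughly speaking the pool looks something like this (pool, device names and the small-block cutoff are made up for illustration):

# 5 x 6-disk raidz1 vdevs, mirrored NVMe special vdev, slog + l2arc on sata SSDs
zpool create tank \
  raidz1 d01 d02 d03 d04 d05 d06 \
  raidz1 d07 d08 d09 d10 d11 d12 \
  raidz1 d13 d14 d15 d16 d17 d18 \
  raidz1 d19 d20 d21 d22 d23 d24 \
  raidz1 d25 d26 d27 d28 d29 d30 \
  special mirror nvme0n1 nvme1n1 \
  log sata-ssd-slog \
  cache sata-ssd-l2arc

# send metadata and blocks of 64K and under to the special vdev (the cutoff is a tuning choice)
zfs set special_small_blocks=64K tank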

it's become a well-tuned setup over the four years i've been running it, and i haven't seen a satellite get DQ'd before, so it would seem odd that it would start now when doing GEs, when everything else seems to run within spec.

the router is running as a pfsense vm on my dual-socket server (2x 1st-gen 32-core EPYCs) with dual 10gbit nics
it's happy as a clam and can easily keep up.

for a while i also considered that the DQ might be correlated with nodes updating / restarting, but i investigated that and it was happening to nodes that hadn’t been restarted in over a week…
so that clearly wasn’t related.


So you’ve had others DQ’d that you weren’t trying to exit on?

i've tinkered a good deal with the pfsense setup… tried passthrough for the nics to the vms, and for a while even ran pfsense on another dedicated host. in the end i went back to the setup i'm running now, where it's a vm, because it's nice to be able to migrate the router around between servers so i can shut them down when i need to…

i love and hate my virtual pfsense, but it works pretty well…

ofc when i have a power outage or such, it's a bit annoying if there are issues.
have been thinking about getting replication set up for the pfsense vm, so that if one server breaks the other one can take over without issue…

have also thought about getting some dedicated pfsense gear… but it's been working pretty well since i got familiar with how to run pfsense as a vm.

yeah my special small block vdevs take care of the metadata and small-file writes; those help a ton… not sure my setup would run without them…
but they also cause a ton of wear on the SSDs.

Storj write IO is kinda insane, but the special small block vdevs are magic.

so much SSD wear in my case tho.

 ioMemory Adapter Controller, Product Number:00D8431, SN:1504G0637
        ioMemory Adapter Controller, PN:00AE988
        Microcode Versions: App:0.0.15.0
        Powerloss protection: protected
        PCI:07:00.0, Slot Number:53
        Vendor:1aed, Device:3002, Sub vendor:1014, Sub device:4d3
        Firmware v8.9.8, rev 20161119 Public
        1600.00 GBytes device size
        Format: v501, 3125000000 sectors of 512 bytes
        PCIe slot available power: 25.00W
        PCIe negotiated link: 8 lanes at 5.0 Gt/sec each, 4000.00 MBytes/sec total
        Internal temperature: 49.22 degC, max 60.54 degC
        Internal voltage: avg 1.01V, max 1.01V
        Aux voltage: avg 1.79V, max 1.81V
        Reserve space status: Healthy; Reserves: 76.36%, warn at 10.00%
        Active media: 97.00%
        Rated PBW: 5.50 PB, 20.59% remaining
        Lifetime data volumes:
           Physical bytes written: 4,367,487,883,457,392
           Physical bytes read   : 3,667,873,249,147,968

will have to replace this SSD in the coming weeks… already got
an Intel DC P4600 3.2TB drive with like 15-30PBW endurance ready to replace it with…
just been waiting for my GEs to finish.
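
assuming the special vdev is a mirror, the swap itself is just an in-place replace and a resilver (device names made up):

# attach the new P4600 in place of the worn ioMemory card and let zfs resilver
zpool replace tank old-iomemory-dev new-p4600-dev
zpool status tank    # watch resilver progress before pulling the old card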

Storj migrations are hell… they take forever even on non-ZFS setups.
tho using zfs send | zfs recv runs like 10x or 12x faster than rsync, because it moves the data sequentially.
so when i do migrate nodes i will use that.
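
the rough shape of that migration, with made-up dataset names:

# first pass while the node is still running
zfs snapshot tank/storagenode@migrate1
zfs send tank/storagenode@migrate1 | zfs recv newpool/storagenode

# stop the node, send only the delta since the first snapshot, then start it on the new pool
zfs snapshot tank/storagenode@migrate2
zfs send -i @migrate1 tank/storagenode@migrate2 | zfs recv -F newpool/storagenode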

nope never had a DQ before, barely even had any real data loss in the 4-5 years i’ve been running storagenodes.

did use sync=disabled on zfs for extra speed at times, which has caused some light data loss, but audits have never dipped below 99.97% or so
sync=disabled is pretty bad if something goes wrong, like a power outage, a server stall, or whatever other hell can happen…
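
it's just a dataset property, so toggling it is one command either way (dataset name made up); the exposure is roughly the last few seconds of writes sitting in RAM when a crash hits:

zfs set sync=disabled tank/storagenode   # faster, unsafe on crash/power loss
zfs get sync tank/storagenode            # check the current setting
zfs set sync=standard tank/storagenode   # back to the default behaviour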

everything has been running smoothly for a long time now… else i wouldn't have tried to do the GEs

and it's not really that i care anyway… the held amount is basically nothing…
but GE is a feature that should work correctly.
so now that a node that actually was being logged got hit by the issue, i figured i would see if StorjLabs had some insights into why it was going wrong.

because i can’t figure it out.