Raspberry 3/4 Node Owner - Did you do any optimizations?

P1R4T3 · February 12, 2020, 9:24am

How to set up noatime? This is my config:
/dev/sda1 /mnt/storagenode2 ext4 defaults,noatime 0 2

But when I run the mount command, it shows relatime.

naxbc · February 12, 2020, 10:03am

You need to write it in /etc/fstab and then reboot.
That’s it.
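To confirm that the option is actually in effect after the reboot (remounting the filesystem should also apply it on most Linux systems), you can look at the options field of the matching line in /proc/mounts. Here is a minimal Python sketch of that check. The helper name `mount_options` and the two sample lines are made up for illustration; on a real system you would pass in the contents of /proc/mounts instead.

```python
def mount_options(mounts_text, mountpoint):
    """Return the option list for `mountpoint`, or None if it is not mounted.

    Each line of /proc/mounts has the form:
        device mountpoint fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            return fields[3].split(",")
    return None


# Illustrative sample lines (not taken from a real system).
before = "/dev/sda1 /mnt/storagenode2 ext4 rw,relatime 0 0\n"
after = "/dev/sda1 /mnt/storagenode2 ext4 rw,noatime 0 0\n"

print(mount_options(before, "/mnt/storagenode2"))
print(mount_options(after, "/mnt/storagenode2"))
```

If the list still contains relatime, the new fstab entry has not been applied to the running mount yet.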

P1R4T3 · February 12, 2020, 10:33am

I did that, but forgot to reboot lol. Thanks

xopok · February 12, 2020, 12:36pm

@naxbc, this will be fairly easy to confirm.
Tonight I’ll stop the node on the RPi, plug the same HDD into my beefy home PC and run the node there. We’ll see whether it makes a difference to the upload success rate. Stay tuned.

anon27637763 · February 12, 2020, 12:38pm

The SoC boards share bus bandwidth between peripherals.

So yes indeed… the RPi boards are going to be slower to respond… especially in high traffic network conditions since the Ethernet port will consume larger portions of the USB port’s bandwidth.

naxbc · February 12, 2020, 12:43pm

It will.
You don’t need to test, as I have 2 Synology units, 2 Microservers and 2 RPis.

BrightSilence · February 12, 2020, 1:19pm

If you do this, don’t forget that you also need to use the same identity on the other system.

twl · February 12, 2020, 1:22pm

The RPi 4 actually has a dedicated Ethernet interface with its own PHY chip, so Ethernet no longer shares USB bandwidth.

P1R4T3 · February 12, 2020, 1:38pm

Guys, what is your I/O wait?
I have 4.6 - 4.8.

xopok · February 12, 2020, 1:46pm

@anon27637763 unlike on the RPi 1 to 3, this is no longer true for the RPi 4.
Peripherals still have to share some bus, of course, but Ethernet and USB no longer sit on a single USB 2.0 bus.

I still don’t have a good idea of what the RPi 4 might be missing that is clearly present in @naxbc’s Synology. I checked my system load and it’s next to zero: the HDD’s utilization is less than 1% with svctm consistently below 10ms, the CPU is mostly idling, the network is free, and ping to the West Coast is a bit over 100ms now.

You said you have >80% success rate. Let’s look at my 25th percentile for failures.

$ curl localhost:7777/mon/funcs
...
[...] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload
  parents: ...
  current: 1, highwater: 6, success: 5689, errors: 55162, panics: 0
  error grpc_Internal: 55162
  success times:
    0.00: 15.202296ms
    0.10: 34.873384ms
    0.25: 62.484791ms
    0.50: 476.880576ms
    0.75: 1.110103072s
    0.90: 3.235110912s
    0.95: 3.980484979s
    1.00: 5m1.19428096s
    avg: 1.92212264s
    ravg: 5.661211648s
  failure times:
    0.00: 38.197128ms
    0.10: 45.271315ms
    0.25: 56.481385ms  <-- Lower than 1 roundtrip Europe <-> US!
    0.50: 434.786048ms
    0.75: 1.097342608s
    0.90: 3.935681408s
    0.95: 5.744314905s
    1.00: 13.595103232s
    avg: 1.517489839s
    ravg: 1.27996352s

I’m sure I need to take these numbers with a grain of salt, but the 25th percentile of all failed upload attempts is 56ms!
It means that by the time the uploader sends me a “context cancelled”, my end of the transatlantic fiber cable has probably not even received the first byte of the upload.
However powerful the hardware I install in place of this RPi, I won’t be able to win these uploads.
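For anyone who wants to compare these numbers without reading them by eye, here is a rough Python sketch that pulls the counters and percentiles out of a doUpload block and converts the Go-style durations (like `56.481385ms` or `5m1.19428096s`) to seconds. The function names are made up, and the parser only assumes the text layout shown in this thread; it is not an official way to read the debug endpoint.

```python
import re

_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
                 "s": 1.0, "m": 60.0, "h": 3600.0}
# "ms" is listed before "m" and "s" so that milliseconds match first.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_go_duration(text):
    """Convert a Go-style duration string such as '5m1.19s' to seconds."""
    text = text.strip()
    tokens = _TOKEN.findall(text)
    if not tokens or "".join(v + u for v, u in tokens) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(v) * _UNIT_SECONDS[u] for v, u in tokens)


def parse_do_upload(block):
    """Extract success/error counters and percentile tables from one block."""
    stats = {"success": None, "errors": None,
             "success times": {}, "failure times": {}}
    section = None
    for line in block.splitlines():
        stripped = line.strip()
        counters = re.search(r"success: (\d+), errors: (\d+)", stripped)
        if counters:
            stats["success"] = int(counters.group(1))
            stats["errors"] = int(counters.group(2))
        elif stripped in ("success times:", "failure times:"):
            section = stripped[:-1]
        else:
            # Percentile lines look like '0.25: 56.481385ms'; trailing
            # annotations are ignored, and avg/ravg lines do not match.
            m = re.match(r"(\d\.\d+):\s*(\S+)", stripped)
            if m and section:
                stats[section][float(m.group(1))] = parse_go_duration(m.group(2))
    return stats


# Excerpt of the output quoted in this post.
SAMPLE = """\
  current: 1, highwater: 6, success: 5689, errors: 55162, panics: 0
  error grpc_Internal: 55162
  success times:
    0.25: 62.484791ms
    1.00: 5m1.19428096s
    avg: 1.92212264s
  failure times:
    0.25: 56.481385ms  <-- Lower than 1 roundtrip Europe <-> US!
    0.50: 434.786048ms
"""

stats = parse_do_upload(SAMPLE)
rate = stats["success"] / (stats["success"] + stats["errors"])
print(f"success rate: {rate:.1%}")
print(f"p25 failure: {stats['failure times'][0.25] * 1000:.1f} ms")
```

The success rate here is simply success / (success + errors) from the counters line, the same ratio used elsewhere in this thread.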

Do you have any other ideas about what might be going wrong?

xopok · February 12, 2020, 1:54pm

Thanks for the reminder
When I need to do some time-consuming maintenance on my RPi I always run the same exact node on the PC.

Btw, @BrightSilence do you happen to know why identity files are stored separately from the storage/ directory? They are kinda tied together for the remainder of the node’s life.

anon27637763 · February 12, 2020, 2:11pm

…

My Feb 2020 Stats so far:

Here is the output of successrate.sh for my node since Feb 2nd 2020:

========== AUDIT ============= 
Successful:           6708 
Recoverable failed:   0 
Unrecoverable failed: 0 
Success Rate Min:     100.000%
Success Rate Max:     100.000%
========== DOWNLOAD ========== 
Successful:           30359 
Failed:               14 
Success Rate:         99.954%
========== UPLOAD ============ 
Successful:           162923 
Rejected:             0 
Failed:               42634 
Acceptance Rate:      100.000%
Success Rate:         79.259%
========== REPAIR DOWNLOAD === 
Successful:           1563 
Failed:               0 
Success Rate:         100.000%
========== REPAIR UPLOAD ===== 
Successful:           6989 
Failed:               1371 
Success Rate:         83.600%

If I only look at Stefan:

grep 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW node.log | grep -c "uploaded"
8695

grep 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW node.log | grep -c "upload failed"
2128

8695/(8695+2128) = 0.80338
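The same count can be sketched in a few lines of Python for anyone who prefers a single pass over the log. The helper name and the sample log lines below are invented for illustration (the real log format has more fields); the only thing taken from this post is the matching on the satellite ID and on the strings "uploaded" and "upload failed".

```python
def upload_success_rate(log_lines, satellite_id):
    """Count uploaded / failed lines for one satellite and return the ratio."""
    ok = failed = 0
    for line in log_lines:
        if satellite_id not in line:
            continue
        if "upload failed" in line:
            failed += 1
        elif "uploaded" in line:
            ok += 1
    total = ok + failed
    return ok, failed, (ok / total if total else None)


SAT = "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"

# Made-up log lines, only to show the matching logic.
demo_log = [
    f"INFO piecestore uploaded {SAT}",
    f"INFO piecestore upload failed {SAT}",
    f"INFO piecestore uploaded {SAT}",
    "INFO piecestore uploaded some-other-satellite",
]
print(upload_success_rate(demo_log, SAT))

# The arithmetic from this post, using the two grep counts.
ok, failed = 8695, 2128
print(round(ok / (ok + failed), 5))
```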

So, yes indeed, my node sitting in the Northeast US is successful in catching 80% of offered uploads from the 118 satellite sitting in Germany.

RPi Hardware Question

There are many factors in how nodes may catch data pieces. Perhaps just a few tiny differences in processing speed or overall network latency may be increasing my node’s success rate versus other nodes. However, on the same network path, any RPi board connected to a USB HDD is going to be slower at catching data than my node, which runs on dual Xeon server hardware with very fast SAS RAID drives.

My hardware is not representative of general consumer hardware. However, perhaps this particular month is the dividing moment when the data on the Storj network begins its inevitable centralization into mostly datacenter nodes.

It should be noted that most ISPs dole out low priority bandwidth to consumer priced end-users. So, there may be nothing that some node operators can do to change the external WAN network connection bottlenecks. Also, in some geographical areas, WAN Internet is run over cellular networks before connecting back to cables.

In short, there are simply too many variables to be considered to make an apples-to-apples comparison between Node A and Node B.

xopok · February 12, 2020, 2:22pm

^^^ THIS ^^^ is the most and the foremost and probably the only relevant part for our investigation

It doesn’t matter which satellite the client initiates uploads with; the actual data is sent directly from the client.

Your dual Xeon with SAS-connected RAID drives is probably 1ms faster at writing the user data to the disk cache before the write is considered done. The remaining 99ms of the latency difference come from location.

anon27637763 · February 12, 2020, 2:26pm

I’ve looked at the connecting IP addresses…

For the most part… when I was looking last week… the connecting clients were mostly Central Europe.

xopok · February 12, 2020, 3:08pm

If you happen to have the debug interface open on port 7777, can you post the doUpload section of your curl localhost:7777/mon/funcs output? It would be interesting to see your percentiles.

It’s a pity of course that the docker log doesn’t contain peers’ IPs…

anon27637763 · February 12, 2020, 3:54pm

I hadn’t been running the debug mode… figured it would increase resource pull and/or latency…

Here are my doUpload stats after about 2 minutes…

[7870591835146771948] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload
  parents: 6044685167442717505
  current: 0, highwater: 3, success: 37, errors: 8, panics: 0
  error grpc_Internal: 8
  success times:
    0.00: 6.592747ms
    0.10: 8.45673ms
    0.25: 16.348003ms
    0.50: 1.814213888s
    0.75: 2.330315776s
    0.90: 3.009813555s
    0.95: 3.707426406s
    1.00: 7.547346944s
    avg: 1.582796238s
    ravg: 1.582796288s
  failure times:
    0.00: 6.65206ms
    0.10: 7.698964ms
    0.25: 10.01475ms
    0.50: 1.101251136s
    0.75: 2.305745216s
    0.90: 2.917560473s
    0.95: 3.619623244s
    1.00: 4.321686016s
    avg: 1.395976741s
    ravg: 1.395976704s

Pentium100 · February 12, 2020, 3:55pm

I’m in Lithuania (so, far away from the USA) and I got 72% upload success on the 118U satellite. I also have a server with Xeon CPUs, but on a home connection and with SATA HDDs (with SSD caching).
I restarted my node with debug mode on, will see what it shows.
EDIT: after 4 hours the results are like this:

[6993468895157208617] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload
  parents: 3629722951528974698
  current: 0, highwater: 5, success: 2902, errors: 1105, panics: 0
  error grpc_Internal: 1105
  success times:
    0.00: 11.795547ms
    0.10: 14.103368ms
    0.25: 16.439526ms
    0.50: 26.514155ms
    0.75: 550.93256ms
    0.90: 3.83459863s
    0.95: 5.551193446s
    1.00: 2m47.483523072s
    avg: 1.427969836s
    ravg: 4.040713472s
  failure times:
    0.00: 12.05129ms
    0.10: 14.681443ms
    0.25: 17.2294ms
    0.50: 401.777232ms
    0.75: 1.268356416s
    0.90: 3.649062528s
    0.95: 4.881777817s
    1.00: 16.627924992s
    avg: 1.811662038s
    ravg: 1.348194688s

xopok · February 13, 2020, 10:22am

This is my WTF moment :-/

Raspberry Pi 4, after about 3 hours success rate is 10% (223/(223+2108)):

[...] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload
  parents: ...
  current: 0, highwater: 6, success: 223, errors: 2108, panics: 0
  error grpc_Internal: 2108
  success times:
    0.00: 15.440444ms  <-- The fastest RPi can do.
    0.10: 39.124956ms  <-- 10th percentile for PC is 4ms.
    0.25: 68.928563ms        What is 10x faster in PC?!
    0.50: 684.883872ms
    0.75: 2.815387776s
    0.90: 6.221891072s
    0.95: 13.711663769s
    1.00: 1m5.97943296s
    avg: 3.601078409s
    ravg: 3.351315712s
  failure times:
    0.00: 16.973148ms
    0.10: 37.156816ms
    0.25: 60.174858ms
    0.50: 142.801496ms
    0.75: 1.633698784s
    0.90: 4.464381952s
    0.95: 1m12.833417343s
    1.00: 2m11.451199488s
    avg: 4.006768289s
    ravg: 7.786110464s

PC on the same network and with the same USB 3.0 external disk, success rate 70% (5799/(5799+2525)):

[...] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload
  parents: ...
  current: 0, highwater: 8, success: 5799, errors: 2525, panics: 0
  error grpc_InvalidArgument: 2
  error grpc_Internal: 2523
  success times:
    0.00: 2.821863ms
    0.10: 4.183437ms
    0.25: 4.694519ms
    0.50: 326.346512ms
    0.75: 2.28331808s
    0.90: 4.491589222s
    0.95: 6.590146687s
    1.00: 2m38.454153216s
    avg: 3.693428383s
    ravg: 4.640048128s
  failure times:
    0.00: 3.260301ms
    0.10: 4.153933ms
    0.25: 4.273524ms
    0.50: 5.060252ms
    0.75: 468.568048ms
    0.90: 2.515923788s
    0.95: 4.188076723s
    1.00: 26.7802624s
    avg: 4.60126702s
    ravg: 1.06716224s
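To put a number on the gap, here is a small Python sketch that divides the RPi success-time percentiles by the PC ones. The values are copied from the two outputs above (in milliseconds); the helper name `slowdown` is just something made up for this example.

```python
# Success-time percentiles in milliseconds, copied from the two outputs above.
rpi_ms = {0.00: 15.440444, 0.10: 39.124956, 0.25: 68.928563, 0.50: 684.883872}
pc_ms = {0.00: 2.821863, 0.10: 4.183437, 0.25: 4.694519, 0.50: 326.346512}


def slowdown(slow, fast):
    """Ratio slow/fast for every percentile present in both tables."""
    return {p: slow[p] / fast[p] for p in slow if p in fast}


ratios = slowdown(rpi_ms, pc_ms)
for p in sorted(ratios):
    print(f"p{round(p * 100):02d}: RPi takes {ratios[p]:.1f}x as long as the PC")
```

The gap is largest in the fast percentiles and shrinks towards the median, which suggests that the difference matters most for the uploads that finish within a few milliseconds on the PC.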

I tried playing with the CPU governors and forced them to ‘performance’, but with no observable effect.

Puzzled.

MadeInGermany · February 13, 2020, 10:44am

Nice that so many people are interested in this.
I hope we will find the reason, and maybe something that we can change so that our Raspberry nodes run better.

P1R4T3 · February 13, 2020, 11:17am

I hope so too, it’s really strange that the Raspberry Pi 4 is performing so badly.