SMR detail question

Hi team,

I have read so many threads about this.
Just want to make sure I got it right.

First, I have a battery in front of my Synology.

I understand that the problem occurs when a customer upload can no longer be taken by my disk, because of load or other issues.

But there is the buffer size in the config.
I'm guessing the write acknowledgement is sent as soon as the file has arrived in RAM, right?

That would mean a problem only occurs if my disk cannot take the files from RAM fast enough, AND the RAM is completely full.
Is that the case?

I have a 4 TB SMR WD Blue with 256 MB cache and a Synology DS220+ with 2 GB RAM for two hosts.
I'm guessing two SMR disks are better than one, as they split the uploads :smiley:

The buffer is set to 1 MiB; do you think this will work?

Thanks!


Ah, and another question.
Is it possible to put all logs into a RAMdisk?

They would be deleted after a host reboot, but it should also reduce the stress on an SMR disk.


That's my guess too, because when an SMR drive is stalling, RAM usage usually climbs dangerously.

Well, at least that's what I experienced when my SMR drive could not keep up with the load. That happened during heavy tests in the past; that particular situation has never happened since. Which doesn't mean it will never happen again though :slight_smile:

The buffer size will not help much. Even though in theory it's better to have it set to 1 or 2 MiB, I tried many values, but none would prevent the disk from stalling, the RAM from filling up, and eventually the node process from being killed by the OOM killer.

Some improvements have been made to the storage node software since then, such as one of the databases being moved to RAM, among other enhancements, so things are probably better now.
Also, there are some minor things that can be improved, like moving databases and logs to another disk, as you mentioned.

But ultimately, if the disk cannot keep up with the incoming ingress load, there's not much that can be done apart from throttling down the number of pieces accepted in parallel (or switching to a CMR disk, obviously ^^'). Which is far from ideal, but AFAIK there's no other solution yet.

Sure, logs can be redirected anywhere you'd like, so it's doable if you set up a RAMdisk, but it would reduce the amount of RAM available to the system.
For your information, right now my nodes hold roughly 5 TB of data and their logs for the month of January take 735 MiB in total, and that was a quiet month. So the RAMdisk could not be too small; depending on how much RAM you have available for running Storj-related stuff, this might not be ideal. It would be better to find some space on a spare disk somewhere ^^'
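If you do want to try it anyway, here is a rough sketch of how it could look with a Docker setup (the mount point, size and container options are placeholders, not a recommendation):

```sh
# Create a small RAMdisk for logs (its contents are lost on reboot)
sudo mkdir -p /mnt/storj-logs
sudo mount -t tmpfs -o size=1g tmpfs /mnt/storj-logs

# Expose it to the container and point the node's log output at it
# ("..." stands for your usual run options)
docker run -d --name storagenode \
  --mount type=bind,source=/mnt/storj-logs,destination=/app/logs \
  ... \
  storjlabs/storagenode:latest \
  --log.output=/app/logs/node.log
```

As said above, 1 GiB might be on the small side for a busy month, so size it accordingly.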


Thanks for the quick response!

For testing I have a buffer size of 4 MiB.
What I just saw is that the HDD sits at 3-25% usage while the RAM grows to about 300 MB.
Then the disk goes to 100% and, in parallel, the RAM drops back to 150 MB within seconds.

I guessed that every 4 MiB block would be written as soon as it is full?
That should mean the average use of RAM and HDD would be more balanced than it is.

I'm not sure why it works this way, but it does it all the time.
Is there maybe a maximum RAM buffer that can be configured?

So putting logs on a RAMdisk would not really be helpful.
Maybe going down to warnings. :thinking:
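In case I go that route, the node has a log level option; a minimal sketch (appended to the usual docker run command, shortened with "..."):

```sh
# Log only warnings and errors instead of the default info level
docker run -d --name storagenode ... storjlabs/storagenode:latest \
  --log.level=warn
```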

I'm not seeing such patterns on my end, so I cannot tell whether it's normal/okay or not.
My RPi 4B running my nodes has been steady at 366 MiB of RAM usage for the past few minutes.
What's the time window of this behavior?

I don't think that's how it works. When you mentioned the "buffer size in the settings" in your first post, you were referring to --filestore.write-buffer-size, right?
This setting defines how much space is allocated in memory for writing each piece.
If you receive 10 pieces in a very narrow time window, a total of 40 MiB will be allocated for writing those 10 pieces to disk.
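For reference, that setting can be passed as a flag (or set in config.yaml); a sketch assuming the 4 MiB value you tested, with the exact value syntax possibly differing slightly:

```sh
# Per-piece in-memory write buffer; with e.g. 10 concurrent uploads,
# roughly 10 x 4 MiB of RAM can be allocated at once
docker run -d --name storagenode ... storjlabs/storagenode:latest \
  --filestore.write-buffer-size=4MiB
```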

I don't think there's any option for setting a maximum RAM usage the node software shouldn't exceed.
Which in my opinion is a shame, because it would be an efficient way to provide a safety net against stalling disks like SMR ones.
With such an option, the node software could receive as much data as customers send, as long as the storage device can handle it. But if RAM were to reach a threshold because the disk is getting overwhelmed, it could start refusing new pieces automatically to give the disk time to handle what's already on its plate.
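The node itself offers nothing like that, but just to illustrate the idea, here is a crude external approximation (purely a sketch; the container name, threshold, polling interval and the use of docker pause are my own choices, and pausing also kills in-flight transfers, so I wouldn't run this as-is):

```sh
#!/bin/sh
# Pause the node container when its RAM usage exceeds a threshold and
# resume it once usage drops again. Assumes docker reports MemUsage in MiB;
# unit handling is omitted for brevity.
THRESHOLD_MB=1024   # arbitrary example threshold
while true; do
  # docker stats prints usage like "312.4MiB / 1.944GiB"; keep the first number
  USED_MB=$(docker stats --no-stream --format '{{.MemUsage}}' storagenode \
            | awk '{print int($1)}')
  if [ "$USED_MB" -gt "$THRESHOLD_MB" ]; then
    docker pause storagenode 2>/dev/null
  else
    docker unpause storagenode 2>/dev/null
  fi
  sleep 30
done
```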


Do you know that if the PC crashes or reboots, all that data will be lost? And since the node has already confirmed that it got those pieces, it could be disqualified for losing them.

The only known solution for SMR disks is to run more than one node in the same /24 subnet. You will reduce the load by a factor of two or more, to a level which the SMR disk can handle. If a second node is not enough, run a third... Obviously, each node should point to its own disk.
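For example, a second node on the same machine is just another container with its own identity, its own disk and its own external port (all paths, ports and the DDNS name below are placeholders):

```sh
# Second node: separate identity, separate disk, separate external port
docker run -d --name storagenode2 \
  -p 28968:28967/tcp -p 28968:28967/udp \
  -e ADDRESS="your.ddns.example:28968" \
  --mount type=bind,source=/mnt/disk2/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/disk2/storj,destination=/app/config \
  ... \
  storjlabs/storagenode:latest
```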

Why does the node confirm that it has the piece without making sure that it is on the drive first?

Shouldn't piece writes, and some writes to the database, be done in "sync" mode, so that if the server crashes or reboots at any time, no piece gets lost because it was confirmed to the satellite but not yet written to disk?

They should, but then the RAM would not be used as much - pieces would have to be flushed to the disk before being confirmed. And we would again have a stalled SMR.

This is the main problem - you can never have enough RAM to keep receiving pieces while the disk is unable to write them. If the node waited for the disk, it would get cancellations from the customers.
So the buffer does not help much. A big buffer will not help either, because of the long-tail cut and the slow disk. Alternatively, the node could confirm the piece as soon as it is received, but then there is a high risk of losing it if the storagenode gets killed by the OOM killer or the system reboots.

Maybe there should be an SSD as a write cache for the data, and then something should slowly move it to the SMR disk.
Please take a look at
https://forum.storj.io/t/looking-for-help-to-set-up-raid6-dual-ssd-cache/10556/7?u=alexey
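Outside of zfs, one way to do that is an LVM writeback cache; a minimal sketch, assuming a volume group vg with the node data on vg/storj and a spare SSD (device names are placeholders, and writeback caching adds its own risk if the SSD dies):

```sh
# Put an SSD in front of the SMR-backed logical volume as a writeback cache
sudo pvcreate /dev/sdX                      # the SSD
sudo vgextend vg /dev/sdX
sudo lvcreate -L 100G -n cache vg /dev/sdX  # cache volume on the SSD
sudo lvconvert --type cache --cachevol vg/cache \
     --cachemode writeback vg/storj
```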


But you get better data integrity on nodes with CMR drives.
Or nodes with an SSD write cache (zfs SLOG or similar).
Or nodes with RAID controllers that have a battery-backed cache.

I would rather have less data than no data after an unexpected reboot.

After an unexpected reboot you typically lose something like 10 seconds of data. On a reasonably sized node that will be completely irrelevant and your audit score will remain good.
Of course, with a stalling SMR drive this might ramp up to (I don't know) 5 minutes, which at an ingress of 10 Mbps is ~375 MB, or about 250 pieces at an average piece size of 1.5 MB. Still irrelevant if you have a 1 TB node hosting >600k pieces: in that case you would have lost 0.04% of your files. That's okay if it doesn't happen too often.

With zfs you can choose to use sync=always on the node dataset, which will make sure all your files end up on the HDD.
Not sure if other systems/filesystems have similar options.
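On zfs that's a one-liner (the dataset name is a placeholder):

```sh
# Force synchronous semantics for every write on the node's dataset
zfs set sync=always tank/storagenode

# Verify the current setting
zfs get sync tank/storagenode
```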

They do. However, then you need to move the database somewhere else. The reason is that there are lots of writes to the database that are probably not that important (maybe something very short-lived etc.; if it were important it would already be synced), so mounting the filesystem with the sync option results in a lot of IO.
I tried this some time ago; maybe it would be different now, because orders were moved to RAM and such.

With an SMR drive you should move your databases somewhere else (well, to an SSD, or at least another CMR drive) anyway. Besides, database writes are always sync.
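For reference, the node has an option to keep its databases on a separate path; a sketch with Docker (the SSD path is a placeholder, and the existing *.db files need to be moved there first while the node is stopped):

```sh
# Mount an SSD-backed directory into the container and point the databases at it
docker run -d --name storagenode \
  --mount type=bind,source=/mnt/ssd/storj-dbs,destination=/app/dbs \
  ... \
  storjlabs/storagenode:latest \
  --storage2.database-dir=/app/dbs
```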

I don't think so; at least this was not true in the past. If db writes were always sync, then, since sqlite supports journaling, there should be no database corruption after an unexpected reboot.

I remounted the data partition as sync and will see how it goes. In the past it used to increase IOPS by a lot.

https://www.sqlite.org/asyncvfs.html

Unless Storj uses this async sqlite module, all writes to the db are sync. That's how every database works; everything else would just be a recipe for disaster.

As to why the DBs tended to get corrupted by unexpected reboots: I'm not sure... most filesystems just aren't that robust, and an sqlite db itself might not be the most robust either, because of all the performance optimizations used by Storj, I guess.
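If you're curious, you can check what journaling and sync mode a node database actually uses with the sqlite3 CLI (the path is a placeholder for one of the node's db files; do this while the node is stopped):

```sh
# Inspect journal mode and synchronous setting of a node database
sqlite3 /mnt/storj/storage/bandwidth.db \
  "PRAGMA journal_mode; PRAGMA synchronous;"
# synchronous: 0 = OFF, 1 = NORMAL, 2 = FULL
```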

Yes, because now it can't just flush every 10 seconds, but needs to write every small piece of buffered file to the disk immediately, and without a SLOG it will have to do that twice. So it's a horrible option for SMR drives.
But if you're using a SLOG, the performance should still be very good, since ZFS basically still flushes from RAM every 10 seconds; the sync only writes the file to the SLOG immediately. So with a SLOG it could still be fine for an SMR drive and not make much of a difference.
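Adding a SLOG is straightforward if you have a spare low-latency device (pool and device names are placeholders; ideally use an SSD with power-loss protection):

```sh
# Attach a dedicated log device (SLOG) so sync writes land on the SSD first
zpool add tank log /dev/disk/by-id/nvme-example-part1

# Or mirror it for safety
zpool add tank log mirror /dev/disk/by-id/ssd1-part1 /dev/disk/by-id/ssd2-part1
```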

Which probably included doing some async writes.

However, on average it would be the same MB/s to disk, whether it is written immediately or flushed every few seconds. IIRC, it increased. So I assume some writes to the database were very short-lived: something got written, then changed again before it could even be flushed to disk.

I agree, it would coalesce writes.

Seems right. As soon as I set my Storj DB dataset to sync=always, I immediately got more reads and writes on my SSD. So I guess they might have tampered with async writes... risky for a db...
(I was surprised about the increase in reads, but it looks like changes don't immediately end up in the ARC when using sync=always; they need to be written first and will then be read back into the ARC? Note: my SSD doesn't use a SLOG, that would be kind of pointless...)

Since the db is constantly written to, its content often changes within the 10 seconds between flushes. With async writes (and thanks to ARC caching), most operations are done in RAM and you only get the end result on the disk after 10 seconds. But if every change has to be written to the disk, you get a lot more data over the same period.

Async writes can take a long time before they are written to disk; it's my understanding that these delays are what is presented when doing a zpool iostat -w.

But the numbers don't make sense to me. My pool:

>  zpool iostat -w
> 
> bitlake      total_wait     disk_wait    syncq_wait    asyncq_wait
> latency      read  write   read  write   read  write   read  write  scrub   trim
> ----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
> 1ns             0      0      0      0      0      0      0      0      0      0
> 3ns             0      0      0      0      0      0      0      0      0      0
> 7ns             0      0      0      0      0      0      0      0      0      0
> 15ns            0      0      0      0      0      0      0      0      0      0
> 31ns            0      0      0      0      0      0      0      0      0      0
> 63ns            0      0      0      0      0      0      0      0      0      0
> 127ns           0      0      0      0      0      0      0      0      0      0
> 255ns           0      0      0      0      0      0      0      0      0      0
> 511ns           0      0      0      0      0      0      0      0      0      0
> 1us             0      0      0      0  2.34K  34.7K    300     33      0      0
> 2us             0      0      0      0  1.24M  2.04M   385K   468K     17      0
> 4us             0      0      0      0  2.34M  6.99M   437K  3.27M     75      0
> 8us             0      0      0      0   127K   852K  29.5K   710K      4      0
> 16us            0      0      0      0  12.4K  43.3K  2.69K   173K      0      0
> 32us            0  1.90K      0  4.99K  10.5K  7.77K  2.56K   202K      3      0
> 65us          706  2.08M    808  2.36M  4.87K  2.83K  2.27K   354K      3      0
> 131us       5.85K  4.06M  6.56K  3.95M  5.08K    635  2.94K   730K     11      0
> 262us        241K  3.50M   256K  9.09M  8.23K     20  4.18K  1.49M     18      0
> 524us        648K  3.26M   658K  23.1M  10.2K     10  3.78K  2.30M     32      0
> 1ms          180K  4.94M   189K  4.89M  9.54K      4  4.78K  2.97M     27      0
> 2ms         71.7K  4.70M  74.3K  3.13M  7.26K      4  3.40K  3.57M     14      0
> 4ms          124K  4.39M   151K  1.41M  11.7K      2  6.30K  3.64M     18      0
> 8ms          469K  4.16M   581K  1.97M  23.8K      1  25.0K  3.64M     33      0
> 16ms        1.16M  3.95M  1.32M  3.04M  46.2K      0  68.4K  3.51M     58      0
> 33ms        1.18M  4.58M  1.27M  1.78M  70.6K      0   126K  4.23M     70      0
> 67ms         730K  6.25M   647K   940K  72.1K      0   139K  5.79M     51      0
> 134ms        499K  6.66M   313K   355K  37.9K      0   120K  6.05M     30      0
> 268ms        183K  3.20M  55.5K  46.2K  11.1K      0  60.8K  2.80M      3      0
> 536ms       28.8K   259K  4.03K  3.75K  5.50K      0  11.2K   214K      0      0
> 1s          6.68K  23.7K    287    919  3.06K      0  2.54K  21.5K      0      0
> 2s          1.59K  7.05K     89    311  1.26K      0     12  6.46K      0      0
> 4s             36  2.61K     25    188      5      0      0  2.38K      0      0
> 8s              0  1.18K      0    271      0      0      0  1.09K      0      0
> 17s             0    610      0      0      0      0      0    563      0      0
> 34s             0    327      0      0      0      0      0    309      0      0
> 68s             0    147      0      0      0      0      0    137      0      0
> 137s            0     30      0      0      0      0      0     28      0      0
> --------------------------------------------------------------------------------

The SLOG flushes every 5 sec, but that is adjustable... the default is 5 though.
Then how can my pool, with all its datasets running sync=always (maybe I should check whether that is actually true for all datasets), still show entries higher than 5 sec in this
zpool iostat -w output...
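For what it's worth, the interval at which ZFS commits transaction groups (which is what the async flushing rides on) is a module parameter on Linux/OpenZFS, so it can be checked and adjusted:

```sh
# Current transaction group commit interval in seconds (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# Change it at runtime (example value; persist via modprobe options if wanted)
echo 10 | sudo tee /sys/module/zfs/parameters/zfs_txg_timeout
```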

And it's certainly not because a write takes 137 sec or more... even the 10 sec mark seems rather extreme...
I get intermittent writes of maybe 50-80 MB/s,
but it will peak at like 1.2-1.3 GB... sure, we need to look at IOPS... but for sync IOPS I should be able to do around 2000, maybe even 3000 raw HDD IOPS...
dual raidz1, so maybe 2x1000 sequential write IOPS.
The reason I say 2k is because I've seen it do that before... I even think I've seen it get close to 3k, but that was when I was running a 3x raidz1 pool... though with SATA SLOGs.

Yeah, I checked: Bitlake has every dataset, and the pool itself, set to sync=always.