I noticed a big drop in my egress traffic this month (BTW, when was v1.60.3 deployed?). At first I thought it was just a holiday thing, but then I noticed that refreshing the node stats page takes 1 or 2 minutes, which is abnormal.
Then I noticed that the node was using ~8 GB of memory, and since the Synology is using about 4 GB of RAM, the node must be swapping/caching to disk like crazy, which would explain the delay in refreshing the page and the low egress.
Has anyone noticed this on their Syno?
What could be the problem and how could I diagnose it better?
Regarding your node problems, the most typical scenario in which a storage node consumes a large amount of memory (and, coincidentally, its UI gets slow) is when the drives are not fast enough to cope with traffic. By large I mean more than a gigabyte. This might be the case with SMR drives, write-inefficient parity schemes or btrfs as the file system, and can be worked around by moving the databases to separate, faster storage (maybe an SSD, maybe just a separate drive) and reducing the allowed number of concurrent connections.
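Roughly, the database move looks like the sketch below. The paths are just placeholders, on a Docker setup the new directory also has to be mounted into the container (with the database-dir path given as seen from inside the container), and the exact option names should be double-checked against your own config.yaml:

```
# sketch only -- paths are placeholders; verify the exact option names in your own config.yaml
docker stop -t 300 storagenode
cp -p /volume1/storj/storage/*.db /volume2/storj-dbs/   # hypothetical SSD-backed location
# then add to config.yaml:
#   storage2.database-dir: /volume2/storj-dbs
#   storage2.max-concurrent-requests: 40   # the concurrency limit discussed in this thread; check the exact key name
docker start storagenode
```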
I don't know what kind of diagnostics Synology provides, but on Linux the easiest tool to confirm or rule out this scenario is to run iostat -dmx 30 and observe the %util column. If it is consistently above 50%, that would be it.
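Something like this (device names will differ):

```
# print extended stats for all block devices every 30 seconds (Ctrl-C to stop)
iostat -dmx 30
# watch the %util column for the devices holding the node data (sdX / mdX lines on a Synology);
# values that stay above ~50% interval after interval mean the disks can't keep up with the IO load
```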
Yep. I do see 90s and even 100s in the %util column on the sata and some md lines.
Anyway, I don't really get it. I do use btrfs (on SATA disks, not SMR), but I've always used btrfs and only now am I seeing these high memory problems.
I can't move the DB to a single SSD. I don't trust SSDs. It would have to be a RAID 1. I'm not using 2 bays just to hold the STORJ DB, plus there is the cost of 2 SSDs.
I could reserve 8 GB of RAM for the STORJ Docker container, though. Do you think that might solve everything (in case the low egress is due to the cache growing)?
I recall in the btrfs thread I linked that initially it was also fine, up to some point. Though, in my case I noticed it earlier, as I was annoyed by nodes not shutting down quickly. I don't remember the memory usage at that point. If I had to guess, I'd say that it was because of database updates, as each upload and download triggers an insert or two, and the frequency of downloads grows with the size of a node. As such, I might have just passed some saturation point at which the file system started lagging.
Note that the databases are not critical for storage nodes. They only collect runtime statistics or temporary information whose loss does not prevent the node from operating. Databases can be regenerated at any time, too. Hence storing them on a less reliable medium is not necessarily a bad idea.
I doubt it, though I don't have any specific arguments. If the memory usage grows due to lagging writes, it will also easily grow beyond any specific memory reservation. It'll likely be better to use storage2.max-concurrent-connections, add some block-level caching (like bcache/lvmcache; in theory even offloading IO reads with a writethrough-style cache should help), or maybe (though whether it would help is still hypothetical at this point) wait for someone to implement this change.
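For illustration, on plain Linux (not something DSM exposes) an lvmcache setup is roughly the following; vg0, storj and the device path are placeholders, and the exact flags are worth checking against lvmcache(7):

```
# carve a cache LV out of the SSD and attach it to the data LV in writethrough mode
lvcreate -n storj_cache -L 100G vg0 /dev/nvme0n1
lvconvert --type cache --cachevol vg0/storj_cache --cachemode writethrough vg0/storj
```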
in 95% of all cases excessive memory usage is due to the HDD being unable to keep up.
tho if the storagenode isn't getting enough CPU, that can also lead to excessive memory usage.
(rarely the case, because storagenodes generally don't use much compute.)
the memory usage will just keep going up until the issue is resolved or ingress/egress traffic drops. a single node should use at most 1 or 2 GB of memory and on average stay between 50 and 250 MB.
you could move your databases to SSD or use an SSD as cache for your storage, which is possible within the synology ecosystem.
no idea how to configure that tho. @BrightSilence runs with an SSD cache on his synology storage setup, maybe he can give you a few pointers on how to do this…
personally i would go with the cache, as moving the storagenode DBs to SSD storage creates the issue that if either the SSD or the HDD(s) fail, you will have issues with the node.
however if a cache drive fails, you are basically just back to running without it…
which i would consider preferable.
storagenodes create a ton of IO, mostly for various tasks such as database entries and whatnot… a cache will reduce this amount greatly, by anywhere from 50-60% or more.
or you could consult the synology guides on their website, they seem to have quite a few of those, if memory serves…
Well, I did have a 250GB NVMe in my syno as cache. But it died (a little more than 1 year of life). I might buy another one and test whether the node goes back to normal.
The thing with the syno cache is that you can't program the cache for what you want. You only get to assign it to a specific volume. I have lots of other stuff going on in my syno and I have no way of knowing if the cache is serving the node. Maybe I could create a small volume to hold the storage node DB and then assign the cache to that volume only…
What would be a reasonable size for a DB volume?
Well, I do have bandwidth.db-wal at 97 MB, bandwidth.db at 46 MB and piece_expiration.db at 16 MB.
Anyway, the minimum volume I can create is 10 GB.
So, I will get a 250 GB NVMe replacement to serve as cache for my 10 GB volume filled with less than 200 MB… can't shake the feeling there should be something smarter…
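In case it helps with sizing, the totals are easy to check with something like this (the path is a placeholder for wherever the storage directory lives):

```
# total size of the node's SQLite databases, including the -wal/-shm files
du -ch /volume1/storj/storage/*.db* | tail -1
```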
My database files are sometimes a few hundred megabytes in size. And they are constantly growing, so you should plan for a few GB I think if you're planning long term.
850 million writes at a minimum sector size of 4k would be 3.4 TB.
and the 4.17 billion reads would equal a minimum of 16.7 TB.
so a total of 20 TBW used, minimum; granted, reads aren't always counted as wear, depends a bit on the brand…
it does seem like it's a 4K-sector-based drive, but it's possible you have been running 8k or 16k sectors on it… if so, that would double or quadruple the wear.
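rough math, assuming 4 kB per host read/write (the counters are the ones quoted above):

```
# 850M writes and 4.17B reads at 4 kB each
python3 -c "print(850e6*4e3/1e12, 'TB written;', 4.17e9*4e3/1e12, 'TB read')"
# -> 3.4 TB written; 16.68 TB read  (~20 TB total; double or quadruple it for 8k/16k sectors)
```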
it also doesn't say how much internal data shuffling it has done… it could be that the drive was filled close to capacity, which makes it shuffle data around internally, a lot…
which will lead to the drive dying early.
these drives are basically consumer drives afaik, i wouldn't expect them to last too long…
i'm caching at about 1 MB/s, so 3600 MB an hour.
so 86 GB a day, or 31.5 TB a year.
but that's just the avg over the last 24 days, which have been quite slow…
checked a bit further back and it does seem like during high activity i hit close to 2 MB/s in cache writes.
have seen peaks of +4 MB/s for a single node; i think this is in part due to write amplification from database writes, since any write will take up 4k because of the minimum sector size.
but i digress… i would assume a 1 to 2 MB/s avg cache workload for a storagenode.
so that would wear out a 250 GB 970 EVO in like two years, ofc that's not sustained…
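roughly speaking (150 TBW is the published endurance rating for the 250GB 970 EVO, if memory serves; adjust for the actual drive):

```
# years until the rated endurance is reached at a sustained cache write rate
python3 -c "tbw=150e12; rate=2e6; print(tbw/(rate*86400*365), 'years at 2 MB/s')"
# -> ~2.4 years at 2 MB/s, ~4.8 years at 1 MB/s
```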
not sure i would recommend that as a cache drive.
generally the issue is that much of what gets cached is never used again and just gets thrown out.
depends on the cache mechanics, i suppose… so YMMV
for cache i would use a high-endurance enterprise drive with PLP and good write iops.
ofc the iops you need depends more on the amount of workload you want to put on it.
i like having good Q1D1 performance, but that can be tricky.
have had good luck with older intel enterprise drives; their DC 3600 series is quite excellent.
i got a 1.6TB Intel DC P3600 i think… tho it might be a DC P3610; it has a rating of >40 PBW.
you can also get the DC S3600, which is the SATA version… might be able to get an m.2 version of the newer ones, like say the DC P4600, but its rated write endurance is much lower… still in the 5 or so PBW range for a 1.6TB SSD.
the DC S3600 comes in a 400GB version with like 5 PBW or more i guess… maybe it's 10, which would fit with the 1600GB version having about 40 PBW, since PBW scales linearly with capacity.
been a while since i looked at the exact specs for them.
i also have a DC S4600 which, oddly enough, is slower in iops than the DC P3600.
this might be down to latency… due to the higher bandwidth of the PCIe version.
the DC 4600 series should far exceed the DC 3600 series in performance… and even tho SATA can keep up with the iops on paper, it ends up doing fewer iops, even when the bandwidth used is only like 10-20 MB/s.
didn't really expect that… and no, i'm not hitting the SATA iops limit, since that is more like 120k iops… but for some reason, at Q1D1 it caps out much earlier on the SATA drive than on the PCIe one, even tho the 4600 series is far superior…
so yeah i dunno… maybe it's the SATA controller throwing me a curve ball…
i also got a Fusion-io 1.6TB PCIe SSD, but i have turned that into an L2ARC drive, because it was wearing out too quickly, even tho it has like 5-something PBW.
but i do load my cache pretty heavily…
cache is a heavy workload… and should be well considered in advance.
on another note… i don't think your drive wore out…
because it never used any of its reserves… which is kinda weird…
usually it would get bad sectors/blocks from wear and then draw on its 10% reserve capacity,
until it ran out, and then it would be worn out…
which can be further extended by overprovisioning… but not sure if that's possible on consumer drives… depends on their firmware/controller i think.
I don't think my ssd was used up. It just went bad.
The ssd was always at full capacity… after a few days, of course…
Anyway, I bought it from amazon for 80€. They offered a refund. I bought a new one (exactly the same) for 45€. So… good thing…
tried the overprovisioning thing with my fusion iomemory ssd, to try and make it last longer… but it didn't seem to matter… not sure if i did it wrong or something…
didn't exactly follow any online guides for it… but i can't imagine how they would do it better than what i did…
my main problem with it was that it started doing a ton of internal writes, wearing itself out even tho it was running at quite low sustained writes and wasn't full…
all this stuff is always a very educational experience lol, and there are always more subjects to expand upon.
basically the idea with overprovisioning comes down to the fact that SSDs shuffle data around internally, to convert it from SLC to MLC, TLC or QLC states… this means lots of extra writes, and the less room the SSD has to work with, the more it has to shuffle data around.
so overprovisioning is like setting a 1.6TB SSD to, say, 1TB of usable capacity, to make sure it has enough free space to easily move data around internally.
like say from SLC to MLC: that takes 2 GB of cells for each 1 GB of data staged in SLC mode, and with TLC or QLC it gets even worse… TLC being 3 to 1 and QLC 4 to 1.
so each 1 GB written takes up 4 GB of QLC capacity (running in SLC mode) until it is rewritten/converted into 1 GB of QLC.
so one can easily see how a disk could quickly run out of capacity internally when doing that…
upload 10 GB and it takes up 40 GB, and the drive needs a further 10 GB before it can release the 40 GB of temporary data.
and if it then starts to run low on capacity, all the data gets scattered into all the little free blocks, making the workload more demanding because it gets closer to random reads/writes.
and then on top of that, it might need to split the work into batches due to lack of space… which starts to wear the drive.
SSDs are not always better than HDDs, even tho they are in most aspects lol.
the wear on SSDs certainly is very different and something we don't think much about.
The SSDs will keep dying; on Synology, SSD caching is a poorly thought out, misleading marketing gimmick. My advice would be not to use it at all and to add more RAM instead.
There are other problems with it, besides significant wear.
Got the point…
Never thought SSD caching on synos could be such a bad idea…
Anyway, now I have a brand new 250GB NVMe. I've decided to start by assigning 100GB as cache to the main volume, not to the dedicated storage node database volume, which I haven't created yet. I wanted to test whether an RO cache on the main volume would take care of my problem.
It kind of did. As soon as I restarted the NAS, the "RAM" (fake!) of the storage node container started to rise, and so did the used cache on the NVMe. The container "RAM" eventually went to 6GB, but then started to decrease (as the NVMe cache grew to 50GB) and now sits at 1.5GB. This is still too high, especially since I've read that the whole syno Docker is limited to 2GB of real RAM. Hopefully it will decrease more…
The problem with the main advice in the links you've shown ("Increase system RAM instead of using nvme cache!") is that the syno Docker won't use all the system RAM. I have lots of free RAM waiting to be used, but the containers won't touch it.
Regarding expensive Intel cache SSDs: since I'm running this just for storj, it makes no sense to buy one. I've never bought anything for storj. Even the NVMe was originally intended for a desktop. I put it in the NAS because in the end I didn't need it for the desktop. It was just lying around…
If you buy stuff specifically for storj, you're wasting money…
Regarding your advice, I don't see how to disable checksums on a volume. As far as I know, you can only disable them per shared folder (?!?!).
And how would a different pool layout help? Can you give me an example?
PS - Reloading the storage node stats page takes 1 second now…
Oh, that's another can of worms. Synology uses an outdated fork of Docker that exhibits tons of issues, including mis-reporting memory use. Don't believe what it reports.
On one hand, you don't have to use Docker: storj binaries are self-contained executables (like anything else written in Go) and therefore don't benefit from the dependency isolation containers provide. You can run the storage node natively (with systemd on DSM7 or upstart on DSM6).
On the other hand, you can ignore it: the free RAM was never intended for the container to use. It is supposed to stay unused so it can serve as filesystem cache, offloading some of the random IO from the disk subsystem, hopefully enough for the disks to handle the rest of the IO.
BTRFS allows controlling checksumming at very fine granularity (IIRC at the file level), but Synology only allows control at the sub-volume granularity, and only at sub-volume creation time. Synology calls btrfs subvolumes "shares". Your best bet, to avoid yanking the carpet from under DSM and risking breaking some assumption it may have made, would be to create a new "share", not enable "consistency verification" on it (IIRC that is what they call checksumming), and then copy the storj data from the old share to the new one.
That is likely less applicable to your specific case, but if you still don't get enough random IO performance, you may want to consider RAID10 instead of RAID5/SHR1. This comes dangerously close to the "wasting money" you mentioned: unless your regular workload benefits from RAID10, you would be wasting time and space, and as a result money, doing this just for storj.
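If you do want to go the native route on DSM7, a minimal systemd sketch of what that could look like is below; the paths are placeholders, this is untested on DSM, and the storagenode binary simply takes its settings from --config-dir:

```
# rough sketch of a native service on DSM7 (run as root; adjust paths to your setup)
cat > /etc/systemd/system/storagenode.service <<'EOF'
[Unit]
Description=Storj storage node
After=network-online.target

[Service]
ExecStart=/volume1/storj/bin/storagenode run --config-dir /volume1/storj/config
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now storagenode
```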
bought mine used, at like 1-2$ per 10GB a good while back, so really they are not more expensive than new, less durable SSDs. generally the used SSD market seems to be priced mostly on capacity, and new SSDs set the pricing.
ofc there are some supply and demand considerations; some of the drives i bought have tripled in value since i got them.
Mirrors are not a good solution for storagenodes, as most of the workload is writes.
for example, right now i'm seeing 24 write IOs for every 1 read IO.
mirrors are not really faster for writes; in fact they are often slower, because both disks want to write in sync, and because their reads are not in sync, they often wait for each other.
ofc this is where the raid 0 in raid10 comes in, so it's a stripe across both mirrors, but again, a stripe across two mirrors doesn't give the storage more IOPS, which is the main limitation for storagenodes.
so long story short, even a raid10 wouldn't be a big performance increase, because each write goes to all disks at the same time; the only advantage of raid10 is the raid0 part, which will double the bandwidth and thus halve the time some IO takes to complete.
but i would expect this to be a very, very limited uplift.
similarly, read caching is mostly of little use here, since the majority of storagenode IO is writes.
Yep, it makes sense. Most storage node IO should be appends to databases. I think I'll wear this SSD a bit and then buy another one (so that both won't die at the same time), so I can use a read-write cache. I acknowledge the recommendations against it, but I've got to do something and couldn't come up with a better solution. This time, however, I won't repeat the mistake of using the whole SSD for caching. I'll use 50%…
In the meanwhile, I think synology is using the SSD cache differently. Previously it would just fill up regardless of cache size. Now it seems its algorithm got smarter, or they just added a simple rule stating "whatever cache size the stupid ignorant user sets up, use only 2/3 of it". I'm saying this because, out of the 250GB NVMe, I gave the syno 100GB to play with. It's been a few days and the reported cache usage remains pretty constant at 65.9GB, even though I threw some heavy Plex database action at it.
I've noticed since then that my reported container memory has increased again (~2 or ~3 GB). Anyway, %util is sustained at ~50% instead of the ~90-100% I had without the NVMe RO cache. Also, access to the storage node status page is fast.
I know you can't really trust the "used container memory" shown by synology Docker, but so far it correlates pretty well with %util and the access time to the status page (RO cache!). When I hit 5-6 GB of container used memory, everything turns to shit.