Creating an identity on the storagenode host seemingly disrupted the existing storagenode identity

I was in the process of creating a new identity when I ran into this issue…
upon completion, or within minutes thereof… I can't say exactly because I didn't have timestamps on my terminal… tsk tsk, rookie mistake, I know :smiley:

but I have never seen this issue before, and though my machine is doing a little work… nothing that should cause this… I think…
it should also be pretty simple to replicate on a computer that's faster than mine at generating keys, if there is indeed a problem… there seems to be for me…

or maybe I used some folders I wasn't supposed to use… but I don't think so… my identity is located in the storagenode folder along with the config.yaml and the storage/blobs folder

curl -L https://github.com/storj/storj/releases/latest/download/identity_linux_amd64.zip -o identity_linux_amd64.zip

unzip -o identity_linux_amd64.zip

chmod +x identity

sudo mv identity /usr/local/bin/identity

I ran this and used the same folders as instructed in the amd64 Linux guide
at https://documentation.storj.io/dependencies/identity

then I ran this, as instructed:
identity create storagenode

and it started generating keys just fine… green lights across the board

identity create storagenode
Generating key with a minimum a difficulty of 36...
Generated 123683548 keys; best difficulty so far: 37
Found a key with difficulty 37!
Unsigned identity is located in "/root/.local/share/storj/identity/storagenode"
Please *move* CA key to secure storage - it is only needed for identity management and isn't needed to run a storage node!
        /root/.local/share/storj/identity/storagenode/ca.key

when it finished, I noticed the storagenode log was flashing red. I shut the node down, and after checking the logs I started it back up again… with seemingly no issues.
It is now, 20 minutes later, still running without issues…

I don't know what happened, but I can only assume that the creation of the new identity somehow disrupted the existing identity running on the storagenode…

I'm running proxmox (debian buster), which hosts docker directly on the host where I also generated the identity, though I will state again… the identity locations are completely unrelated to each other, afaik… and I did, after all, configure it that way… :smiley:

pretty sure this could have killed a node… but I can only speculate; I'm happy I caught it…
I suppose this isn't really a troubleshooting request, but more of a bug report…

It would be nice if somebody else could confirm this issue.

If by “issue” you mean the “database is locked”, then it’s unrelated to the identity generation.
It's a known issue: https://forum.storj.io/search?q=database%20is%20locked
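For reference, the error string itself is just SQLite lock contention between two connections to the same database file. A minimal Python sketch (the table and path are made up for illustration):

```python
import os
import sqlite3
import tempfile

# Minimal repro of a "database is locked" error: one connection holds
# the write lock while another connection's write times out.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, timeout=0.1, isolation_level=None)
other = sqlite3.connect(path, timeout=0.1)

writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")  # take and hold the write lock

try:
    other.execute("INSERT INTO t VALUES (1)")
except sqlite3.OperationalError as e:
    print(e)  # -> database is locked
```

Anything that stalls the writer long enough (IO wait, a stuck transaction) makes every other connection hit this timeout.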
If you moved your previous identity from /root/.local/share/storj/identity/storagenode to somewhere else (I believe you did, otherwise the generation would have failed with a “file already exists” error), then there is nothing unusual.
By the way, you can use a different name for the identity to avoid confusion:

identity create storagenode2

for example, and then use this new name to sign the identity.
The path would be /root/.local/share/storj/identity/storagenode2.

well, I've not had a problem with this for a long while… and never like this before…
it seemed like I was locked out of all databases…
and it happened within a couple of minutes of the new identity generation completing…

I could of course just move the new identity and generate it again, just to confirm that it will disrupt my running storagenode…

yeah, I did consider the storagenode2 option… but I figured it wouldn't make a difference, since the storagenode is inside docker and its identity is in a custom location…

If you use the same disk for the identity generation and for the data pool, then that could be a reason, of course.
I mean: if the databases are on the same disk as your temp files, system files, and the generated identity files.

pretty sure the files were on my OS drive, which has its own separate pool / partition / logical volume or whatever it's called :smiley: while all the storagenode folders and identity files are located on my big raidz storage pool / partition / logical volume

they may share some data in the ARC because I'm using zfs… so it could be a zfs-only issue… but I can only speculate… I'll try to reconfirm the issue tonight or tomorrow, when I have more time to rerun the identity generation while keeping an eye on the storagenode.

then, if I can replicate the issue, the next logical step would be to try to replicate it on other debian / linux distributions using other filesystems.

I'm generating a new identity on my node right now; it has been running for about 15 hours (slow arm processor) and I haven't seen anything unusual.
I moved the identities of the other two nodes to the hard drive where the data is located, so they shouldn't interfere with the new identity.
Best key difficulty is 35, so not too far from completion, with almost 200 million keys generated.

I’ll see if anything happens when the identity is generated.

yeah, I was lucky to get it at 128 million tries; it took like 15 or 20 minutes…
last time it took a very long time… I think… but I went to sleep, so I'm not exactly sure how long… are you sure the odds are good for hitting it within a week…?
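The timing really is a lottery. A back-of-the-envelope sketch, under my own assumption (not from Storj docs) that each candidate key independently meets the difficulty target with some fixed small probability, making the number of tries geometrically distributed:

```python
import math

# Assumption (illustrative only): each generated key independently meets
# the difficulty target with probability p, so the number of keys needed
# follows a geometric distribution -- which is why one run can finish in
# minutes while the next takes days on the same hardware.
def tries_for_quantile(p, q):
    """Number of keys generated before succeeding with probability q."""
    return math.log(1.0 - q) / math.log(1.0 - p)

p = 1e-8  # purely made-up per-key success probability
print(round(tries_for_quantile(p, 0.10)))  # a lucky run: ~10.5 million keys
print(round(tries_for_quantile(p, 0.90)))  # an unlucky run: ~230 million keys
```

With a memoryless distribution like this, a 20x spread between lucky and unlucky runs is completely normal.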

I suspect you will be affected… of course an arm platform might not run the exact same code… but I doubt that will matter…

I would say 60–80% odds that upon completion you will be locked out of the database… you should set up one of BrightSilence's tail scripts on an external log file and have it reboot the node if it runs into an error… of course that sort of requires not having errors in the first place :smiley:
which isn't always the case…

but if I were you, I wouldn't take the chance without a failsafe… because who knows what happens if it runs for hours with the db locked to all requests…

and it might take days for your identity generation to complete.
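The failsafe idea could be sketched roughly like this (a hypothetical sketch, not the actual tail scripts mentioned above; it assumes the node logs to a file and runs in a docker container named storagenode):

```python
import subprocess
import time

def looks_locked(line):
    """True if a log line contains the locked-database error."""
    return "database is locked" in line.lower()

def watch(log_path, container="storagenode"):
    """Follow the log file and restart the container on the first hit."""
    with open(log_path) as f:
        f.seek(0, 2)  # jump to the end; only react to new lines
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if looks_locked(line):
                subprocess.run(["docker", "restart", container], check=False)

# watch("/var/log/storagenode.log")  # runs until interrupted
```

A real script would also want rate limiting so it doesn't restart the node in a loop, but even this crude version bounds how long the node can sit with locked databases.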

I did generate another identity on this system about a month ago and it all went smoothly. It was also quite a bit faster, from what I remember: I launched it, went to sleep, and it was done in the morning.
I'm home for the afternoon, so I'll keep an eye on the colored logs; way easier to spot errors than with the standard logs!

of course it could just be my system being weird… I haven't confirmed I can replicate it yet…
or arm is not affected… you are running a version of debian… so I suppose that's one point for ruling out the OS as a cause…

I don't suppose you run zfs on that little thing?

nope, no zfs. The identity still isn't finished; I almost regret not creating it on my laptop…
Getting close to 220 million keys generated.

well, you can always just quit and generate it somewhere else… that little RPI might take an eon to generate it… I'm pretty sure I waited for hours the previous time… and if my server can beat the RPI by doing it in 15–20 minutes, or even say 30, that's 36 times faster… and it took me at least 3–4 hours before I went to sleep…

that's 4–6 days, and I guess it could take longer… but this time it was fast, so you are in the range where it's possible… but it might easily be 1–2 days more, lol, maybe, I dunno… just a gut feeling…

it also depends on what kind of cpu power you've got to replace it with…

I finally gave up on generating the identity on the Odroid; my laptop has an i7-8565U quad core, so it should be way faster than that little arm processor.
Probably not as fast as your two Xeons though…
I should really spin up the old Dell R710 that's sitting underneath my bed, but it's so noisy that I don't really know where to put it. I can hear it from the kitchen even when it's in the guest bedroom with the door closed… I'll eventually get around to building a home server with some decent specs, but I still have to convince my wallet to let me do that…

EDIT: 1h30 later and my identity is generated!

lol

I tried generating an identity once on my RPi 4B, and it took only a few tens of minutes. At the time I did not know it, but I got very lucky, because the next time I tried I gave up after 8 hours and spun it up on my i7-4720HQ instead; it took less than 10 minutes, with a result after only 25 million keys :slight_smile:

But yeah, small machines are not a good fit for generating identities. They’re slow, and while generating there’s not much CPU left for other programs like the node software…

well, my Xeons are 10 years old, so even though they have 16 threads between them, it's only at 2.13 GHz… so if I were to hazard a guess, they are most likely fairly evenly matched… of course Xeons have a lot of cache, which makes them much faster for some stuff…

I find that looking at frequency and threads gives a pretty nice ball-park estimate for general computing tasks.

so my CPUs have double the threads but half the frequency… nobody is running laps around the other, that's for sure… but it's a very rough estimate…

Seems you’re really jumping to conclusions here. The database is locked issue pops up mostly on systems with either db issues, file system issues or IO bottlenecks. Since you’ve been going on about a drive acting up in your array in other topics, my bet is on IO bottlenecks for your setup. The identity generation really has nothing to do with it, other than at best using some more CPU cycles which makes your IO wait even worse. Knowing you, you were probably also running a scrub, making IO even more bottlenecked.

maybe you are right… but I've still never seen the system lock me completely out of all the databases… and it happened suspiciously close in time…
I've seen db locked in the past… but these days my logs have been error-free for days at a time… that disk was going kind of crazy though; the record was 8 s latency, and it's now back below 15 ms again…

I also hadn't had a reboot for a few weeks… so there might have been some loose threads floating around in the host… but I'll try to confirm the bug and see if it happens again… at least now everything is back to normal, and I haven't turned anything on aside from the storagenode…

also, when it happened I did check the iowait, and though it was in the high end of the spectrum, it wasn't worse than it has been while the storagenode ran fine… of course monitoring doesn't take everything into consideration, so just because I cannot see it doesn't mean it isn't what's happening…

I also verified my disk access, and nothing else seemed to stall; a docker stop and docker start of the storagenode made the issue completely vanish, and the storagenode then ran almost without a single cancelled download… or upload

which also seems to go counter to it being a system issue… what happened, I cannot tell you…

I can tell you I've never seen it act up like that before: I seemed to be locked out of all databases, and it happened within 2–3 minutes of the identity generation finishing… and I had used the defaults for everything…

I was waiting for the generation to finish because I hoped it would complete quickly, so I could continue setting up a new node and see how a storagenode would run from the container setup I have ready for my primary node

I have been seeing some weird docker-related messages on the server though… not quite sure what that's about… but I doubt it's relevant… it could also be something related to dedup / KSM or the zfs ARC getting things mixed up…

if I had any good reason to think it was my system, I wouldn't have made this post…
it seems very unlikely to me, but I suppose it's not impossible… I'm a bit scared to replicate it though, because I have no doubt it could kill my node dead… dead dead… in short order…

and I know that identity generation took me like 6–8 hours the last time, so it might be a live grenade… I'd better be damn sure the failsafes catch it…

That’s expected with this error. Stopping the node will settle all ongoing transfers and close all connections to the db’s. Giving the node a fresh start after that is likely to fix the problem. This is however not evidence that it’s not a system issue. You gave system resources a break and then it worked again. That pretty much proves the opposite. Like I said, identity generation is a completely independent process. The only way it could have anything to do with it is through impact on system resources.
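The "stopping the node settles transfers and closes the db connections" point can be seen in miniature with SQLite (a hypothetical sketch; two connections standing in for the node's db handles):

```python
import os
import sqlite3
import tempfile

# Sketch: once the connection holding the write lock goes away (the
# equivalent of "docker stop"), a fresh writer succeeds immediately.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

holder = sqlite3.connect(path, timeout=0.1, isolation_level=None)
holder.execute("CREATE TABLE t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")   # a stuck transaction holding the lock

holder.close()                      # stopping the process releases the lock

fresh = sqlite3.connect(path, timeout=0.1)
fresh.execute("INSERT INTO t VALUES (1)")  # no "database is locked" now
fresh.commit()
```

So a restart fixing it is exactly what you'd expect whether the lock holder was starved by IO or stuck for any other reason; it doesn't tell you which.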

Feel free to try and replicate it with the identity generation, because:

  1. It’s completely unrelated
  2. Even if it wasn't, audits don't need DBs, so as long as transfers are still happening, the occasional error like this isn't going to get your node disqualified.

Also… you’ve been posting all around about a disk acting up, but you’re sure it’s not a system issue? Meanwhile you’re having errors that are known to happen during IO bottlenecks. Give that a good think before you dismiss something.

I understand your point… I've only been forced to shut down my node three times because of something like this; the first two times were because of a bad zfs command, and the third time was only to confirm the issue…

so yeah, I did kind of link it to the identity generation; I cannot say this is a common problem that will affect everybody… it could be a combination of debian / zfs / docker, or proxmox, zfs, and docker,
which all use memory in some really intricate ways to conserve memory… I have also been running 2 nested containers with docker in them, and all VMs share a dedup storage…

it was the conclusion I derived from how things lined up… I also made it very clear that it might not be correct… but since you want to make such a point about it: I'm still pretty confident that it has to be related…

was it dangerous to my node? you most likely know better than I do… did I care to find out whether it was dangerous? not really…

IO bottlenecks… it does seem to slow my system down when it gets disk issues… I guess I should simply try to decrease the wait time allowed before the system ignores the problem and continues without the drive… I believe I might have increased the allowed timeout on the HBAs, which may be biting me in the arse…

but the disk is essentially redundant; I would almost rather the system continued without it… I doubt that was the reason though… none of the databases would be accessed from those drives; anything that has been used two or three times would be served from the L2ARC or ARC (memory), and any writes would go to the slog… so all database activity, even with high hdd latency, should be able to continue without access to the hdds… I have at times crashed my hdd pool by pulling more drives than the redundancy could handle… and it doesn't really matter much… sure, it does get kind of angry that it cannot access the drives… but that's the pieces, not the databases…
of course, not something I've made a habit of doing… I was testing what happened when I was starting out, to get a sense of how well my redundancy worked and what would happen with things like high latency or me doing crazy stuff like pulling the wrong drives…

didn't even lose a byte :smiley:
of course, I did add the disks back within a minute or two

How would a pure CPU load like generating an identity cause IO issues and database-locked errors? The only relation can be that it uses your CPU at 100%, so your already-high iowait got higher, the latency got higher, and at some point the storagenode ran into a timeout with the database and logged it as “database is locked”.
But that's still not an issue with identity creation. It'd be a general problem of high IO and high CPU usage. It could just as well have been CPU mining, video decoding, or similar.
However, typically in all those cases the CPU time is shared between all processes well enough that you wouldn't run into any DB issues. At least not typically, but as BrightSilence already mentioned, it was during a time when your setup had immense IO issues.

that's not what I think happened… I think that for some reason the new identity replaced or corrupted the old one in memory, which caused me to get db locked on every db request…

cpu utilization was never above 60–65%; besides, I don't think the xeon cpus would let high cpu utilization stall the OS, even if the identity generation could cause it… it kind of makes me wonder whether linux is using my cpu boost correctly…

with regard to the iowait… sure, some stuff was running slower than usual… a reboot seemed to fix whatever was going on; maybe some process or vm code never got terminated correctly…

it wasn't like I couldn't get access to the pool… and that's how it should be, since the drive is redundant; otherwise something isn't configured right… but it's of course possible… I'll have to do a deep dive on my HBA configuration and maybe the zfs device timeouts… because if that did happen, the drive should just be disconnected instead… otherwise, what's the point of redundancy if one drive having issues can disable the entire pool