Node start and stop errors

Hi guys,

recently I started having some issues with my node installed on a Raspberry Pi 4.

I’m running both MainNet and TestNet in parallel. From time to time I lose connection with MainNet, and when I try to restart the MainNet node I get the following error:

$ goal node start
Algorand node failed to start: node exited before we could contact it

Moreover, I cannot stop the node:

$ goal node stop
Cannot kill node: operation not permitted

Any idea about what is going on?

Can you look at $ALGORAND_DATA/algod-err.log and $ALGORAND_DATA/algod-out.log?

where $ALGORAND_DATA is your data folder, which is /var/lib/algorand by default if you used the Debian package.
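If it helps, here’s a quick way to dump the tails of both log files in one go (the default path is an assumption; adjust it if your data directory lives elsewhere):

```shell
# Show the last 50 lines of each daemon log; falls back to a note if a file
# is missing (e.g. right after a reboot, as in your case).
ALGORAND_DATA="${ALGORAND_DATA:-/var/lib/algorand}"   # Debian package default
for f in algod-err.log algod-out.log; do
  echo "== $f =="
  if [ -f "$ALGORAND_DATA/$f" ]; then
    tail -n 50 "$ALGORAND_DATA/$f"
  else
    echo "(missing)"
  fi
done
```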

algod-err.log is empty, maybe because I’ve just restarted the Raspberry Pi.

algod-out.log content:

Logging to:  /home/pi/node/data/node.log
Deadlock detection is set to: disabled (Default state is 'disable')
Initializing the Algorand node... Success!
⇨ http server started on 127.0.0.1:8080
Node running and accepting RPC requests over HTTP on port 127.0.0.1:8080. Press Ctrl-C to exit

Once I restart the Raspberry Pi the node starts, but I’m still unable to stop it:

$ goal node stop
Cannot kill node: operation not permitted

Most likely this is a permissions issue.

If you’re using the Debian package, the node is started under the algorand user, rather than your user.
It is best to use systemctl to manage it in that case.

If you want to keep using the goal command, you can run sudo -E -u algorand goal node stop.
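A quick sketch of what that looks like in practice, assuming the Debian package’s algorand systemd unit and its default /var/lib/algorand data directory:

```shell
# Check how the node is managed and who owns the data directory; both should
# point at the "algorand" user when the Debian package is installed.
systemctl status algorand --no-pager 2>/dev/null || echo "no algorand unit found"
ls -ld /var/lib/algorand 2>/dev/null || echo "/var/lib/algorand not present"

# Stopping/starting through systemd:
#   sudo systemctl stop algorand
#   sudo systemctl start algorand
#
# Or keep using goal, running it as the algorand user against the package's
# data directory:
#   sudo -E -u algorand goal node stop -d /var/lib/algorand
```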

I am also experiencing a deadlock on my MainNet archival node. I build from source and periodically stop/start my nodes. Today I noticed my MainNet node was stopped and both TestNet and BetaNet nodes were operational. No worries, it will catch up, I thought. I built master and stopped the other two nodes. They both restarted fine and are operational (however, those are not archival). The MainNet node produced the following (truncated) node.log file:

{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-si.algorand-mainnet.network:4160 for block 8191384","time":"2020-08-06T18:17:04.733513Z"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-cs.algorand-mainnet.network:4160 for block 8191385","time":"2020-08-06T18:17:05.271563Z"}
{"Context":"Agreement","Hash":"","ObjectPeriod":0,"ObjectRound":0,"ObjectStep":0,"Period":0,"Round":8191369,"Sender":"","Step":0,"Type":"RoundInterrupted","Weight":0,"WeightTotal":0,"file":"demux.go","function":"github.com/algorand/go-algorand/agreement.(*demux).next","level":"info","line":285,"msg":"agreement: round 8191369 ended early due to concurrent write;$
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-fr.algorand-mainnet.network:4160 for block 8191386","time":"2020-08-06T18:17:05.522169Z"}
{"Context":"Agreement","Hash":"","ObjectPeriod":0,"ObjectRound":0,"ObjectStep":0,"Period":0,"Round":8191370,"Sender":"","Step":0,"Type":"RoundInterrupted","Weight":0,"WeightTotal":0,"file":"demux.go","function":"github.com/algorand/go-algorand/agreement.(*demux).next","level":"info","line":285,"msg":"agreement: round 8191370 ended early due to concurrent write;$
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r13.algorand-mainnet.network:80 for block 8191387","time":"2020-08-06T18:17:05.545599Z"}
{"file":"deadlockLogger.go","function":"github.com/algorand/go-algorand/daemon/algod.(*dumpLogger).dump","level":"error","line":36,"msg":"[Stack] goroutine 66 [running]:\nruntime/debug.Stack(0xc000080820, 0xc000010570, 0xc00a940780)\n\truntime/debug/stack.go:24 +0x9f\ngithub.com/algorand/go-algorand/logging.logger.Error(0xc000080820, 0xc000010570, 0xc00a8cfc70,$
{"file":"deadlockLogger.go","function":"github.com/algorand/go-algorand/daemon/algod.(*dumpLogger).dump","level":"error","line":36,"msg":"POTENTIAL DEADLOCK:\nPrevious place where the lock was grabbed\ngoroutine 86 lock 0xc000259788\ngithub.com/algorand/go-algorand/ledger/ledger.go:333 ledger.(*Ledger).notifyCommit ??? \u003c\u003c\u003c\u003c\u003c\ngithub.com$
{"file":"deadlockLogger.go","function":"github.com/algorand/go-algorand/daemon/algod.setupDeadlockLogger.func1","level":"error","line":57,"msg":"[Stack] goroutine 66 [running]:\nruntime/debug.Stack(0xc000080820, 0xc000010570, 0xc00a941400)\n\truntime/debug/stack.go:24 +0x9f\ngithub.com/algorand/go-algorand/logging.logger.Panic(0xc000080820, 0xc000010570, 0xc00a$
{"file":"deadlockLogger.go","function":"github.com/algorand/go-algorand/daemon/algod.setupDeadlockLogger.func1","level":"panic","line":57,"msg":"potential deadlock detected","time":"2020-08-06T18:17:35.564844Z"}

algod-err.log:

goroutine 8320 [select]:
net/http.(*persistConn).writeLoop(0xc00524dc20)
        net/http/transport.go:2277 +0x11e
created by net/http.(*Transport).dialConn
        net/http/transport.go:1648 +0xc7d
^@^@^@^@^@^@^@^@ … (long run of NUL bytes, truncated)
panic: (*logrus.Entry) (0x1a30dc0,0xc00a941630) [recovered]
        panic: (*logrus.Entry) (0x1a30dc0,0xc00a941630)

goroutine 66 [running]:
github.com/algorand/go-algorand/logging.logger.Panic.func1(0xc000080820, 0xc000010570)
        github.com/algorand/go-algorand/logging/log.go:256 +0x89
panic(0x1a30dc0, 0xc00a941630)
        runtime/panic.go:969 +0x16a
github.com/sirupsen/logrus.Entry.log(0xc000076ea0, 0xc00a9682a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/sirupsen/logrus@v1.0.5/entry.go:112 +0x2f5
github.com/sirupsen/logrus.(*Entry).Panic(0xc00a9415e0, 0xc00a9720d0, 0x1, 0x1)
        github.com/sirupsen/logrus@v1.0.5/entry.go:182 +0x104
github.com/algorand/go-algorand/logging.logger.Panic(0xc000080820, 0xc000010570, 0xc00a9720d0, 0x1, 0x1)
        github.com/algorand/go-algorand/logging/log.go:260 +0x170
github.com/algorand/go-algorand/daemon/algod.setupDeadlockLogger.func1()
        github.com/algorand/go-algorand/daemon/algod/deadlockLogger.go:57 +0x24a
github.com/algorand/go-deadlock.lock(0xc00548e060, 0x1a85920, 0xc000259788, 0x1)
        github.com/algorand/go-deadlock@v0.2.1/deadlock.go:252 +0x833
github.com/algorand/go-deadlock.(*RWMutex).RLock(0xc000259788)
        github.com/algorand/go-deadlock@v0.2.1/deadlock.go:152 +0xc1
github.com/algorand/go-algorand/ledger.(*Ledger).Lookup(0xc000259000, 0x7cfc4d, 0xa32dc9e980015614, 0x306e3572c8a31a17, 0xc14ba7c4967feaa8, 0x8fcb949c92685a75, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/algorand/go-algorand/ledger/ledger.go:392 +0x9b
github.com/algorand/go-algorand/data.(*Ledger).BalanceRecord(0xc000263060, 0x7cfc4d, 0xa32dc9e980015614, 0x306e3572c8a31a17, 0xc14ba7c4967feaa8, 0x8fcb949c92685a75, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/algorand/go-algorand/data/ledger.go:191 +0x9d
github.com/algorand/go-algorand/agreement.membership(0x1aae980, 0xc000263060, 0xa32dc9e980015614, 0x306e3572c8a31a17, 0xc14ba7c4967feaa8, 0x8fcb949c92685a75, 0x7cfd8d, 0x0, 0x2, 0x0, ...)
        github.com/algorand/go-algorand/agreement/selector.go:67 +0x18d
github.com/algorand/go-algorand/agreement.unauthenticatedVote.verify(0xa32dc9e980015614, 0x306e3572c8a31a17, 0xc14ba7c4967feaa8, 0x8fcb949c92685a75, 0x7cfd8d, 0x0, 0x2, 0x0, 0x7649dd7a1025ef2c, 0x6def3595f4d49cdb, ...)
        github.com/algorand/go-algorand/agreement/vote.go:91 +0xf6
github.com/algorand/go-algorand/agreement.(*AsyncVoteVerifier).executeVoteVerification(0xc0018b2050, 0x19abfe0, 0xc0052b5000, 0x2, 0x2)
        github.com/algorand/go-algorand/agreement/asyncVoteVerifier.go:105 +0x26b
github.com/algorand/go-algorand/util/execpool.(*pool).worker(0xc0001cdd40)
        github.com/algorand/go-algorand/util/execpool/pool.go:159 +0x122
created by github.com/algorand/go-algorand/util/execpool.MakePool
        github.com/algorand/go-algorand/util/execpool/pool.go:85 +0x172

Until now I didn’t need to restart the node — I just have it start automatically when my Raspberry Pi boots — but I’m experiencing something similar to yours: I go out of sync only on MainNet while TestNet keeps running smoothly.

It happened again: I lost connection with MainNet.

Here is the algod-err.log:

runtime: out of memory: cannot allocate 4202496-byte block (2622619648 in use)
fatal error: out of memory
...

Hi @cusma,

You have not specified which version you’re running. A new version was released today which reduces memory utilization. Would you mind trying it out?
(In this case, it’s more than just a good idea to keep your node updated…)

Also, how much memory is available on your Raspberry Pi? The above indicates that 2 GB are in use. Is that the case?
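For reference, a couple of standard Linux commands to answer that — total memory, headroom, and algod’s own resident set size (nothing Algorand-specific here):

```shell
# Overall memory picture: total, used, free, and swap.
free -h

# Resident memory of the algod process(es), in KiB; prints a note if algod
# isn't currently running.
ps -C algod -o rss=,comm= 2>/dev/null || echo "algod not running"
```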

Hi @tsachi,

my node just updated (the update is scheduled through crontab); this is the current version:

8589934601
2.0.9.stable [rel/stable] (commit #a7718244)
go-algorand is licensed with AGPLv3.0
source code available at https://github.com/algorand/go-algorand

I run the node on a Raspberry Pi 4 with 4 GB of RAM in total; nothing but the node runs on it (both TestNet and MainNet run in parallel).

The node is currently synchronizing; I’ll report back here if something goes wrong again.

Hi, today I lost sync with MainNet again. I’m running this node version on a Raspberry Pi 4 (with 4 GB RAM):

2.0.9.stable [rel/stable] (commit #a7718244)
go-algorand is licensed with AGPLv3.0
source code available at https://github.com/algorand/go-algorand

This is the algod-err.log:

runtime: out of memory: cannot allocate 16384-byte block (2489974784 in use)
fatal error: out of memory

@cusma,

Could you please confirm whether, prior to running out of memory, the node was able to sync up with mainnet or not?

Is that an archival node, a relay node, an indexer node, or none of the above?

Hi @tsachi,

the node is a simple Participation Node (neither Archival, Relay, nor Indexer).

Since the recent growth in transaction volume, the node has been experiencing an increased “Time since last block” on MainNet for some protocol rounds. Before the growth, the “Time since last block” was always under 5 seconds.

I noticed that the node currently accumulates delay sometimes, then catches up again, so I’m not 100% sure it was fully in sync before the memory error.

Hmm. I’m not sure why the node was accumulating delays.
It could be, as you suggested, due to the higher congestion we’ve seen on mainnet.
If that is truly the case, then maybe the upcoming Wednesday release will improve the situation, as it has several performance optimizations.

One other issue that could negatively affect node performance is slow storage.
On the Raspberry Pi that’s an SD card, and I have never tested its performance relative to an x86 platform equipped with an SSD. I know that some low-end network storage devices, which typically have higher latency, can’t sustain the full block load (i.e. ~1200 txn/sec).
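To get a rough feel for the card’s sustained write speed, a crude dd test like the one below can help (the /tmp path is just an example — point it at the filesystem that holds your data directory):

```shell
# Write 64 MiB with a forced flush so the reported rate reflects the card,
# not the page cache; the MB/s figure is on the last line of dd's output.
dd if=/dev/zero of=/tmp/algod-io-test bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1

# Clean up the test file.
rm -f /tmp/algod-io-test
```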

Last, I suspect that the reason it ran out of memory is that you were running both testnet and mainnet on the device. As mainnet grows in volume, the temporary per-round changes grow as well, which eventually might exhaust the device’s memory.

Yes, I was talking to @fabrice about the issue and he suggested exactly the same! I’m using a SanDisk 64 GB Extreme microSDXC UHS-I A2 (Read 4000 IOPS, Write 2000 IOPS) as storage on the Raspberry Pi 4, so roughly 30 MB/s minimum sequential write speed.

Yes, the only two processes running in parallel on the Raspberry Pi 4 are the MainNet and TestNet nodes. These were two snapshots of the hardware load at different moments.

I also ran into a similar problem a few days back on my mainnet instance (an archival Debian node). It was only a couple of hundred thousand blocks away from being fully synced. (I was gutted!)

My first thought was that my ancient machine had just run out of resources to cope with the size of the mainnet. On closer inspection the problem looked more like something was wrong with the algorand code or configuration.

The symptoms were that the machine became unresponsive, to the point that I had to power it off to be able to intervene; I then had to stop algod immediately after booting up, otherwise I would be locked out of the console.

After moving node.log out of the way and restarting the mainnet instance, node.log showed:

Logging Starting
Telemetry Enabled: 54ab6020-52cf-444d-be74-155cc5450220
Session:
++++++++++++++++++++++++++++++++++++++++
{"callee":"github.com/algorand/go-algorand/ledger.(*blockQueue).getBlock.func1","caller":"/root/go/src/github.com/algorand/go-algorand/ledger/blockqueue.go:247","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).Atomic","level":"warning","line":272,"msg":"dbatomic: tx took 4.284393143s","name":"","readonly":true,"time":"2020-08-18T09:33:28.068428+01:00"}
{"callee":"github.com/algorand/go-algorand/ledger.(*blockQueue).getBlock.func1","caller":"/root/go/src/github.com/algorand/go-algorand/ledger/blockqueue.go:247","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).Atomic","level":"warning","line":272,"msg":"dbatomic: tx took 4.86935983s","name":"","readonly":true,"time":"2020-08-18T09:34:35.065294+01:00"}
{"callee":"github.com/algorand/go-algorand/ledger.(*blockQueue).getBlock.func1","caller":"/root/go/src/github.com/algorand/go-algorand/ledger/blockqueue.go:247","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).Atomic","level":"warning","line":272,"msg":"dbatomic: tx took 2.603299742s","name":"","readonly":true,"time":"2020-08-18T09:35:25.537086+01:00"}

As I’m currently experimenting with the speed of synchronisation I decided to trash my installation and start again.

I moved /var/lib/algorand out of the way and ran the Debian installation commands again - initially without removing the algorand package.

When I got to

sudo add-apt-repository "deb https://releases.algorand.com/deb/ stable main"

I noticed something that I don’t recall seeing first time around:

$ sudo add-apt-repository "deb https://releases.algorand.com/deb/ stable main"
Hit:1 http://gb.archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://gb.archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:3 http://packages.microsoft.com/repos/vscode stable InRelease
Get:4 http://gb.archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:5 https://download.docker.com/linux/ubuntu bionic InRelease
Get:6 https://releases.algorand.com/deb stable InRelease [2,995 B]
Hit:7 http://dl.google.com/linux/chrome/deb stable InRelease
Get:8 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Reading package lists… Done
E: Repository 'https://releases.algorand.com/deb stable InRelease' changed its 'Origin' value from 'algorand' to 'Algorand'
E: Repository 'https://releases.algorand.com/deb stable InRelease' changed its 'Label' value from 'algorand' to 'Algorand'
N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details.

What is the cause of the change from lowercase ‘a’ to uppercase ‘A’?

Might this be causing my problems?

I continued with the installation (2.0.8.stable) but the mainnet node still refused to function so I removed the algorand package and cleaned out /var/lib/algorand:

sudo apt-get remove algorand
sudo rm -rf /var/lib/algorand/*

and began the installation again (I double-checked the permissions on /var/lib/algorand and they are "drwxr-xr-x" with owner and group set to "algorand").

I still got the same warning about the change from algorand to Algorand, and still the node would not run. A quick look in /var/log/syslog showed that algorand.service was aborting due to "cannot read genesis file /var/lib/algorand/genesis.json". All that the installation had put in /var/lib/algorand was config.json.example and the genesis directory.

Next step was to remove and purge the algorand package:

sudo apt-get remove algorand
sudo apt-get purge algorand

This threw up no errors but when I next ran

sudo apt-get update

it displayed the same error:

E: Repository 'https://releases.algorand.com/deb stable InRelease' changed its 'Origin' value from 'algorand' to 'Algorand'
E: Repository 'https://releases.algorand.com/deb stable InRelease' changed its 'Label' value from 'algorand' to 'Algorand'
N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details.

It seems that the case change has screwed with the APT signature checking and I don’t seem to be able to fully purge the system of the algorand package.
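For what it’s worth, newer apt versions (1.5 and later, which bionic’s apt 1.6 satisfies) have a flag to accept changed release info in one step. This is a general apt mechanism rather than anything the Algorand docs prescribe, so treat it as a suggestion:

```shell
# Accept the repository's new Origin/Label values; without this, apt-get
# update keeps refusing the repository (see apt-secure(8)).
# Run as root on the affected machine.
apt-get update --allow-releaseinfo-change 2>/dev/null \
  || echo "re-run as root: sudo apt-get update --allow-releaseinfo-change"
```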

In my case, Ubuntu 16 LTS became unresponsive. The symptoms were very similar:

> the machine became unresponsive to the point that I had to power off the machine to be able to intervene and stop algod immediately after booting up otherwise I would be locked out of the console.

I noticed that the system began to swap. You can check this too, using the top command.

I changed the maximum usable memory of the VM from 4 GB to 8 GB, and that solved my problem.
(I use VirtualBox running under Windows 10.)
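To check for swapping with standard Linux tools (no Algorand-specific commands needed):

```shell
# If "used" swap is non-zero and growing while algod runs, the node is being
# paged out, which matches the unresponsiveness described above.
free -h

# List active swap devices, if any (reads /proc/swaps; no root needed).
swapon --show 2>/dev/null || echo "no swap devices listed"
```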

Memory constraints were my first thought - thanks for sharing your experience.

However, I still have the problem that I am unable to install a new working instance after attempting to remove the algorand package from my machine. A new, non-archival instance should not be consuming so much more memory that it would kill my machine — unless the upgrade to 2.0.8 has resulted in a much more resource-heavy process.

I’m still puzzled by the change from “algorand” to “Algorand” being reported by the package manager.

I believe that the package name was reverted to preserve functionality.

Also, please re-evaluate with the latest release (2.1.3). It has a longer one-time startup, but should consume less memory afterward.

Hi all! We also faced issues with the new version.

On one of the computers we could recover by manually restarting it, but on the other, no luck. goal node start hangs, and goal node stop cannot kill it.

The algod process starts with telemetry, but the REST API is not mounted. The error log is empty. We tried removing the catchup folder and the shm files, without success.

Regards,
Mauro.

The REST API not being mounted typically means that the node is still starting up.
On certain configurations (especially without fast storage), it can take a long time.

Try monitoring the log file while the node is starting.
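For example, something like this (assuming the Debian package’s default data directory — substitute your own $ALGORAND_DATA):

```shell
# Show the tail of the node's own log; steady progress here (even if slow)
# means the node is working through startup rather than hung.
LOG="${ALGORAND_DATA:-/var/lib/algorand}/node.log"
if [ -f "$LOG" ]; then
  tail -n 30 "$LOG"
else
  echo "no log at $LOG yet"
fi
```

To keep watching it live, the same path works with `tail -f`.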

Note that archival nodes can take much longer to start than non-archival nodes.