Fast Catchup not working for our node on MainNet

Our node was seeing memory issues. We migrated to a new machine, installed 2.1.3, and tried Fast Catchup. The node is non-archival. The process consumes large amounts of memory (over 7GB on our machine) and fails after downloading the catchpoint blocks.

Here is a snapshot of the catchup in flight:

$ algod -v
8590000131
2.1.3.stable [rel/stable] (commit #30c8dd68)
$ cat /var/lib/algorand/config.json
{
        "Version": 11,
        "EnableMetricReporting": true
}
$ goal node catchup 8550000#MYSKTQ7KYLYLPOU275ZUBPMOBAUMI7OYZY2AZSSAJGIQQIWO4OAQ -d /var/lib/algorand
$ goal node status -w 1000
Last committed block: 11130
Sync Time: 2883.3s
Catchpoint: 8550000#MYSKTQ7KYLYLPOU275ZUBPMOBAUMI7OYZY2AZSSAJGIQQIWO4OAQ
Catchpoint total accounts: 4174171
Catchpoint accounts processed: 4174171
Catchpoint total blocks: 1000
Catchpoint downloaded blocks: 653
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

We don’t see any obvious error in the logs when searching through them, and the node seems to gracefully fall back into the regular sync (very obvious due to the different CPU and memory profile). We’ve already re-imaged the machine and started the process 4 different times, increasing resources each time.

Any insights or assistance is appreciated.

-jimmy

You can rest assured that the log file does contain information on why the fast catchup aborted. Try grepping for the word “catchpoint”. If it’s due to too many failures while downloading blocks, you can increase the max number of failed attempts in the config.json file.
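For example (a minimal sketch; the demo writes a synthetic log line so the command is self-contained - in practice you would point grep at your real node.log in the data directory, e.g. /var/lib/algorand/node.log):

```shell
# Synthetic demo of filtering catchpoint messages from a log file.
# In practice, replace "$LOG" with your node.log path.
LOG=$(mktemp)
echo '{"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts"}' > "$LOG"
grep -ci "catchpoint" "$LOG"   # prints the count of matching lines: 1
rm -f "$LOG"
```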

We were filtering on “error” or “fatal” levels :expressionless: These should be the messages from the last 3 attempts.

{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : http: unexpected EOF reading trailer","name":"","time":"2020-08-19T20:35:59.053312Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-08-19T21:22:01.959916Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : No ledger available for given round","name":"","time":"2020-08-19T21:25:41.817713Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-08-19T22:18:17.268793Z"}

If we were to try again, what settings would be advisable beyond the defaults? The default number of retries seems high already. And lastly, do we remove the data directory and start clean or just change the config and restart the node with the updated configuration before we trigger the catchup?

Thanks for the advice,
-j

According to your previous status, it looks like you were able to download more than 600 out of the 1000 blocks. Why don’t you quadruple the number of failure attempts and try again?
No need to delete the data directory.

The high number of errors is because not all the relays have completed the upgrade. It will get better in a few days.

Try to set CatchupBlockDownloadRetryAttempts to 4000 in the config.json and see if that helps.
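For reference, a config.json with that setting would look like this (extending the example config quoted at the top of the thread; your other keys may differ):

```json
{
        "Version": 11,
        "EnableMetricReporting": true,
        "CatchupBlockDownloadRetryAttempts": 4000
}
```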

Thank you @tsachi, that seems to have got it through, and it is now working on the normal sync from the catchpoint to the current block. For future reference, Fast Catchup took around 33 minutes and algod memory utilization scratched 7GB for 3/4 of the time.

-j

I decided to reinstall a fresh node, so I was trying to sync it using Fast Catchup. During the process something went wrong and the sync stopped:

$ goal node status
Last committed block: 140
Sync Time: 1039.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

I manually restarted the node:

$ goal node start
Algorand node successfully started!

Then I resumed the process, but it started again from the beginning (losing the accounts processed) and failed again at the same processed-accounts number:

$ goal node status
Last committed block: 140
Sync Time: 1197.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

Any idea on what is causing this failure?

Hey @cusma. That is exactly what we were seeing. In our case, algod was continuing to request memory until at some point it failed and the process(es) crashed making the REST endpoint unavailable. Definitely follow @tsachi’s advice and look into your logs to track down the issue.

-j

Following what @tsachi suggested and setting CatchupBlockDownloadRetryAttempts to 4000 in the config.json, the process passed the previous 1113600 processed accounts but got stuck at 1244672:

$ goal node status -w 1000
Last committed block: 415
Sync Time: 1266.7s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1244672
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

I’m trying again, increasing CatchupBlockDownloadRetryAttempts up to 8000, to see if it succeeds.

It didn’t work; the process failed again at 1113600 processed accounts even with CatchupBlockDownloadRetryAttempts set to 8000:

$ goal node status -w 1000
Last committed block: 488
Sync Time: 1066.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

Looking into algod-err.log, it seems that I have a memory issue:

runtime: out of memory: cannot allocate 98304-byte block (2838134784 in use)
fatal error: out of memory

runtime stack:
runtime.throw(0xb5d048ca, 0xd)
	/usr/local/go/src/runtime/panic.go:617 +0x60
runtime.largeAlloc(0x16b68, 0xb5100101, 0xba8)
	/usr/local/go/src/runtime/malloc.go:1057 +0x14c
runtime.mallocgc.func1()
	/usr/local/go/src/runtime/malloc.go:950 +0x38
runtime.systemstack(0xb6f0c9f0)
	/usr/local/go/src/runtime/asm_arm.s:354 +0x90
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1153

Two issues were mentioned here:

  1. Fast catchup resets after a node restart:
    That’s true. Once started, it retries up to the configured number of attempts before giving up. On each attempt, it starts from the last “step”, so if it was downloading the accounts, it resets these and tries again.
  2. Memory consumption during the catchup process is higher than desirable. I’ll try to reproduce it locally. It was never an issue for me, but the number of accounts is growing rapidly, so the tests I conducted might have become obsolete.

Would it be possible to split the catchup process into smaller steps by providing a set of intermediate catchpoints, each with lower individual requirements than the latest one?

My node is also failing to succeed with the Fast Catchup on mainnet.

After 6 hours it didn’t seem to be getting any further. The symptoms visible in the status were that the sync time continued to climb, and the “Catchpoint accounts processed” count climbed for a while, then started back at the beginning, climbing a bit higher the next time. After doing this several times, it seemed unable to get past 1113600 processed catchpoint accounts.

$ goal node status -d /var/lib/algorand -w 1000
Last committed block: 8570
Sync Time: 2382.1s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1048064
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
^C

$ goal node status -d /var/lib/algorand -w 1000
Last committed block: 8570
Sync Time: 22162.2s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
^C

I gave up and aborted the catchup and stopped algod.

A quick look in node.log showed repeated messages along the lines of:

{"callee":"github.com/algorand/go-algorand/ledger.(*CatchpointCatchupAccessorImpl).processStagingBalances.func1","caller":"/root/go/src/github.com/algorand/go-algorand/ledger/catchupaccessor.go:294","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).atomic","level":"warning","line":355,"msg":"dbatomic: tx surpassed expected deadline by 1.131643661s","name":"","readonly":false,"time":"2020-08-20T11:12:38.108569+01:00"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : unexpected EOF","name":"","time":"2020-08-20T12:22:31.222640+01:00"}

I see @cusma has the same symptoms.

From what @jimmy has posted, it sounds as though there may be little hope for my ancient node until I give it a memory upgrade beyond 4GB.

I’ll try increasing CatchupBlockDownloadRetryAttempts - perhaps to 8000 - and let you know how it goes.

I’ve not yet found any documentation for recommended node specifications - can you point me to it?


You’re correct about that setting.
If you add:
"CatchupBlockDownloadRetryAttempts": 5000
to your config.json (possibly higher - the default is 1000), then fast catchup should finish. It will retry the failed blocks against other nodes and succeed.

The problem is that not all mainnet relay nodes have been upgraded, so when your node tries to fetch blocks from some of the relays it will fail. There’s a default limit on block download failures, and when it hits that limit it aborts the whole thing.

I will certainly try increasing the retry attempts, but I’m also going to need to reconfigure my old test laptop, which currently has 4GB RAM and 3GB swap. It was thrashing to the point of timeout, and if it really needs 7GB of virtual memory then it will still not work until there is a way of limiting the amount of memory algod can consume.

I would need to perform some more tests, but it seems that the amount of memory required by algod is proportional to the number of accounts (presumably loaded into a memory-resident DB). Memory utilisation appears to be significantly inflated by the fast catchup process.
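One simple way to watch that growth (a sketch; the demo reads the current shell’s own memory so it is self-contained - in practice substitute algod’s PID, e.g. from `pidof algod`, and sample it periodically during catchup):

```shell
# Read the resident set size (VmRSS) of a process from /proc (Linux only).
# $$ is this shell's own PID; replace it with algod's PID in practice.
grep VmRSS /proc/$$/status
```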

Hi @tsachi, I’ve recently upgraded my node’s hardware to a Raspberry Pi 4 with 8GB RAM; this time I installed Raspberry Pi OS 64-bit (so each process is no longer limited to 3GB RAM).

I’m still not able to fast catchup the MainNet.

With this new setup the memory overload errors seem to have disappeared, but during the Fast Catchup the Processed Accounts number restarts from the beginning all of a sudden.

I’ve already set CatchupBlockDownloadRetryAttempts to 8000 in the config.json as you suggested.

Let me know if my node.log could be useful and where I can send it.

@cusma, I did not know an 8GB Raspberry Pi exists… that makes me curious: does the armv6 code (which we compiled for all the Raspberry Pis) support more than 4GB? I thought it was limited to 4GB.

Regarding the memory consumption issue, I can confirm that it does seem to be correlated with the large account number on mainnet. The recent betanet release (2.1.4) did not provide a fix for this issue, unfortunately (BetaNet Update: Version 2.1.4).

To get access to the whole address space, @cusma would have to build go-algorand targeting armv8, not just install a 64-bit OS. The armv6 and armv7 ISAs are 32-bit only.
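Two quick checks can distinguish the cases (the algod path below is an assumption - adjust it to wherever your binary lives):

```shell
# Kernel architecture: "aarch64" means a 64-bit kernel, but that alone
# doesn't lift the per-process limit if algod itself is a 32-bit binary.
uname -m
# Binary architecture (uncomment and adjust the path to your algod):
# file /usr/bin/algod
```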

@tsachi Raspberry Pi 4 8GB was released in May by the Raspberry Pi Foundation; the PCB is slightly different from older versions due to thermal management improvements.

@jimmy, as always I just followed this node installation guide for Linux distros on Raspberry Pi OS 64-bit (still in beta; old Raspbian was recently deprecated by the Raspberry Pi Foundation in the switch to Raspberry Pi OS). Does this procedure automatically install the armv6 build?

So, help me understand whether there is any action I can take to build the correct version of go-algorand in order to leverage all 8GB of RAM without limitations.

I haven’t tested that, but there is an official release for arm64, which might work for the above OS.

@tsachi I think that is exactly what I get by doing:

wget https://raw.githubusercontent.com/algorand/go-algorand-doc/master/downloads/installers/update.sh

and then runnning the installer from node directory, isn’t it?

One doubt: is the fact that the Processed Accounts number resets during the catchup without a memory fatal error (which I had with the previous 4GB 32-bit node setup) still related to RAM issues?