Fast Catchup not working for our node on MainNet

Our node was seeing memory issues. We migrated to a new machine, installed 2.1.3, and tried Fast Catchup. The node is non-archival. The process consumes large amounts of memory (over 7GB on our machine) and fails after downloading the catchpoint blocks.

Here is a snapshot of the catchup in flight:

$ algod -v
8590000131
2.1.3.stable [rel/stable] (commit #30c8dd68)
$ cat /var/lib/algorand/config.json
{
        "Version": 11,
        "EnableMetricReporting": true
}
$ goal node catchup 8550000#MYSKTQ7KYLYLPOU275ZUBPMOBAUMI7OYZY2AZSSAJGIQQIWO4OAQ -d /var/lib/algorand
$ goal node status -w 1000
Last committed block: 11130
Sync Time: 2883.3s
Catchpoint: 8550000#MYSKTQ7KYLYLPOU275ZUBPMOBAUMI7OYZY2AZSSAJGIQQIWO4OAQ
Catchpoint total accounts: 4174171
Catchpoint accounts processed: 4174171
Catchpoint total blocks: 1000
Catchpoint downloaded blocks: 653
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

We don’t see any obvious error in the logs when searching through them, and the node seems to gracefully fall back into the regular sync (very obvious due to the different CPU and memory profile). We’ve already re-imaged the machine and started the process 4 different times, increasing resources each time.

Any insights or assistance is appreciated.

-jimmy

You can rest assured that the log file does contain information on why the fast catchup aborted. Try grepping for the word “catchpoint”. If it’s due to too many failures while downloading blocks, you can increase the max number of failed attempts in the config.json file.
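For example (a minimal sketch; the demo writes a synthetic log line so the command is self-contained - in practice you would point grep at your real node.log in the data directory, e.g. /var/lib/algorand/node.log):

```shell
# Synthetic demo of filtering catchpoint messages from a log file.
# In practice, replace "$LOG" with your node.log path.
LOG=$(mktemp)
echo '{"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts"}' > "$LOG"
grep -ci "catchpoint" "$LOG"   # prints the count of matching lines: 1
rm -f "$LOG"
```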

We were filtering on “error” or “fatal” levels :expressionless: These should be the messages from the last 3 attempts.

{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : http: unexpected EOF reading trailer","name":"","time":"2020-08-19T20:35:59.053312Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-08-19T21:22:01.959916Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : No ledger available for given round","name":"","time":"2020-08-19T21:25:41.817713Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-08-19T22:18:17.268793Z"}

If we were to try again, what settings would be advisable beyond the defaults? The default number of retries seems high already. And lastly, do we remove the data directory and start clean or just change the config and restart the node with the updated configuration before we trigger the catchup?

Thanks for the advice,
-j

According to your previous status, it looks like you were able to download more than 600 out of the 1000 blocks. Why don’t you quadruple the number of failure attempts and try again?
No need to delete the data directory.

The high number of errors is because not all the relays have completed the upgrade. It will get better in a few days.

Try to set CatchupBlockDownloadRetryAttempts to 4000 in the config.json and see if that helps.
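For reference, a config.json with that setting would look like this (extending the example config quoted at the top of the thread; your other keys may differ):

```json
{
        "Version": 11,
        "EnableMetricReporting": true,
        "CatchupBlockDownloadRetryAttempts": 4000
}
```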

Thank you @tsachi, that seems to have got it through, and it is now working on the normal sync from the catchpoint to the current block. For future reference, Fast Catchup took around 33 minutes and algod memory utilization scratched 7GB for 3/4 of the time.

-j

I decided to reinstall a fresh node, so I was trying to sync it using Fast Catchup. During the process something went wrong and the sync stopped:

$ goal node status
Last committed block: 140
Sync Time: 1039.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

I manually restarted the node:

$ goal node start
Algorand node successfully started!

Then I resumed the process, but it started again from the beginning (losing the accounts processed) and failed again at the same processed-accounts number:

$ goal node status
Last committed block: 140
Sync Time: 1197.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

Any idea on what is causing this failure?

Hey @cusma. That is exactly what we were seeing. In our case, algod was continuing to request memory until at some point it failed and the process(es) crashed making the REST endpoint unavailable. Definitely follow @tsachi’s advice and look into your logs to track down the issue.

-j

Following what @tsachi suggested and setting CatchupBlockDownloadRetryAttempts to 4000 in the config.json, the process passed the previous 1113600 processed accounts but got stuck at 1244672:

$ goal node status -w 1000
Last committed block: 415
Sync Time: 1266.7s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1244672
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

I’m trying again, increasing CatchupBlockDownloadRetryAttempts up to 8000, to see if it succeeds.

It didn’t work; the process failed again at 1113600 processed accounts even with CatchupBlockDownloadRetryAttempts set to 8000:

$ goal node status -w 1000
Last committed block: 488
Sync Time: 1066.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.

Looking into algod-err.log, it seems that I have a memory issue:

runtime: out of memory: cannot allocate 98304-byte block (2838134784 in use)
fatal error: out of memory

runtime stack:
runtime.throw(0xb5d048ca, 0xd)
	/usr/local/go/src/runtime/panic.go:617 +0x60
runtime.largeAlloc(0x16b68, 0xb5100101, 0xba8)
	/usr/local/go/src/runtime/malloc.go:1057 +0x14c
runtime.mallocgc.func1()
	/usr/local/go/src/runtime/malloc.go:950 +0x38
runtime.systemstack(0xb6f0c9f0)
	/usr/local/go/src/runtime/asm_arm.s:354 +0x90
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1153

Two issues were mentioned here:

  1. Fast catchup resets after a node restart:
    That’s true. Once started, it retries up to the configured number of attempts before giving up. On each attempt, it starts from the last “step”, so if it was downloading the accounts, it resets these and tries again.
  2. Memory consumption during the catchup process is higher than desirable. I’ll try to reproduce it locally. It was never an issue for me, but the number of accounts is growing rapidly, so the tests I conducted might have become obsolete.

Would it be possible to split the catchup process into smaller steps by providing a set of intermediate catchpoints, each with lower individual requirements than the latest one?

My node is also failing to succeed with the Fast Catchup on mainnet.

After 6 hours it didn’t seem to be getting any further. The symptoms visible in the status were that the sync time continued to climb, and the “Catchpoint accounts processed” count climbed for a while, then started back at the beginning, climbing a bit higher the next time. After doing this several times, it seemed unable to get past 1113600 processed catchpoint accounts.

$ goal node status -d /var/lib/algorand -w 1000
Last committed block: 8570
Sync Time: 2382.1s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1048064
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
^C

$ goal node status -d /var/lib/algorand -w 1000
Last committed block: 8570
Sync Time: 22162.2s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
^C

I gave up and aborted the catchup and stopped algod.

A quick look in node.log showed repeated messages along the lines of:

{"callee":"github.com/algorand/go-algorand/ledger.(*CatchpointCatchupAccessorImpl).processStagingBalances.func1","caller":"/root/go/src/github.com/algorand/go-algorand/ledger/catchupaccessor.go:294","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).atomic","level":"warning","line":355,"msg":"dbatomic: tx surpassed expected deadline by 1.131643661s","name":"","readonly":false,"time":"2020-08-20T11:12:38.108569+01:00"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : unexpected EOF","name":"","time":"2020-08-20T12:22:31.222640+01:00"}

I see @cusma has the same symptoms.

From what @jimmy has posted, it sounds as though there may be little hope for my ancient node until I give it a memory upgrade beyond 4GB.

I’ll try increasing CatchupBlockDownloadRetryAttempts - perhaps to 8000 - and let you know how it goes.

I’ve not yet found any documentation for recommended node specifications - can you point me to it?


You’re correct about that setting.
If you add:
"CatchupBlockDownloadRetryAttempts": 5000
to your config.json (possibly higher - the default is 1000), then fast catchup should finish. It will retry the failed blocks against other nodes and succeed.

The problem is that not all mainnet relay nodes have been upgraded, so when your node tries to fetch blocks from some of the relays it will fail. There’s a default limit on block download failures, and when it hits that limit it aborts the whole thing.

I will certainly try increasing the retry attempts, but I’m also going to need to reconfigure my old test laptop, which currently has 4GB RAM and 3GB swap. It was thrashing to the point of timeout, and if it really needs 7GB of virtual memory then it will still not work until there is a way of limiting the amount of memory algod can consume.

I would need to perform some more tests, but it seems that the amount of memory required by algod is proportional to the number of accounts (presumably loaded into a memory-resident DB). Memory utilisation appears to be significantly inflated by the fast catchup process.
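One simple way to watch that growth (a sketch; the demo reads the current shell’s own memory so it is self-contained - in practice substitute algod’s PID, e.g. from `pidof algod`, and sample it periodically during catchup):

```shell
# Read the resident set size (VmRSS) of a process from /proc (Linux only).
# $$ is this shell's own PID; replace it with algod's PID in practice.
grep VmRSS /proc/$$/status
```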

Hi @tsachi, I’ve recently upgraded my node’s hardware to a Raspberry Pi 4 with 8GB RAM; this time I installed Raspberry Pi OS 64-bit (so each process is no longer limited to 3GB RAM).

I’m still not able to fast catchup the MainNet.

With this new setup the memory overload errors seem to have disappeared, but during the Fast Catchup the Processed Accounts number restarts from the beginning all of a sudden.

I’ve already set CatchupBlockDownloadRetryAttempts to 8000 in the config.json as you suggested.

Let me know if my node.log could be useful and where I can send it.

@cusma, I did not know an 8GB Raspberry Pi exists… that makes me curious: does the armv6 code (which we compiled for all the Raspberry Pis) support more than 4GB? I thought it was limited to 4GB.

Regarding the memory consumption issue, I can confirm that it does seem to be correlated with the large account number on mainnet. The recent betanet release (2.1.4) did not provide a fix for this issue, unfortunately (BetaNet Update: Version 2.1.4).

To get access to the whole address space, @cusma would have to build go-algorand targeting armv8, not just install a 64-bit OS. The armv6 and armv7 ISAs are 32-bit only.
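Two quick checks can distinguish the cases (the algod path below is an assumption - adjust it to wherever your binary lives):

```shell
# Kernel architecture: "aarch64" means a 64-bit kernel, but that alone
# doesn't lift the per-process limit if algod itself is a 32-bit binary.
uname -m
# Binary architecture (uncomment and adjust the path to your algod):
# file /usr/bin/algod
```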

@tsachi Raspberry Pi 4 8GB was released in May by the Raspberry Pi Foundation; the PCB is slightly different from older versions due to thermal management improvements.

@jimmy, as always I just followed this node installation guide for Linux distros on Raspberry Pi OS 64-bit (still in beta; old Raspbian was recently deprecated by the Raspberry Pi Foundation in the switch to Raspberry Pi OS). Does this procedure automatically install the armv6 build?

So, help me understand whether there is any action I can take to build the correct version of go-algorand in order to leverage all 8GB of RAM without limitations.

I haven’t tested that, but there is an official release for arm64, which might work for the above OS.

@tsachi I think that is exactly what I get by doing:

wget https://raw.githubusercontent.com/algorand/go-algorand-doc/master/downloads/installers/update.sh

and then runnning the installer from node directory, isn’t it?

One doubt: is the fact that the Processed Accounts number resets during the catchup without a memory fatal error (which I had with the previous 4GB 32-bit node setup) still related to RAM issues?