Our node was seeing memory issues. We migrated to a new machine, installed 2.1.3, and tried Fast Catchup. The node is non-archival. The process consumes large amounts of memory (over 7 GB on our machine) and fails after downloading the catchpoint blocks.
We don’t see any obvious error when searching through the logs, and the node seems to fall back gracefully into regular sync (very obvious from the different CPU and memory profile). We’ve already re-imaged the machine and started the process four different times, increasing resources each time.
You can rest assured that the log file does contain information about why the fast catchup aborted. Try grepping for the word “catchpoint”. If it’s due to too many failures while downloading blocks, you can increase the maximum number of failed attempts in the config.json file.
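For example, something like this against node.log in your data directory should surface the relevant messages:

$ grep -i catchpoint node.log | tail -n 20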
We were filtering on “error” or “fatal” levels. These should be the messages from the last 3 attempts:
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : http: unexpected EOF reading trailer","name":"","time":"2020-08-19T20:35:59.053312Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-08-19T21:22:01.959916Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":293,"msg":"unable to download ledger : No ledger available for given round","name":"","time":"2020-08-19T21:25:41.817713Z"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-08-19T22:18:17.268793Z"}
If we were to try again, what settings would be advisable beyond the defaults? The default number of retries seems high already. And lastly, do we remove the data directory and start clean or just change the config and restart the node with the updated configuration before we trigger the catchup?
According to your previous status, it looks like you were able to download more than 600 out of 1000 of the blocks. Why don’t you quadruple the number of failure attempts and try again?
No need to delete the data directory.
The high number of errors is because not all the relays have completed the upgrade. It will get better in a few days.
Try to set CatchupBlockDownloadRetryAttempts to 4000 in the config.json and see if that helps.
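For reference, a minimal config.json carrying just that override would look like the following (merge it with any settings you already have in your data directory), followed by a node restart:

{
    "CatchupBlockDownloadRetryAttempts": 4000
}

$ goal node restart -d <your data directory>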
Thank you @tsachi, that seems to have gotten through, and the node is now working on the normal sync from the catchpoint to the current block. For future reference, Fast Catchup took around 33 minutes and algod memory utilization touched 7 GB for about three quarters of that time.
Then I resumed the process, but it started again from the beginning (losing the accounts already processed) and failed again at the same processed-accounts number:
$ goal node status
Last committed block: 140
Sync Time: 1197.3s
Catchpoint: 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ
Catchpoint total accounts: 4313466
Catchpoint accounts processed: 1113600
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
Cannot contact Algorand node: Get http://127.0.0.1:8080/v2/status: dial tcp 127.0.0.1:8080: connect: connection refused.
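For reference, resuming fast catchup after a restart reuses the same catchpoint label shown above (the data directory path here is an assumption):

$ goal node catchup 8560000#JXTHWW5BRW7MDSPVHITBLYORCNUDTHIXZRIYOZYKHDHGU2WUXLZQ -d ~/node/data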
Hey @cusma. That is exactly what we were seeing. In our case, algod kept requesting memory until at some point it failed and the process(es) crashed, making the REST endpoint unavailable. Definitely follow @tsachi’s advice and look into your logs to track down the issue.
Followed what @tsachi suggested: after setting CatchupBlockDownloadRetryAttempts to 4000 in config.json, the process passed the previous 1113600 processed accounts and got stuck at 1244672:
Fast catchup reset after node restart:
That’s true. Once started, it will retry up to the configured number of attempts before giving up. On each attempt, it starts from the last “step”, so if it was downloading the accounts, it will reset those and try again.
Memory consumption during the catchup process is higher than desirable. I’ll try to reproduce it locally. It was never an issue for me, but the number of accounts is growing rapidly, so the tests I conducted might have become obsolete.
Would it be possible to split the catchup process into smaller steps by providing a set of catchpoints, each with lower individual requirements than the latest one?
My node is also failing to complete Fast Catchup on MainNet.
After 6 hours it didn’t seem to be getting any further. The symptoms visible in the status were that the sync time continued to climb, and the “Catchpoint accounts processed” count climbed for a while, then started back at the beginning, climbing a bit higher the next time. After doing this several times it seemed unable to get past 1113600 processed catchpoint accounts.
You’re correct about that setting.
If you add:
"CatchupBlockDownloadRetryAttempts": 5000
to your config.json (possibly higher; the default is 1000), then fast catchup should finish. It will retry the failed blocks against other relays and succeed.
The problem is that not all MainNet relay nodes have been upgraded, so when it tries to fetch blocks from some of the relays it will fail. There’s a default limit on block download failures, and when it hits that limit it aborts the whole thing.
I will certainly try increasing the retry attempts, but I’m also going to need to reconfigure my old test laptop, which currently has 4 GB RAM and 3 GB swap. It was thrashing to the point of timeout, and if it really needs 7 GB of virtual memory then it will still not work until there is a way of limiting the amount of memory algod can consume.
I would need to perform some more tests, but it seems that the amount of memory required by algod is proportional to the number of accounts (presumably loaded into a memory-resident DB). Memory utilisation appears to be significantly inflated by the fast catchup process.
Hi @tsachi, I’ve recently upgraded my node’s hardware to a Raspberry Pi 4 with 8 GB RAM; this time I installed the 64-bit Raspberry Pi OS (so each process is no longer limited to 3 GB RAM).
I’m still not able to fast catch up on MainNet.
With this new setup the memory overload errors seem to have disappeared, but during Fast Catchup the Processed Accounts number suddenly restarts from the beginning.
I’ve already set CatchupBlockDownloadRetryAttempts to 8000 in the config.json as you suggested.
Let me know if my node.log could be useful and where I can send it.
@cusma, I did not know an 8 GB Raspberry Pi exists… that makes me curious: does the armv6 code (which we compile for all the Raspberry Pis) support more than 4 GB? I thought it was limited to 4 GB.
Regarding the memory consumption issue, I can confirm that it does seem to be correlated with the large number of accounts on MainNet. The recent BetaNet release (2.1.4) did not provide a fix for this issue, unfortunately (BetaNet Update: Version 2.1.4).
To get access to the whole address space, @cusma would have to build go-algorand targeting armv8 (arm64), not just install a 64-bit OS. The armv6 and armv7 ISAs are 32-bit only.
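A rough sketch of a native arm64 build on the Pi itself, assuming a 64-bit Go toolchain is installed (configure_dev.sh is the repository’s dependency bootstrap script):

$ go env GOARCH    # should print arm64 under a 64-bit OS and toolchain
$ git clone https://github.com/algorand/go-algorand
$ cd go-algorand
$ ./scripts/configure_dev.sh
$ make install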
@tsachi the Raspberry Pi 4 8 GB was released in May by the Raspberry Pi Foundation; the PCB is slightly different from the older versions due to thermal management improvements.
@jimmy as always, I just followed this node installation guide for Linux distros on Raspberry Pi OS 64-bit (still in beta; the old Raspbian was recently deprecated by the Raspberry Pi Foundation in favor of Raspberry Pi OS). Does this procedure automatically install the armv6 build?
So, help me understand whether there is any action I can take to build the correct version of go-algorand in order to leverage all 8 GB of RAM without limitations.
and then running the installer from the node directory, right?
Doubt: is the fact that the Processed Accounts number resets during the catchup without a fatal memory error (which I had with the previous 4 GB 32-bit node setup) still related to RAM issues?