I’m following the guides to set up a node on mainnet, but the catchup operation does not complete. It times out after 45 minutes, approximately two-thirds of the way through.
I’ve also tried the nightly build; it is faster, but it still times out at 45 minutes.
Right before it restarts, the node status shows the accounts-processed counter rolling back to 0, and the node enters an infinite loop eating up CPU and bandwidth.
My computer is an older Intel i7 with 16 GB of RAM, and the network connection is broadband.
Is there some setting in the config.json to increase this time? How can I let the catchup complete?
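For reference, this is roughly how I’m starting the catchup and checking progress, per the guides. The catchpoint label and data directory below are placeholders; I’m using a label from the published catchpoint list:

# start fast catchup using a published catchpoint label (placeholder shown)
goal node catchup <ROUND>#<HASH> -d ~/node/data
# poll catchup progress (shows accounts/blocks processed)
goal node status -d ~/node/data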
Yes, I tried that, but I’m not getting download errors. I’m seeing the following message:
…"level":"warning","line":344,"msg":"dbatomic: tx surpassed expected deadline by 6.730578955s","name":"","readonly":false,"time":"2020-11-08T15:18:14.046234-08:00"}
Also, I didn’t mention it in the original post, but I’m on Ubuntu 18.04.
From grepping the code I see the following line:
maxCatchpointFileDownloadDuration = 45 * time.Minute
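For anyone who wants to reproduce the search, something like this from the repository root finds it:

grep -rn "maxCatchpointFileDownloadDuration" --include="*.go" .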
The dbatomic warning message is most likely benign; it would not prevent you from completing the catchup. However, it might indicate that you have a slower-than-desired storage device. Could you confirm what type of storage device you have (HDD, SSD, NVMe, other)?
If you keep hitting the 45-minute limit, that would need to be addressed on the server side (i.e., the relays). At this time it’s a hard-coded value, so only a new release would be able to change it.
In the meantime, I’d suggest trying the “trivial” things: make sure there are no competing processes while you’re running the node, ensure all of the machine’s memory is available to the OS, etc.
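A few generic Linux checks along those lines (nothing algod-specific; adjust device names as needed):

# rotational flag: 1 usually means HDD, 0 means SSD/NVMe
lsblk -d -o NAME,ROTA
# confirm memory headroom and that swap isn't being hit
free -h
# look for competing processes while the node is running
top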
The data is being written to an SSD, and no other processes were running. Memory usage is steady, and it doesn’t swap out of RAM. Do I need to open any ports in my firewall to better talk to the servers or peers? (I’m not clear on that part of the protocol.)
Is there any mid-point catchup data I could use to catch up in multiple steps? Processing is fast at first, up to ~2M accounts, and then it slows down.
One other point I noticed: the default version of Go that Ubuntu 18.04 installs with ‘apt install’ is too old to compile algorand. This is easily fixed by using the ‘snap’ package. Unfortunately, the snap-installed Go doesn’t have the access rights to run some of the ‘sh’ scripts, which must then be run manually. Maybe the Go version check, or a snap permissions check, could be part of the Makefile process?
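For reference, the workaround was essentially the following (the snap channel defaults to a current Go release; the repo’s go.mod states the required major version):

# apt's Go on Ubuntu 18.04 is too old; install via snap instead
sudo snap install go --classic
# check the Go major version the repo requires
grep '^go ' go.mod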
Thanks for the storage-device update. No, there is no need to open any ports; if algod was able to start the download process, then all the communication requirements have already been met.
For security reasons, there is no mid-point catchpoint data. The node downloads the data and calculates the hash in a streaming fashion, and when it’s done, it compares that hash against the one you provided.
It’s expected that after a while (i.e., ~2M accounts) it becomes slower. The latest release should make it slightly better.
I believe you’re correct that the Go version that comes with Ubuntu 18.04 can’t build the latest version of algorand. The required major version can be found in go.mod, and the build determines it using scripts/get_golang_version.sh. I opened an issue for this: https://github.com/algorand/go-algorand/issues/1690. You can track that issue to see when/if it gets handled.
I would suggest trying the catchup with the “official” build, just to make sure your binary was generated by the same pipeline as everyone else’s.
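For the official build, the standard updater flow is roughly the following; this is a sketch, so double-check the current install docs, and the paths are just examples:

mkdir ~/node && cd ~/node
wget https://raw.githubusercontent.com/algorand/go-algorand/rel/stable/cmd/updater/update.sh
chmod 544 update.sh
# -i install, -c channel, -p bin dir, -d data dir, -n don't auto-start
./update.sh -i -c stable -p ~/node -d ~/node/data -n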
I’ve been wrestling with this problem myself. I tried three different machines, and none could complete fast catchup before restarting at the 45-minute mark. I don’t know whether there are any config params that can be tuned; I tried adjusting them but didn’t notice much of a difference.
Storage seemed to be a bottleneck, since I was seeing a lot of IO waits, so I created a VM with 8 GB of RAM and used half of it as a tmpfs for /var/lib/algorand. With this configuration I see no disk IO, less than 50% of the RAM allocated to the VM in use, and around 180 kbps of network usage. I tried 2, 4, and 16 cores, and it didn’t seem to scale beyond 4; with 4 cores it sits around 45% CPU usage. I can get to about 4.8M accounts processed before the whole set restarts.
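For anyone wanting to replicate the tmpfs setup, it was essentially the following. The size is what fit my 8 GB VM; note that tmpfs contents are lost on reboot, and the data directory has to be re-seeded (genesis.json, config.json) after mounting:

# stop the node, then mount RAM-backed storage over the data dir
sudo systemctl stop algorand
sudo mount -t tmpfs -o size=4G tmpfs /var/lib/algorand
# restore genesis.json/config.json into the now-empty directory, fix ownership
sudo chown -R algorand:algorand /var/lib/algorand
sudo systemctl start algorand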
Some supplemental information: I built algod with maxCatchpointFileDownloadDuration set to a much higher value, and I do get further. It runs longer than 45 minutes, but it still fails with “unable to download ledger : http: unexpected EOF reading trailer” and restarts.
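Mechanically, the change was just raising the hard-coded constant and rebuilding; the value below is only an example, not necessarily what I used:

# in the Go source, raise the limit, e.g.:
#   maxCatchpointFileDownloadDuration = 45 * time.Minute
# becomes something like:
#   maxCatchpointFileDownloadDuration = 3 * time.Hour
# then rebuild and reinstall the binaries
make install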
Sounds like some good experiments. I infer from @tsachi’s message that the 45-minute timeout is also enforced by the other peers/relays in the network. Here is what I’m doing, and it’s working so far.
I’m doing the catchup to an older catchpoint, at ~8.5 million blocks, while the current block is 10.1 million.
From there, I’m letting it sync normally with the other peers. By my calculations this will take about 16 hours and at least 32 GB of data transferred. If it’s done in 16 hours, that works for me; I’ve spent much longer than that trying to get the fast catchup to work.
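Rough arithmetic behind that estimate, assuming a steady sync rate:

# (10.1M - 8.5M) = 1.6M blocks left to sync
# 1.6M blocks / 16 h ≈ 28 blocks per second
# 32 GB / 1.6M blocks ≈ 20 KB per block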
-WM
Update: This process worked for me and it took less than 16 hours.
Thanks for the details you’ve provided. It’s really helpful to get constructive feedback like yours.
The idea of running algod with in-memory storage was a really creative one ;-).
Could you both please confirm that you were running the 2.2.0 (or equivalent) codebase?
I’m asking because 2.2.0 included a few new catchpoint catchup optimizations, and it would be great to get confirmation of that.
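Either of these will print the running version:

goal version -v
algod -v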