I’m following the guides to set up a node on mainnet, but the catchup operation does not complete. It times out after 45 minutes, approximately two-thirds of the way through.
I’ve also tried the nightly build; it is faster, but it still times out at 45 minutes.
Right before it restarts, the node status shows the accounts-processed counter rolling back to 0, and the node enters an infinite loop eating up CPU and bandwidth.
My computer is an older Intel i7 with 16 GB of RAM, and the network connection is broadband.
Is there some setting in the config.json to increase this time? How can I let the catchup complete?
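For reference, this is roughly how I’m starting the catchup and checking progress, per the guides. The catchpoint label and data directory below are placeholders; I’m using a label from the published catchpoint list:

# start fast catchup using a published catchpoint label (placeholder shown)
goal node catchup <ROUND>#<HASH> -d ~/node/data
# poll catchup progress (shows accounts/blocks processed)
goal node status -d ~/node/data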
Yes, I tried that, but I’m not getting download errors. I’m seeing the following message:
…"level":"warning","line":344,"msg":"dbatomic: tx surpassed expected deadline by 6.730578955s","name":"","readonly":false,"time":"2020-11-08T15:18:14.046234-08:00"}
Also, I didn’t mention it in the original post, but I’m on Ubuntu 18.04.
From grepping the code I see the following line:
maxCatchpointFileDownloadDuration = 45 * time.Minute
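For anyone who wants to reproduce the search, something like this from the repository root finds it:

grep -rn "maxCatchpointFileDownloadDuration" --include="*.go" .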
The dbatomic warning message is most likely benign; it would not prevent you from completing the catchup. However, it might indicate that you have a slower-than-desired storage device. Could you confirm what type of storage device you have (HDD, SSD, NVMe, other)?
If you keep hitting the 45-minute limit, that would need to be addressed on the server side (i.e., the relays). At this time it’s a hard-coded value, so only a new release would be able to change it.
In the meantime, I’d suggest trying the “trivial” things: make sure there are no competing processes while you’re running the node, ensure all of the machine’s memory is available to the OS, etc.
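A few generic Linux checks along those lines (nothing algod-specific; adjust device names as needed):

# rotational flag: 1 usually means HDD, 0 means SSD/NVMe
lsblk -d -o NAME,ROTA
# confirm memory headroom and that swap isn't being hit
free -h
# look for competing processes while the node is running
top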
The data is being written to an SSD, and no other processes were running. Memory usage is steady, and it doesn’t swap out of RAM. Do I need to open any ports in my firewall to better talk to the servers or peers? (I’m not clear on that part of the protocol.)
Is there any mid-point catchup data I could use to catch up in multiple steps? Processing is fast at first, up to ~2M accounts, and then it slows down.
One other point I noticed: the default version of Go that Ubuntu 18.04 installs with ‘apt install’ is too old to compile algorand. This is easily fixed by using the ‘snap’ package. Unfortunately, the snap-installed Go doesn’t have the access rights to run some of the ‘sh’ scripts, which must then be run manually. Maybe the Go version check, or a snap permissions check, could be part of the Makefile process?
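For reference, the workaround was essentially the following (the snap channel defaults to a current Go release; the repo’s go.mod states the required major version):

# apt's Go on Ubuntu 18.04 is too old; install via snap instead
sudo snap install go --classic
# check the Go major version the repo requires
grep '^go ' go.mod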
Thanks for the storage-device update. No, there is no need to open any ports; if algod was able to start the download process, then all the communication requirements have already been met.
For security reasons, there is no mid-point catchpoint data. The node downloads the data and calculates the hash in a streaming fashion, and when it’s done, it compares that hash against the one you provided.
It’s expected that after a while (i.e., ~2M accounts) it becomes slower. The latest release should make it slightly better.
I believe you’re correct that the Go version that comes with Ubuntu 18.04 can’t build the latest version of algorand. The required major version can be found in go.mod, and the build determines it using scripts/get_golang_version.sh. I opened an issue for this: https://github.com/algorand/go-algorand/issues/1690. You can track that issue to see when/if it gets handled.
I would suggest trying the catchup with the “official” build, just to make sure your binary was generated by the same pipeline as everyone else’s.
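For the official build, the standard updater flow is roughly the following; this is a sketch, so double-check the current install docs, and the paths are just examples:

mkdir ~/node && cd ~/node
wget https://raw.githubusercontent.com/algorand/go-algorand/rel/stable/cmd/updater/update.sh
chmod 544 update.sh
# -i install, -c channel, -p bin dir, -d data dir, -n don't auto-start
./update.sh -i -c stable -p ~/node -d ~/node/data -n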
I’ve been wrestling with this problem myself. I tried three different machines, and none could complete fast catchup before restarting at the 45-minute mark. I don’t know whether there are any config params that can be tuned; I tried adjusting them but didn’t notice much of a difference.
Storage seemed to be a bottleneck, since I was seeing a lot of IO waits, so I created a VM with 8 GB of RAM and used half of it as a tmpfs for /var/lib/algorand. With this configuration I see no disk IO, less than 50% of the RAM allocated to the VM in use, and around 180 kbps of network usage. I tried 2, 4, and 16 cores, and it didn’t seem to scale beyond 4; with 4 cores it sits around 45% CPU usage. I can get to about 4.8M accounts processed before the whole set restarts.
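For anyone wanting to replicate the tmpfs setup, it was essentially the following. The size is what fit my 8 GB VM; note that tmpfs contents are lost on reboot, and the data directory has to be re-seeded (genesis.json, config.json) after mounting:

# stop the node, then mount RAM-backed storage over the data dir
sudo systemctl stop algorand
sudo mount -t tmpfs -o size=4G tmpfs /var/lib/algorand
# restore genesis.json/config.json into the now-empty directory, fix ownership
sudo chown -R algorand:algorand /var/lib/algorand
sudo systemctl start algorand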
Some supplemental information: I built algod with maxCatchpointFileDownloadDuration set to a much higher value, and I do get further. It runs longer than 45 minutes, but it still fails with “unable to download ledger : http: unexpected EOF reading trailer” and restarts.
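Mechanically, the change was just raising the hard-coded constant and rebuilding; the value below is only an example, not necessarily what I used:

# in the Go source, raise the limit, e.g.:
#   maxCatchpointFileDownloadDuration = 45 * time.Minute
# becomes something like:
#   maxCatchpointFileDownloadDuration = 3 * time.Hour
# then rebuild and reinstall the binaries
make install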
Sounds like some good experiments. I infer from @tsachi’s message that the 45-minute timeout is also enforced by the other peers/relays in the network. Here is what I’m doing, and it’s working so far.
I’m doing the catchup to an older catchpoint, at ~8.5 million blocks, while the current block is 10.1 million.
From there, I’m letting it sync normally with the other peers. By my calculations this will take about 16 hours and at least 32 GB of data transferred. If it’s done in 16 hours, that works for me; I’ve spent much longer than that trying to get the fast catchup to work.
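Rough arithmetic behind that estimate, assuming a steady sync rate:

# (10.1M - 8.5M) = 1.6M blocks left to sync
# 1.6M blocks / 16 h ≈ 28 blocks per second
# 32 GB / 1.6M blocks ≈ 20 KB per block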
-WM
Update: This process worked for me and it took less than 16 hours.
Thanks for the details you’ve provided. It’s really helpful to get constructive feedback like yours.
The idea of running algod with in-memory storage was a really creative one ;-).
Could you both please confirm that you were running the 2.2.0 (or equivalent) codebase?
I’m asking because 2.2.0 included a few new catchpoint catchup optimizations, and it would be great to get confirmation of that.
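Either of these will print the running version:

goal version -v
algod -v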