Fast Catchup not working for our node on MainNet

I need to specify the storage path myself, and with the rpm install ALGORAND_DATA is ignored even when it is set, so setting it is useless to me.
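For reference, with the non-rpm (wget/updater) install the data directory is normally selected per command, either through the environment variable or the -d flag; the path below is only a placeholder:

$ export ALGORAND_DATA=/path/to/data
$ goal node status
$ goal node status -d /path/to/data   # equivalent, without relying on the env var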

No, it should not matter whether you used the wget method or the rpm. I mentioned the above explicitly just to make sure someone else would be able to reproduce it.

Regarding the “Catchpoint total blocks” and “Catchpoint downloaded blocks” that you seem to be surprised to see: these appear once the following three steps are complete (see the example commands after this list):

  1. all the accounts were downloaded and processed
  2. the first block (the one that matches round 9890000 in my example) was downloaded
  3. the hash of the catchpoint matches the accounts and the first block, combined.
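For anyone following along, fast catchup is kicked off by handing goal the full catchpoint label; a minimal sketch using the round from my example, where the hash portion is just a placeholder for the real MainNet label and /var/lib/algorand is assumed as the data directory:

$ goal node catchup 9890000#<CATCHPOINT_HASH> -d /var/lib/algorand
$ goal node status -d /var/lib/algorand -w 300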

You mentioned that you’re using a SATA hard drive. Is that an SSD or a magnetic drive?
The reason I’m asking is that a large portion of the time is spent writing these 5.6M accounts to disk. Using a slower disk could extend that even further.

This is also why I was asking about the duration. I don’t know whether the timeout you’re seeing is client side or server side. Having the elapsed duration could help me determine that.

The SATA drive is a mechanical hard disk, which is slower than an SSD, but I don’t think this is a disk problem: I have waited long enough (two days), all the accounts get processed repeatedly, and there is no error in the log. I will post part of node.log.

I’ll try to get a system with a mechanical hard disk to repeat the experiment. I don’t know if this is the issue or not.

I suggest you also try installing it with wget; I’m not sure whether the installation method is the cause.

I’m almost sure the problem is the installation method. I installed with rpm and then ran fast catchup, and all the indicators are correct.

Fast catchup with the rpm installation has succeeded, so the wget installation does indeed seem to have a problem with it. I hope its full node synchronization is okay; I have been syncing the full node for 4 days, and if something is wrong with it I’m going to lose my mind.


If this helps: in my case I also used the wget method for the installation (and I had fast catch-up issues). I haven’t yet tried the ARM installation directly to check whether fast catch-up works that way.

There is no problem with fast catchup under the rpm installation. In the meantime I also completed full node synchronization with the wget install; it took a week. The indexer’s synchronization was slow and took more than a week.

@cusma and @shixiaofeia - the recent release (2.1.5) includes a few changes that should improve the performance of fast catchup, both in terms of memory consumption and the time it takes to catch up.

I have documented the issue here: https://github.com/algorand/go-algorand/issues/1558

@shixiaofeia and others, can you confirm that this is the error message that appears in the logs after “downloaded blocks” reaches 1000?

{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-09-26T12:41:26.174291+02:00"}

@Thireus, thanks for the detailed issue you’ve created. The ticket is very well written, and I really appreciate that.

The issue, I believe, is that your client is receiving more than 1000 errors while downloading the blocks. This is not expected; there should be no failures.

The retry mechanism was designed to allow a certain number of relays to be unresponsive while still letting fast catchup work. It seems that in your case, that retry threshold doesn’t align with what is “visible” from your endpoint. (Note that I’m not implying that you and/or your setup is wrong or incorrect.)

While the source of the issue still needs to be addressed (i.e. verifying that all the relays are serving the blocks correctly, without generating any errors), my suggestion would be to increase the retry count by adjusting the corresponding entry in the config.json file. That should allow you to complete the catchup process.
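For illustration, a minimal config.json override might look like the following, assuming the default /var/lib/algorand data directory and keeping whatever other settings you already have in that file; restart the node afterwards so it picks up the change (the value shown is only an example):

$ cat /var/lib/algorand/config.json
{
    "CatchupBlockDownloadRetryAttempts": 100000
}
$ goal node restart -d /var/lib/algorand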

Please let us know if it resolved the issue for you.
(Note that I’m answering here since I’d like this information to be available for everyone.)

Hi @tsachi,

I appreciate the quick and detailed answer. It worked after I bumped the default “CatchupBlockDownloadRetryAttempts” from 1000 to 100000!

Last few seconds of fast catchup:

$ goal node status -d /var/lib/algorand -w 300
Last committed block: 4147581
Sync Time: 1386.0s
Catchpoint: 9310000#KZLT572KA72JS5GJR4A6FZQLOI6O72IHJQ6V2I5ZS2VHO7BUJYVQ
Catchpoint total accounts: 5896700
Catchpoint accounts processed: 5896700
Catchpoint total blocks: 1000
Catchpoint downloaded blocks: 917
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Fast catchup completed:

$ goal node status -d /var/lib/algorand -w 300
Last committed block: 9310070
Time since last block: 0.5s
Sync Time: 4.3s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/3a83c4c743f8b17adfd73944b4319c25722a6782
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/3a83c4c743f8b17adfd73944b4319c25722a6782
Round for next consensus protocol: 9310071
Next consensus protocol supported: true
Last Catchpoint: 
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

I was also able to identify which nodes often return error 400:

r-rh.algorand-mainnet.network:4160 --> failure rate is ~87%
r-ti.algorand-mainnet.network:4160 --> failure rate is ~82%

Evidence:

cat /var/lib/algorand/node.log | grep -E 'algorand-mainnet.network:4160|http block fetcher response status code' | grep -B1 'http block fetcher response status code : 400' > /tmp/algofail.txt

cat /tmp/algofail.txt | grep -o '[a-z-]*.algorand-mainnet.network:4160' | sort | uniq -c
      1 r-ac.algorand-mainnet.network:4160
      1 r-ba.algorand-mainnet.network:4160
      1 r-bk.algorand-mainnet.network:4160
      1 r-cr.algorand-mainnet.network:4160
      1 r-eu.algorand-mainnet.network:4160
      1 r-gd.algorand-mainnet.network:4160
      1 r-hf.algorand-mainnet.network:4160
     26 r-rh.algorand-mainnet.network:4160
     18 r-ti.algorand-mainnet.network:4160
      1 r-zr.algorand-mainnet.network:4160

cat /var/lib/algorand/node.log | grep -E 'r-ti.algorand-mainnet.network:4160|r-rh.algorand-mainnet.network:4160' | grep 'networkFetcher.FetchBlock: asking client' | grep -o '[a-z-]*.algorand-mainnet.network:4160' | sort | uniq -c
     30 r-rh.algorand-mainnet.network:4160
     22 r-ti.algorand-mainnet.network:4160

Is there any chance whoever owns those two instances or load balancers can have a look at them?

Additional errors (note that there are other nodes that give similar messages):

{"file":"wsNetwork.go","function":"github.com/algorand/go-algorand/network.(*WebsocketNetwork).tryConnect","level":"warning","line":1842,"msg":"ws connect(ws://r-ti.algorand-mainnet.network:4160/v1/mainnet-v1.0/gossip) fail - bad handshake, precondition failed : 'mismatching genesis ID'","name":"","time":"2020-09-26T23:28:57.224613+02:00"}
{"file":"wsNetwork.go","function":"github.com/algorand/go-algorand/network.(*WebsocketNetwork).tryConnect","level":"warning","line":1842,"msg":"ws connect(ws://r-ti.algorand-mainnet.network:4160/v1/mainnet-v1.0/gossip) fail - bad handshake, precondition failed : 'mismatching genesis ID'","name":"","time":"2020-09-27T00:23:57.219086+02:00"}

Thank you.

We adjusted relay records to remove some that were showing errors at the end of the month, so I would be interested to hear if this is still an issue for people.