Fast Catchup not working on testnet

I tested fast catchup on mainnet and ran into memory problems - memory utilisation appears to be proportional to the number of accounts (but I’ll come back to that in a different thread). So I thought I would try out fast catchup on the testnet.

The version is Debian 2.1.3.stable. I created a new /var/lib/algorand_testnet directory containing the genesis.json and system.json files, correctly permissioned, and started the testnet node.
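
For completeness, the set-up was roughly the following (from memory - and the genesis source path is an assumption about where the Debian package installs the per-network genesis files):

$ sudo mkdir /var/lib/algorand_testnet
$ sudo cp /var/lib/algorand/genesis/testnet/genesis.json /var/lib/algorand_testnet/
$ sudo chown -R algorand:algorand /var/lib/algorand_testnet   # assuming algod runs as the algorand user
$ goal node start -d /var/lib/algorand_testnet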

Status showed it sprang quickly into life (for some reason status -w scrolls the page on testnet but not on mainnet):

$ goal node status -d /var/lib/algorand_testnet -w 10000
Last committed block: 3924
Time since last block: 0.1s
Last committed block: 4175
Time since last block: 0.0s
Last committed block: 4501
Time since last block: 0.0s
Sync Time: 152.9s
Last consensus protocol: https://github.com/algorand/spec/tree/a26ed78ed8f834e2b9ccb6eb7d3ee9f629a6e622
Next consensus protocol: https://github.com/algorand/spec/tree/a26ed78ed8f834e2b9ccb6eb7d3ee9f629a6e622
Round for next consensus protocol: 4502
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: devnet-v1.0
Genesis hash: sC3P7e2SdbqKJK0tbiCdK9tdSpbe6XeCGKdoNzmlj0E=

I then set it for fast catchup:

$ goal node catchup 8750000#TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ -d /var/lib/algorand_testnet

For a while status showed the catchpoint:

Last committed block: 5100
Time since last block: 0.0s
Last committed block: 5153
Time since last block: 0.0s
Last committed block: 5204
Time since last block: 0.3s
Last committed block: 5217
Sync Time: 7.2s
Catchpoint: 8750000#TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ
Genesis ID: devnet-v1.0
Genesis hash: sC3P7e2SdbqKJK0tbiCdK9tdSpbe6XeCGKdoNzmlj0E=

But then the status changed back to:

Last committed block: 5234
Time since last block: 0.1s
Last committed block: 5282
Time since last block: 0.1s
Sync Time: 14.7s
Last consensus protocol: https://github.com/algorand/spec/tree/a26ed78ed8f834e2b9ccb6eb7d3ee9f629a6e622
Next consensus protocol: https://github.com/algorand/spec/tree/a26ed78ed8f834e2b9ccb6eb7d3ee9f629a6e622
Round for next consensus protocol: 5283
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: devnet-v1.0
Genesis hash: sC3P7e2SdbqKJK0tbiCdK9tdSpbe6XeCGKdoNzmlj0E=

A short while later, status was reporting the last committed block to be in the 7352000 range, and the node is now catching up from there.

Have I misunderstood something about fast catchup? Given that the catchup string was ‘8750000#TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ’, I was expecting it to jump to block 8750000, yet it is currently well over a million blocks short.

Update: my testnet node claims to be caught up. Status is showing Sync time as 0.0s and the Last committed block is increasing very slowly (currently 7421295 @ 14:44 GMT).

I am puzzled, though, as GoalSeeker states that testnet is up to block 8753681.

Am I really being very dense today?

It looks from your status like you are on the wrong network:

Genesis ID: devnet-v1.0

You need to be on testnet! Stopping the node, swapping in the correct genesis file, and then re-doing the catchup should work.
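
Something along these lines should work (a sketch - the genesis source path assumes the Debian package layout, and the ledger sub-directory name comes from the Genesis ID in your status output):

$ goal node stop -d /var/lib/algorand_testnet
$ sudo cp /var/lib/algorand/genesis/testnet/genesis.json /var/lib/algorand_testnet/genesis.json
$ sudo rm -r /var/lib/algorand_testnet/devnet-v1.0   # optional: clear out the old devnet ledger data
$ goal node start -d /var/lib/algorand_testnet
$ goal node catchup 8750000#TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ -d /var/lib/algorand_testnet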

As @Tim already suggested, the genesis file you used wasn’t the right one for this network.

The reason it took so long before it “failed” is that the catchpoints on the two networks happen to align on the same round numbers (i.e. every 10K rounds). As a result, your devnet node was downloading the catchpoint for round 8750000 from a devnet relay. Then it downloaded the corresponding block and attempted to confirm that the two were a perfect match for the catchpoint hash you’d provided (TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ).

Not surprisingly, the hashes did not match between the different networks, and the downloaded accounts and block were discarded. Following that, it would try again several times (since it could have been a malicious relay!) before giving up.
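
To make that concrete: the part of the label before the ‘#’ is only a round number, while everything after it is the hash that acts as the trust anchor. A quick shell illustration (the variable name is mine):

$ LABEL=8750000#TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ
$ echo ${LABEL%%#*}   # the round number - devnet happened to have a catchpoint here too
8750000
$ echo ${LABEL##*#}   # the hash - this is the part that failed to verify
TPIUII4VX4B2OCYODFEUC6UTLLUHFV3K4RCRTUZCBPJ275KVMIWQ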

Thanks @Tim & @tsachi for pointing out my error. I’m not sure how I ended up with the wrong genesis.json (I’ll put it down to hurrying at the wrong time of day).

The fast catch-up process is working as expected (on testnet) now that I have the correct genesis.json.

What is the potential risk (of node compromise or corruption) of using fast catchup if the contents of https://algorand-catchpoints.s3.us-east-2.amazonaws.com/channel/XXX/latest.catchpoint were to be maliciously modified by a bad actor? Is there a possibility of an injection attack being launched via this vector?
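
I ask because it would be natural for an operator to feed that file straight into goal, along the lines of this sketch (where ‘testnet’ as the channel name is my assumption):

$ goal node catchup $(curl -s https://algorand-catchpoints.s3.us-east-2.amazonaws.com/channel/testnet/latest.catchpoint) -d /var/lib/algorand_testnet

at which point whatever hash happens to be in that file becomes the node’s new trust anchor.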

I know I’ve mentioned memory utilisation elsewhere, but I’ll mention it again here as I’ve been observing it while testing fast catchup. When my testnet node is processing testnet’s current 257967 catchpoint accounts, algod consumes approx. 2 GB of memory. Once it has caught up, it holds on to the memory it grabbed. If I stop and start the testnet node, algod consumes approx. 1 GB of memory and never any more - even during a normal catchup period and steady-state running.
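
For what it’s worth, these figures come from sampling algod’s resident set size along these lines (assuming a single algod process on the host):

$ watch -n 60 'ps -o rss=,comm= -C algod'   # RSS in kB, sampled every minute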

If I then instruct the testnet node to use fast catchup (even though it is already fully caught up), algod’s memory utilisation again rises to approx. 2 GB, and it does not free the memory until the node is stopped and started again. Is this by design, or could a future release of algod free up the memory it grabbed?

@rmb,

The reason the catchpoint is not part of the protocol and was implemented “externally” is exactly the “bad actor” issue. But don’t get me wrong here - it’s not something that can be easily manipulated. There are quite a lot of details that go into it, but they all boil down to the catchpoint hash.

When using the existing catchup, your trust anchor is the genesis block. Anything that comes after it needs to be cryptographically (and recursively) verified.

Using a catchpoint breaks that model and requires you to establish a new trust anchor. The hash provided in the catchpoint label is that new trust anchor. If you’re serious about your node’s security, you should run a node of your own (it could be a non-archival node, as long as it has "TrackingCatchpoint:1" in the config file) to work out these catchpoint labels for yourself.

For bootstrapping purposes, we use the above technique to generate the catchpoint labels. But if you’re going to use catchpoints over and over, you could replicate that mechanism yourself.
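
As a minimal sketch of that set-up, using the setting name as quoted above (worth double-checking the exact key against the config documentation for your release):

$ cat /var/lib/algorand_testnet/config.json
{
    "TrackingCatchpoint": 1
}
$ goal node restart -d /var/lib/algorand_testnet
$ goal node status -d /var/lib/algorand_testnet | grep "Last Catchpoint"

You can then compare the label your own node reports against any label you are about to feed into goal node catchup.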

Hi @tsachi,

As always, your answers are really helpful and I’ll look at the config you mentioned. I was wondering, however, why the consensus mechanisms were not used to write the catchpoint information into the chain.

If the hash of any block committed to the chain contains the essence of the genesis block (by way of a Merkle tree), then a new node that begins building the chain from the newest block back towards the genesis block could use the following mechanisms to be reasonably sure of the integrity of what it is building:

  • for each block requested and received from its peers, as long as more than two-thirds of the responses are identical, the block may be tentatively assumed valid

  • as each older block (n) is received, the hash of the previously received block n+1 is checked and validated

  • when a block containing a checkpoint is received (and perhaps validated by pulling a further x preceding blocks) then the fast catchup mechanism can allow a non-archival node to catch up without continuing to pull preceding blocks all the way back to the genesis.

There are nuances to consider in the above. For example:

  • once fast catchup is complete, the node might begin submitting transactions but would consider itself compromised if the consensus mechanism rejected its attempts to transact (e.g. due to inconsistent transactions - I’m pretty sure this is already part of the protocol)

  • the node could continue to validate back towards the genesis block in the background and only attempt to become part of the consensus when it had validated all the way (this may be a step too far but would add a further integrity check into network participation)

Just a thought.

I think that you’re missing one thing in the above model -
The fact that we received a block (whether it contains a catchpoint or not) is not enough for us to trust it.
The block can only be trusted if any of the following holds true:

  1. A previous block is known and trusted, and we can verify the committee signatures moving the trust from block X to block X+1. This is the “normal” moving forward scheme.
  2. A future block is known and trusted. If we trust a future block, we can use the previous block hash field in block X to verify block X-1. (This is used when the catchpoint loads the 1000 blocks, in reverse.)
  3. We have some “other” means of trusting block X. This is what the catchpoint model uses. In essence, the hash in the catchpoint label is “constructed” from both the account data and the block data (plus some other components).

Your idea of starting a functioning node after loading 1000 blocks is interesting, although not very beneficial for the network. From the network’s perspective, if the node is sending valid proposals/votes/transactions, then it’s a “good” node. It could be a good exercise to verify the entire chain in reverse, but from a network-operation standpoint, quite an expensive one.

@tsachi, I agree with your points - but I believe there is merit in using a more decentralised way to bootstrap the network and permit (fast) catchup.

Trust starts with the genesis block. I have to be sure that I have the right genesis block to begin with. I am currently trusting the Algorand installation repository for the mainnet genesis block (and a combination of the installation repository and my own fallible actions for testnet, devnet, etc.).

I might be more sure that I’ve got the right mainnet genesis block if my node could query a random set of nodes purporting to be running the Algorand network. My implementation might then be able to gain trust through consensus: if more than two-thirds of the nodes it interrogates agree on the genesis block, it would trust this subset and begin operating.

If the hash of the most recent block effectively contains the essence of the genesis block (by way of a Merkle tree through all intervening blocks), my node would have a similar level of trust in this most recent block as it had in the genesis block, provided the subset of nodes it queries also agrees on this block (including its hash and the hash of its predecessor). My node might choose to augment the set of nodes it originally queried - or even choose a completely different subset of nodes to pull both the most recent block and the genesis block again.

Once this initial chain has been validated (potentially by the new node validating backwards until it has reached a checkpoint block or two), the new node can join the network with reasonable confidence in its accuracy (unless and until the network flags it as bad by rejecting its submitted transactions). It could continue to validate the chain back towards the genesis block as a low-impact background process (low impact on both itself and the overall network). Once it has validated all the way back to the genesis block, it will have 100% trust in the integrity of the chain on which it is operating.

Turning to other networks (e.g. testnet, devnet, etc.), if the genesis blocks of these networks were stored on the first network the node validates, it would already trust the genesis blocks of these spawned networks, by virtue of the trust it has in the first network. Perhaps the first network it validates ought not to be mainnet (due to its current size and use) but rather a new network that doesn’t yet exist (let’s call it genesisnet, for want of a better term). If genesisnet were reserved for aspects related to bootstrapping, it would not grow quickly, it would not consume many resources to validate (either tail-to-genesis or via the more conventional genesis-to-tail approach), and it would create a distributed bootstrap mechanism for all other networks (either public networks or, perhaps one day, permissioned networks).

The governance structure around genesisnet would require careful consideration, but its integrity could be linked to mainnet (and its financial incentives to play nice) by cross-linking transactions. In this way, a genesisnet could be spawned even after mainnet exists, and genesisnet could become the trusted bootstrap mechanism after a few rounds of network protocol upgrades.

This, in itself, would not remove the need for DNS bootstrap configurations as there needs to be a way of finding at least five operational and trustworthy nodes that can share their lists of known and trusted peers (which, in turn, can share their lists of known and trusted peers etc.) in order for a new node to begin its bootstrap process.

(Perhaps DNS, as we know it, will evolve from using centralised root authorities to use DLTs - might Algorand be part of that DNS (r)evolution? Perhaps something akin to IRDP will be adopted to allow endpoints to find and join DLTs. It might even be called IDDP - the sustainability images conjured up by “Iceland Deep Drilling Project” might be quite apt!)