Hi there,
what are the factors that influence the synchronization time of a node with the Mainnet?
Cusma, I was talking to the guys on Telegram, and it looks like it takes about 3 days to sync now. It mostly depends on your internet speed and how many active connections you have to other nodes.
I’m currently synchronising a node that has a 10-year-old AMD Athlon 64 X2 QL-54 processor, and it’s being throttled by the CPU and not the network connectivity. Whilst both cores are running close to 100% utilisation, it is averaging less than 0.3 MiB/s of a 60 Mbps internet connection.
The question @cusma asks raises an interesting point that will become more of an issue in the future…
At the time of this post, the mainnet currently consists of a bit over 8 million blocks. As the blockchain continues to grow, how long will it take to synchronise a new node in 1-year, 5-year, 10-year’s time - and how much data will traverse the network just to get a new node to the point of synchronisation?
@rmb,
The time to perform a catchup with the current implementation would be linear in the number of blocks that need to be downloaded and validated (i.e. the download corresponds to network bandwidth and the validation corresponds to CPU time).
Reviewing the GitHub repository, it seems that there is some work there which attempts to address some of these concerns, but we’ll have to wait for the official release to find out.
Also, if all you want is a non-archival node, you can grab a sandbox and copy the data directory from there onto your local machine. I haven’t attempted that myself, but I believe it would be a viable alternative to a long wait.
Hi @tsachi, I noticed the fast catchup in the GitHub repository and look forward to hearing more about it.
My node is currently catching up at an average of around 61K blocks per hour (~17 blocks per second) and is not limited by network bandwidth. The mainnet is currently a little over 8.1 million blocks, which gives my machine an estimated time to catch up of nearly 133 hours (5 and a half days).
Granted, my machine is relatively low spec and 10 years old, so according to Moore’s law I might expect a similar machine purchased today to process around 29 times faster (assuming a 40% annual growth rate). It is currently consuming around 300 KiB/s, so the increased processing rate would take it up to around 9 MiB/s (which is well within the upper limit of my network connection).
Assuming that it will not be bound by network bandwidth, then could I really expect my node to take just 4.6 hours to synchronise if I were to deploy onto a recently purchased mid-spec laptop?
If so, then why does the Docker Sandbox section include the comment “This method is recommended if you […] but can’t wait days for your node to catchup”?
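The arithmetic behind these estimates can be reproduced with a short sketch. It uses only the figures quoted in this post and assumes the catchup is purely CPU-bound, that the validation rate stays constant over the whole chain, and that the 40% annual growth translates directly into block validation speed (none of which is guaranteed).

```python
# Back-of-the-envelope catch-up estimate using the figures quoted in this post.
CHAIN_LENGTH_BLOCKS = 8_100_000    # "a little over 8.1 million blocks"
OBSERVED_BLOCKS_PER_HOUR = 61_000  # measured on the 10-year-old machine
ANNUAL_GROWTH = 0.40               # assumed yearly growth in compute power
MACHINE_AGE_YEARS = 10


def catchup_hours(blocks: float, blocks_per_hour: float) -> float:
    """Hours needed to validate `blocks` at a constant rate (CPU-bound case)."""
    return blocks / blocks_per_hour


def speedup(annual_growth: float, years: int) -> float:
    """Compound speed-up after `years` of growth at `annual_growth` per year."""
    return (1 + annual_growth) ** years


old_hours = catchup_hours(CHAIN_LENGTH_BLOCKS, OBSERVED_BLOCKS_PER_HOUR)
factor = speedup(ANNUAL_GROWTH, MACHINE_AGE_YEARS)
new_hours = old_hours / factor

print(f"blocks per second: {OBSERVED_BLOCKS_PER_HOUR / 3600:.1f}")
print(f"old machine:       {old_hours:.0f} h ({old_hours / 24:.1f} days)")
print(f"speed-up factor:   {factor:.1f}")
print(f"modern machine:    {new_hours:.1f} h")
```

The sketch gives roughly 133 hours for the old machine, a factor of about 29, and about 4.6 hours for the hypothetical modern machine, matching the numbers above.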
I am making a few assumptions here. E.g.
What experience has anyone had synchronising a server-grade, high-bandwidth/cloud-hosted node?
What is a realistic number of blocks per hour it can process while catching up?
My initial question in this thread was about how long it might take to synchronise a node in 1-year, 5-years, or even 10-year’s time.
GoalSeeker is currently showing the average blocktime to be around 4.4 seconds. At this rate, in a year’s time we can expect the mainnet to be around 7 million blocks longer than it is right now - meaning that my current node would take a further ~115 hours (~4.8 days) to synchronise from a standing start. (My mythical modern processing node would take a further 4 hours.)
This suggests that the processing effort required to synchronise a node is out-pacing Moore’s law (perhaps Moore’s law will not be true moving forwards).
Feel free to criticize my assumptions and I look forward to hearing people’s real-world experience of this.
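For anyone who wants to play with the assumptions, here is a minimal projection sketch. It assumes the block time stays at 4.4 seconds, that the cost of validating a block does not change, and that the catchup remains CPU-bound; the factor of 29 for a machine bought today is the same assumption as in the post above, not a measurement.

```python
# Projection of chain length and CPU-bound sync time, using figures from this thread.
SECONDS_PER_YEAR = 365 * 24 * 3600
BLOCK_TIME_SECONDS = 4.4      # average block time quoted above
CURRENT_BLOCKS = 8_100_000    # chain length at the time of posting
OLD_MACHINE_RATE = 61_000     # blocks/hour measured on the 10-year-old machine
MODERN_SPEEDUP = 29           # assumed factor for a machine bought today


def blocks_added(years: float, block_time: float = BLOCK_TIME_SECONDS) -> float:
    """Blocks produced in `years` at a constant block time."""
    return years * SECONDS_PER_YEAR / block_time


def sync_hours(total_blocks: float, blocks_per_hour: float) -> float:
    """Hours to validate `total_blocks` from a standing start."""
    return total_blocks / blocks_per_hour


for years in (1, 5, 10):
    total = CURRENT_BLOCKS + blocks_added(years)
    old = sync_hours(total, OLD_MACHINE_RATE)
    modern = sync_hours(total, OLD_MACHINE_RATE * MODERN_SPEEDUP)
    print(f"+{years:>2} y: {total / 1e6:5.1f}M blocks, "
          f"old machine {old / 24:5.1f} days, modern machine {modern:5.1f} h")
```

With the unrounded 4.4 second block time the chain grows by about 7.2 million blocks per year, slightly more than the rounded 7 million used in the post.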
@rmb,
Moore’s law (or Intel’s interpretation of it…) says that compute power doubles every so often. It doesn’t say anything about a particular application being capable of utilizing that growth in compute power. (For instance, newer Intel CPUs have better vector-oriented instructions. If you were to compare previous processors to newer ones, you would find that by using the newer instructions you can achieve better performance, but you need to have your binary optimized for that processor.)
Realistically, while a newer computer would (usually) be able to verify blocks faster, having more and more blocks to verify would make it a losing battle. I.e. I don’t expect that next year’s processors would be fast enough to double the block verification rate, regardless of what optimizations we make to block verification.
If an Algorand node 10 years from now wanted to catch up, it would either need 10 times more compute power or need to use a different algorithm to achieve that.
@tsachi, I agree with you.
I’m in the early phase of planning implementation, BCP and DR strategies. Several hours (let alone several days) down-time to resync a node could be quite costly for some organisations. This needs to be weighed against the cost of operating multiple nodes and /or using an API service as a contingency.
Do you know of anyone who has experience of synchronising a server-grade, high-bandwidth/cloud-hosted node?
What is a realistic number of blocks per hour it can process while catching up?
One option is to take regular snapshots of the Algorand data folder. If you trust where you store snapshots, you can start from these snapshots. This makes catchup very fast.
What I am not sure about is whether you can take the snapshot while algod is running or whether it is best to stop algod when taking the snapshot. (Using copy-on-write filesystems, the interruption can be very brief, but still not completely negligible depending on your load.) @tsachi Do you know whether snapshotting can be done while algod is running?
@rmb,
It’s hard to advocate for any “good” numbers. As you’ve experienced, both bandwidth and compute power are factors in the practical time it takes to catch up. Generally speaking, using a computer ( host ) that is running in a high-bandwidth datacenter could accelerate the process dramatically.
I would recommend against taking a snapshot while the system is running.
SQLite has its own “backup” mechanism that can be used to create a backup of a running database. However, given that you’ll need to back up both the blocks and tracker databases, it might not be possible to obtain a lock on both of them “atomically”.
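To illustrate the point about the SQLite backup mechanism, here is a minimal sketch using Python’s standard `sqlite3` module, whose `Connection.backup()` wraps SQLite’s online backup API. The file names `blocks.sqlite` and `tracker.sqlite` are placeholders chosen for this example, not the names algod actually uses, and the demo databases are created on the fly. Each copy is internally consistent, but the two copies are taken one after the other, which is exactly the atomicity problem described above.

```python
import sqlite3
import tempfile
from pathlib import Path

# Placeholder file names for illustration only.
DB_NAMES = ["blocks.sqlite", "tracker.sqlite"]


def online_backup(src_path: Path, dst_path: Path) -> None:
    """Copy one SQLite database using the online backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # consistent snapshot of this single database
    finally:
        dst.close()
        src.close()


def backup_all(data_dir: Path, backup_dir: Path, names: list[str]) -> list[Path]:
    """Back up each database in turn; the copies are NOT taken at one instant."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copies = []
    for name in names:
        target = backup_dir / name
        online_backup(data_dir / name, target)
        copies.append(target)
    return copies


# Demo with throwaway databases standing in for a node's data directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp, "data")
    data_dir.mkdir()
    for name in DB_NAMES:
        con = sqlite3.connect(data_dir / name)
        con.execute("CREATE TABLE state (round INTEGER)")
        con.execute("INSERT INTO state VALUES (8100000)")
        con.commit()
        con.close()

    for copy in backup_all(data_dir, Path(tmp, "backup"), DB_NAMES):
        con = sqlite3.connect(copy)
        print(copy.name, con.execute("SELECT round FROM state").fetchone()[0])
        con.close()
```

If the node writes to either database between the two `backup()` calls, the copies can describe different rounds, so this approach on its own does not give a safe snapshot of a running node.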
@tsachi Hopefully someone will spot this thread in the near future and share their experience of starting up a node in such an environment.
@fabrice,
You must have been reading my mind! I’m looking at what it will take to productionise the environment. [Down]time is money (as they say) and if it takes several days to sync a new node then this has an implication on the minimum viable environment and recovery strategy.
If a node were to suffer a failure then traditional backup methodologies would appear to offer an approach to restoring a node - rather than resyncing it from scratch.
@tsachi,
If there is no viable mechanism to snapshot a node while it is running then this is a major drawback - but not insurmountable. Clearly it would be possible to stop the node, take a snapshot and start the node again, but it would be preferable if algod were able to output a replica copy of its databases on receipt of a signal (e.g. on reaching a particular block number). Then traditional backup technologies could move a copy off-node ready for any future restoration.
Ideally I would want the databases (and all other relevant files) to have checksum files generated (and ultimately replicated to a WORM) so that the backup is tamper evident.
The procedure above will need some refining e.g. with a cluster of three (minimum) nodes performing the same activity and comparing the hash values from the snapshots. This would potentially allow a new node to be bootstrapped in a secure manner from the backup copies prior to bringing it into the network - at which point it would either continue synchronizing from where it was or throw an exception and terminate due to an inconsistency.
If the WORM used to store the snapshot hash files were to be the mainnet then I could envisage interorganisational cooperation for the restoration strategy. (Though this might be a step too far for some who would wish to be in full control of their own backups.)
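The checksum-file idea above can be sketched as follows. This is only an illustration of a tamper-evident manifest built with SHA-256 over every file in a snapshot directory; the directory layout and function names are hypothetical, and where the manifest is stored (WORM storage or otherwise) is outside the scope of the sketch.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large databases do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(snapshot_dir: Path) -> dict[str, str]:
    """Map each file (relative path) in the snapshot to its SHA-256 digest."""
    return {
        p.relative_to(snapshot_dir).as_posix(): sha256_of(p)
        for p in sorted(snapshot_dir.rglob("*"))
        if p.is_file()
    }


def verify(snapshot_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing, altered, or not listed in the manifest."""
    current = build_manifest(snapshot_dir)
    changed = [name for name in manifest if current.get(name) != manifest[name]]
    extra = [name for name in current if name not in manifest]
    return sorted(changed + extra)
```

A restore procedure would call `verify()` against the stored manifest before starting algod on the restored data, and refuse to proceed if the returned list is not empty.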
@fabrice,
I see you have been active in the sandbox development. Might it be possible to extend the sandbox concept so that it could be used to create production-ready nodes?
I have not considered backup of a live system as a required system feature. Could you please share more information regarding which nodes you would like to back up (archival nodes vs non-archival nodes), and for what purpose?
My goal here is to see if backing up is really the right solution or not.
Keep in mind that if you have 3 different nodes, each one of them would have a different blocks database, since the votes each one of them sees might arrive in a different order. (The blocks database stores the first votes for every block that reaches the vote threshold; other votes are omitted.)
The blocks paysets ( transactions ) themselves are naturally identical, and the balances for any particular round are also identical.
@tsachi,
Sorry for the delay - I guess you know what it is like! Anyway it’s always wise to step back from something and critique whether the right route is being considered, so I appreciate your response.
The scenario…
Consider a consortium requiring a private blockchain implementation (private for reasons we will not explore here). The implementation will probably be a hybrid with some transactions ending up on the mainnet, but the important point here is that there may be one or more private chains.
A minimum viable environment will require a minimum number of nodes…
Perhaps the minimum number of relay nodes (for the private chain) might be defined as 5, split across a minimum of 3 participating organisations. With such a small number of relay nodes, the need to get one back on its feet quickly after a failure/corruption/attack is quite important to the overall resilience of the implementation. A back-up solution that allows rapid recovery for the relay nodes would be desirable. The alternative is to operate a greater number of relay nodes, creating a greater level of intrinsic resilience, but this is accompanied by more significant operating costs.
Now turning our attention to archival nodes. It is likely that all major players within the consortium will wish to have archival nodes for the private chain as it will contain data that has business value. It is possible that the players will wish to run business-critical analytics derived from data on the private chain. The loss of an archival node due to failure/corruption/attack could have significant operational implications and therefore a mechanism to recover it quickly would be highly desirable.
As for the participation nodes: For the purpose of this argument I will consider their role as application gateways (irrespective of whether they are also actively participating in consensus). As an application gateway, a node provides an API to client devices to allow past transactions to be interrogated and new transactions to be posted. For operational control I will assume that each organisation participating in the consortium will expect their own devices/applications to communicate with their own participation nodes.
The required performance capacity might be calculated with an n+1 approach. If this results in more than 2 nodes being required then housing them in separate data centres becomes more challenging for an organisation with a traditional 2-DC approach. A DR plan would require 2n nodes equally split between DCs and if one DC is likely to be down for a prolonged period of time then a conservative approach would be to have 2(n+1) nodes. This is all starting to sound costly.
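As a rough illustration of how quickly the node count grows under these rules, here is a small sizing sketch. The peak load and per-node capacity figures in the example are invented for illustration; only the n+1, 2n and 2(n+1) rules come from the paragraph above.

```python
import math


def nodes_for_peak(peak_load: float, per_node_capacity: float) -> int:
    """Smallest n such that n nodes can carry the peak load."""
    return math.ceil(peak_load / per_node_capacity)


def sizing(n: int) -> dict[str, int]:
    """Node counts for the three approaches discussed above."""
    return {
        "n+1 (single site)": n + 1,
        "2n (split across two DCs)": 2 * n,
        "2(n+1) (conservative DR)": 2 * (n + 1),
    }


# Hypothetical example: 2,500 requests/s at peak, 1,000 requests/s per node.
n = nodes_for_peak(2500, 1000)
for approach, count in sizing(n).items():
    print(f"{approach}: {count} nodes")
```

With these made-up figures n is 3, so the three approaches need 4, 6 and 8 nodes respectively.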
Now if we turn to the capacity profile - what if it is very peaky? Perhaps the total number of nodes being defined will actually only be required for a week or two every year! If this were predictable (like an election or a fiscal year end) then bringing new nodes into operation the traditional way could be planned. But what if the load is unpredictable? The ability to quickly spin up a fully synchronised new node in near-real time starts to look indispensable.
It looks possible to me that if a node were to be taken offline (from a consensus perspective) and the algod process stopped then it should be possible to make a backup from which it could be restored.
Whilst this would be a very traditional three-generation approach to backup it would facilitate a relatively speedy recovery. Re-syncing a day’s worth of transactions shouldn’t take too long.
A more integrated checkpoint/snapshot mechanism would be desirable though.
It would be interesting to know whether this backup/recovery strategy could be used to clone other nodes (with appropriate config changes before starting algod). Would it undermine the integrity of the network if x% of the nodes were to be cloned?
If there is another way to look at this then I am open to suggestions.
Hi @rmb,
Thank you for the detailed answer. This really helps me to try and provide the most suitable solution for your needs.
One of the challenges in backing up a node (regardless of its function) is that it’s ever changing. The node data keeps changing very rapidly to reflect the latest state of the network. This has a few implications:
Given that we know all the above, we can go ahead and design a backup strategy that would meet all these constraints.
In your detailed design, you defined three classes of nodes ( relays, participation and archival ). Given that all relays are currently also archival nodes, I’m going to skip the “archival” ones. The only difference today between an archival node and a relay node is its configuration. The internal data storage is identical.
For creating a backup for the relays/archival nodes, I suggest that you run another node which would be configured as an archival node. This node does not need to be exposed to the internet, and would just “catch up”. The host which runs the node would stop it once every hour, back up its content, and resume its operation. You would need to save only the two latest copies at any given time.
The two copies would allow you to restore a node while another copy is being created. Depending on the time it takes you to upload the local archive, you would be able to configure the frequency of the backups.
Note that for emergency cases, you would also be able to use the above hosts as a cold standby relay.
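The stop, copy, resume cycle described above could be scripted along these lines. This is a sketch under several assumptions: the `goal` CLI is on the PATH and the node is controlled with `goal node stop -d <datadir>` and `goal node start -d <datadir>`, the data directory is copied as a whole with `shutil.copytree`, snapshot labels sort chronologically (for example UTC timestamps), and the backup node holds no participation keys. The function names are made up for this example.

```python
import shutil
import subprocess
import time
from pathlib import Path

KEEP = 2  # "save only the two latest copies"


def run_goal(action: str, data_dir: Path) -> None:
    """Stop or start the node; assumes the `goal` CLI is installed and on PATH."""
    subprocess.run(["goal", "node", action, "-d", str(data_dir)], check=True)


def snapshot(data_dir: Path, backup_root: Path, label: str) -> Path:
    """Copy the (stopped) node's data directory to backup_root/label."""
    target = backup_root / label
    shutil.copytree(data_dir, target)
    return target


def prune(backup_root: Path, keep: int = KEEP) -> list[Path]:
    """Delete all but the `keep` newest snapshots (ordered by name)."""
    snapshots = sorted(p for p in backup_root.iterdir() if p.is_dir())
    removed = snapshots[:-keep] if keep else snapshots
    for old in removed:
        shutil.rmtree(old)
    return removed


def backup_cycle(data_dir: Path, backup_root: Path, label: str, control=run_goal) -> Path:
    """One stop, copy, resume cycle followed by pruning of old copies."""
    control("stop", data_dir)
    try:
        copy = snapshot(data_dir, backup_root, label)
    finally:
        control("start", data_dir)  # resume the node even if the copy fails
    prune(backup_root)
    return copy


def utc_label() -> str:
    """Timestamp label that sorts chronologically, e.g. 20200101T120000Z."""
    return time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
```

An hourly scheduler (cron, a systemd timer, or similar) would call `backup_cycle(data_dir, backup_root, utc_label())`; the `control` parameter exists so the cycle can be exercised without a real node.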
As for backing up a non-archival node ( i.e. participation node ), please ask me that next week. We have a planned release coming soon, and I believe that one of the features there would be able to assist you with that.
Hi @rmb,
Just a small remark complementing @tsachi’s great response:
You should never back up participation keys held by participation nodes.
See the important warning on top of https://developer.algorand.org/docs/run-a-node/participate/generate_keys/
Hi all!
If I have a running relay node and I point another node to use it with the -p switch, does that have any advantage? (for sync’ing)
Regards.
Hi @tsachi,
Thanks for your great answer. I can plan to use this solution for all nodes (even the participation nodes can be run as archival nodes - it will just require more disk space, which is relatively cheap compared with the time to synchronise). Using a “hidden” node specifically implemented to create the backups will have the added benefit of creating an additional layer of protection from any adversaries that might wish to disrupt operations.
As the “backup” creation node will not be participating in consensus, there is no possibility of backing up participation keys (thanks @fabrice for calling this out).
I look forward to hearing more about the feature you refer to in the upcoming release. Perhaps we’ll be able to containerise a participation node and spin it up on demand as a microservice.
@tsachi, I see that we now have documentation about “Sync Node Network using Fast Catchup”. I’ll be testing this out over the coming days.
https://developer.algorand.org/docs/run-a-node/setup/install/
If it is of any interest to the forum members, my ancient machine with a sub-optimal storage device took just 12 minutes to catchup after being down for 24 hours.
It is just a couple of hundred thousand blocks short of being sync’d to the mainnet and I plan to test the catchup time after being down 24 hours. Perhaps this will help others plan their MVE and BCP configuration.