Algod syncing very slowly

Hi,

I am trying to sync a full archival node for mainnet in order to run my own indexer. It is going very slowly. At first I was on a slow Oracle cloud instance but I have since moved to a dedicated server where network/compute/disk resources should not be a bottleneck.

The sync process has been going on since mid-November and I am still at block 14736614.

At times it seems to hang and sync very, very slowly. I get these errors intermittently:

{"file":"bootstrap.go","function":"github.com/algorand/go-algorand/tools/network.ReadFromSRV","level":"info","line":43,"msg":"ReadFromBootstrap: DNS LookupSRV failed when using system resolver: no signature in DNS response for _algobootstrap._tcp.mainnet.algorand.network","time":"2022-01-23T11:08:25.289766+01:00"}

I saw elsewhere on the forums that access to 8.8.8.8 may be an issue, but this doesn’t seem to be the case for me:

root@k43 ~ # dig @8.8.8.8 _algobootstrap._tcp.testnet.algorand.network srv

; <<>> DiG 9.16.1-Ubuntu <<>> @8.8.8.8 _algobootstrap._tcp.testnet.algorand.network srv
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1903
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_algobootstrap._tcp.testnet.algorand.network. IN SRV

;; ANSWER SECTION:
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 relay-virginia-testnet-algorand.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 80 relay-switzerland1-testnet-algorand.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 londong-algorand-test-2.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 r19.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 singaporeg-algorand-test-1.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 singaporeg-algorand-test-2.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 oregong-algorand-test-1.algorand-testnet.network.
_algobootstrap._tcp.testnet.algorand.network. 300 IN SRV 1 1 4161 londong-algorand-test-1.algorand-testnet.network.

;; Query time: 12 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: So Jan 23 11:10:43 CET 2022
;; MSG SIZE  rcvd: 623

I am on Ubuntu 20.04 and I have both global and interface DNS servers set to 8.8.8.8 with DNSSEC enabled:

root@k43 ~ # systemd-resolve --status | grep DNS
MulticastDNS setting: no                  
  DNSOverTLS setting: no                  
      DNSSEC setting: yes                 
    DNSSEC supported: yes                 
  Current DNS Server: 8.8.8.8             
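
For completeness, the global settings live in /etc/systemd/resolved.conf; mine is roughly the fragment below (trimmed to the relevant keys), applied with systemctl restart systemd-resolved:

    [Resolve]
    DNS=8.8.8.8
    DNSSEC=yes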

Possibly worth noting that I have an indexer running against algod while it is syncing (via HTTP as it is on another server). Is this recommended? Do I need "IsIndexerActive": true in config.json in this case?

IPv4 and IPv6 connectivity seems stable.

My config.json file can be found here

Last lines of node.log (excluding indexer GET requests) can be found here

algod version is 3.2.3.stable

Is there anything obviously wrong with my setup?

Thanks in advance,

Bit

I just realised that I was checking a testnet DNS record. I checked a mainnet one with dig and my default DNS resolver was returning nothing:

root@k43 ~ # dig _archive._tcp.mainnet.algorand.network SRV

; <<>> DiG 9.16.1-Ubuntu <<>> _archive._tcp.mainnet.algorand.network SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4846
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;_archive._tcp.mainnet.algorand.network.	IN SRV

;; Query time: 12 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: So Jan 23 11:25:45 CET 2022
;; MSG SIZE  rcvd: 67

I set /etc/resolv.conf to 8.8.8.8 and now it resolves:

root@k43 ~ # dig _archive._tcp.mainnet.algorand.network SRV

; <<>> DiG 9.16.1-Ubuntu <<>> _archive._tcp.mainnet.algorand.network SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20822
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_archive._tcp.mainnet.algorand.network.	IN SRV

;; AUTHORITY SECTION:
algorand.network.	428	IN	SOA	scott.ns.cloudflare.com. dns.cloudflare.com. 2268030982 10000 2400 604800 3600

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: So Jan 23 11:27:59 CET 2022
;; MSG SIZE  rcvd: 130
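
Concretely, /etc/resolv.conf now contains just:

    nameserver 8.8.8.8

Note that on Ubuntu 20.04 this file is normally a symlink managed by systemd-resolved, so a hand-edit like this may not survive a reboot.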

I suppose this is partly resolved as it is now syncing a bit faster, but the question stands: should this be taking so long? It’s been 2 months…
Spoke too soon, the DNS errors are still there:

{"file":"bootstrap.go","function":"github.com/algorand/go-algorand/tools/network.ReadFromSRV","level":"info","line":43,"msg":"ReadFromBootstrap: DNS LookupSRV failed when using system resolver: no signature in DNS response for _archive._tcp.mainnet.algorand.network","time":"2022-01-23T11:34:25.343841+01:00"}

My system resolver is now 8.8.8.8 and the SRV query for that hostname completes without error:

dig _archive._tcp.mainnet.algorand.network SRV

; <<>> DiG 9.16.1-Ubuntu <<>> _archive._tcp.mainnet.algorand.network SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55711
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_archive._tcp.mainnet.algorand.network.	IN SRV

;; AUTHORITY SECTION:
algorand.network.	1192	IN	SOA	scott.ns.cloudflare.com. dns.cloudflare.com. 2268030982 10000 2400 604800 3600

;; Query time: 8 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: So Jan 23 11:33:43 CET 2022
;; MSG SIZE  rcvd: 130

Also, when I run goal node status it can take more than a minute for the response to come back:

$ date; goal node status; date
So 23. Jan 11:36:56 CET 2022
Last committed block: 14736966
Time since last block: 553.1s
Sync Time: 23740.4s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/d050b3cade6d5c664df8bd729bf219f179812595
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/d050b3cade6d5c664df8bd729bf219f179812595
Round for next consensus protocol: 14736967
Next consensus protocol supported: true
Last Catchpoint: 14730000#4MWG65KJIUYUTYGIN2XDJCAII6MFEUSKN7BXR7PR3OB3E7TYKTGA
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
So 23. Jan 11:38:05 CET 2022

I disabled DNSSEC by setting "DNSSecurityFlags": 0 in config.json and I am now getting "no such host" errors:

{"file":"bootstrap.go","function":"github.com/algorand/go-algorand/tools/network.ReadFromSRV","level":"info","line":43,"msg":"ReadFromBootstrap: DNS LookupSRV failed when using system resolver: lookup _archive._tcp.mainnet.algorand.network on 8.8.8.8:53: no such host","time":"2022-01-23T12:45:00.346820+01:00"}

When I get “no such host” in the algod logs I can still look up the host with dig and it works properly:

root@k43 ~ # dig  _archive._tcp.mainnet.algorand.network srv +dnssec

; <<>> DiG 9.16.1-Ubuntu <<>> _archive._tcp.mainnet.algorand.network srv +dnssec
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14951
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 512
;; QUESTION SECTION:
;_archive._tcp.mainnet.algorand.network.	IN SRV

;; AUTHORITY SECTION:
algorand.network.	580	IN	SOA	scott.ns.cloudflare.com. dns.cloudflare.com. 2268030982 10000 2400 604800 3600
algorand.network.	580	IN	RRSIG	SOA 13 2 3600 20220124122448 20220122102448 34505 algorand.network. AsW0qZAryY9a91/ztN1yXV/BqqJNzCHGsGIDLi1DfYKGJB6wrzgv7YI5 lFl3F2sx+blAKsOgY1br8Nup0Q/yZw==
_archive._tcp.mainnet.algorand.network.	2379 IN	NSEC \000._archive._tcp.mainnet.algorand.network. RRSIG NSEC
_archive._tcp.mainnet.algorand.network.	2379 IN	RRSIG NSEC 13 5 3600 20220124122448 20220122102448 34505 algorand.network. 6xDeh5iLldLd8OQ+ZbJDDgQudH8gis0FfHh1EGgENT0QiRPHpuHlOZjd gKy0bkrXu4cvvOayxLzIMwy8JpNpFQ==

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: So Jan 23 12:45:08 CET 2022
;; MSG SIZE  rcvd: 416

Other things I have checked:

CPU utilization on the box is low

Network bandwidth is nowhere near saturated (a few MB/s with speedtest of 100MB/s up+down)

Disk utilisation is somewhat high (see the iostat sketch after this list)

I am syncing ~8.2 blocks per minute
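
For reference, disk utilisation can be watched live with iostat from the sysstat package while algod is catching up:

# extended per-device stats, refreshed every 5 seconds;
# a %util column near 100 means the disk is the bottleneck
iostat -x 5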

Regarding the IsIndexerActive:
You want to set it to false. This option is practically deprecated.

Also, if the catchup is making any progress, then leave the DNS messages alone; that’s not the droid you’re looking for.

You want to set it to false. This option is practically deprecated.

Thanks, I figured it was a leftover from when algod had the indexer integrated. I have it set to false.

Also, if the catchup is making any progress, then leave the DNS messages alone; that’s not the droid you’re looking for.

It is making some progress but it seems to stall for tens of minutes at a time.

Any thoughts on what I should prod? At this rate I don’t think it will catch up.

Could you post here the latest 50 lines from the node.log file that contain the word ‘catchup’?

Thanks for the attention! Uploaded here.

I haven’t seen anything bad going on there… could you share with the forum what type of storage and compute you’re using?

Sure,

CPU: Intel(R) Core™ i7-4770 CPU @ 3.40GHz

RAM: 32GB ECC

HDD: 2x WD WD2000FYYZ-0 2TB 7200RPM in md/raid 1

Sorry to press the DNS issue again, but are the “no such host” errors common/to be expected? Getting a lot of these:

ReadFromBootstrap: DNS LookupSRV failed when using system resolver: lookup _archive._tcp.mainnet.algorand.network on 8.8.8.8:53: no such host

According to goal node status, right now it seems to be stuck:

$ goal node status
Last committed block: 14740128
Time since last block: 13011.7s
Sync Time: 24677.5s

Current logs here

edit: current telemetry metrics pretty-printed are here. Do these look right?

Regarding _archive._tcp.mainnet.algorand.network: yes, this is expected. It’s part of algod’s support for block archivers, which have not been enabled yet. The only negative consequence is that your node might emit these benign log messages from time to time.

As for your hardware spec, I suspect that the HDD is the culprit. Regardless of the RAID 1 configuration, HDDs are generally far too slow to support the current load on mainnet. I’d suggest you select a hosting service that supports a local NVMe drive.

I understand that the HDD is not ideal and performance may be somewhat degraded, but in this case it seems to have been stuck without a new block for ~4 hrs (15431 sec).

I ran goal node status every 10 minutes after I noticed it getting stuck, and saw this (clipped to the relevant lines):

So 23. Jan 19:38:37 CET 2022
Last committed block: 14740128
Time since last block: 13631.0s

So 23. Jan 19:48:37 CET 2022
Last committed block: 14740128
Time since last block: 14231.0s


So 23. Jan 19:58:37 CET 2022
Last committed block: 14740128
Time since last block: 14831.1s
Sync Time: 26496.9s


So 23. Jan 20:08:37 CET 2022
Last committed block: 14740128
Time since last block: 15431.1s
Sync Time: 27096.9s

So 23. Jan 20:21:50 CET 2022
Cannot contact Algorand node: Get "http://[::]:$myport/v2/status": EOF

So 23. Jan 20:33:18 CET 2022
Last committed block: 14740503
Time since last block: 687.8s
Sync Time: 28577.6s

So 23. Jan 20:45:12 CET 2022
Last committed block: 14741206
Time since last block: 239.1s

I should have mentioned that the architecture of the original algod node (aarch64) was different from the current one (x86). I switched after ~12M blocks or so, if memory serves.

Could this be contributing to these issues?

Looks like it is syncing, albeit slowly. Not sure why it got stuck before, but it seems to be past it now. I wrote a script to monitor progress:

#!/bin/bash
# Log sync progress: block height, delta, and blocks/minute per interval.

sleep_duration=20m

while true; do
	# Height is the 4th field of "Last committed block: <N>".
	block=$(goal node status | grep 'Last committed' | awk '{print $4}')
	if [ -z "$block" ]; then
		echo "no block"
		sleep "$sleep_duration"
		continue
	fi
	epoch=$(date +%s)
	date=$(date --utc)
	if [ -z "$lastblock" ]; then
		# first sample: nothing to diff against yet
		diff=0
		speed=0
	else
		diff=$((block - lastblock))
		elapsed=$((epoch - lastepoch))
		speed=$((60 * diff / elapsed))  # integer blocks/minute
	fi
	echo "$epoch $date $block +$diff $speed/minute"
	lastblock=$block
	lastepoch=$epoch
	sleep "$sleep_duration"
done

in case it is useful to anyone else.

My output is:

$ bash print-algod-progress.sh | tee -a algod-progress.log
1643121207 Di 25. Jan 14:33:27 UTC 2022 15014661 +0 0/minute
1643122417 Di 25. Jan 14:53:37 UTC 2022 15017662 +3001 148/minute
1643123619 Di 25. Jan 15:13:39 UTC 2022 15020677 +3015 150/minute
1643124823 Di 25. Jan 15:33:43 UTC 2022 15024672 +3995 199/minute
1643126057 Di 25. Jan 15:54:17 UTC 2022 15027858 +3186 154/minute
1643127262 Di 25. Jan 16:14:22 UTC 2022 15030774 +2916 145/minute
1643128471 Di 25. Jan 16:34:31 UTC 2022 15033943 +3169 157/minute
1643129675 Di 25. Jan 16:54:35 UTC 2022 15037121 +3178 158/minute
1643130876 Di 25. Jan 17:14:36 UTC 2022 15040375 +3254 162/minute
1643132076 Di 25. Jan 17:34:36 UTC 2022 15043735 +3360 168/minute
1643133280 Di 25. Jan 17:54:40 UTC 2022 15047366 +3631 180/minute
1643134484 Di 25. Jan 18:14:44 UTC 2022 15050383 +3017 150/minute
1643135688 Di 25. Jan 18:34:48 UTC 2022 15053128 +2745 136/minute

With mainnet producing ~13 blocks/min and the node syncing at ~150 blocks/min, it seems it will finish syncing in ~19 days. I can live with that but gosh, I wasn’t expecting this endeavour to last from November to February!
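
Back-of-the-envelope (the chain-tip height here is a rough guess on my part, not a value measured in this thread):

# "tip" is a hypothetical mainnet height; the other numbers are from above
tip=18900000; cur=15053128      # guessed tip vs. my current height
sync=150; chain=13              # sync rate vs. mainnet block rate, blocks/min
echo "$(( (tip - cur) / ((sync - chain) * 60 * 24) )) days"   # prints "19 days"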

Maybe a warning about HDD/SSD/NVMe would be helpful in the algod docs.

Since you are syncing an indexer at the same time (also since November?), did you see the announcement to update the indexer to 2.8.0, which requires reindexing everything from scratch? That will probably take weeks (it does for me), and I can’t wrap my head around the fact that there is just no way to speed this up (like, you know, having several instances to distribute the work).

Since you are syncing an indexer at the same time (also since November?),

Yes I was, but stopping the indexer had a negligible effect on algod performance.

Thanks for the heads up! I did indeed restart indexing with 2.8.0 as it promises “Enhancements to import performance.”

With the previous indexer I could only manage 10-15 blocks a minute (at block height ~15M), so it would have taken between 1 and 2 standard Eras to complete.


Regarding speeding up algod on slow disks:

A good option to save disk I/O is

    "CadaverSizeTarget": 0,

This disables the cadaver file, which is only useful for debugging the agreement protocol, per this question

Another tip that helped me a bit (I think - hard to say as there is variance in the speed between blocks anyhow) was to move the node.log file to a ramdisk and symlink it to the data dir, so that the log is written to RAM and doesn’t compete for I/O.
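
Roughly what that looks like (illustrative paths; /var/lib/algorand is the Debian-package data dir, adjust for your setup, and stop the node before moving the file):

# tmpfs contents are lost on reboot, which is fine for a log
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk
# move the log to RAM and leave a symlink behind in the data dir
sudo mv /var/lib/algorand/node.log /mnt/ramdisk/node.log
sudo ln -s /mnt/ramdisk/node.log /var/lib/algorand/node.log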

Something else I changed that I hope won’t bite me in the behind:

    "LedgerSynchronousMode": 0,

This does:

	// LedgerSynchronousMode defines the synchronous mode used by the ledger database. The supported options are:
	// 0 - SQLite continues without syncing as soon as it has handed data off to the operating system.
	// 1 - SQLite database engine will still sync at the most critical moments, but less often than in FULL mode.
	// 2 - SQLite database engine will use the xSync method of the VFS to ensure that all content is safely written to the disk surface prior to continuing. On Mac OS, the data is additionally synchronized via fullfsync.
	// 3 - In addition to what is done in 2, it provides additional durability if the commit is followed closely by a power loss.
	// for further information see the description of SynchronousMode in dbutil.go
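
Putting the two tweaks together, the relevant fragment of config.json ends up as:

    {
        "CadaverSizeTarget": 0,
        "LedgerSynchronousMode": 0
    }

Fair warning on mode 0: SQLite no longer waits for writes to reach the disk, so a crash or power loss can corrupt the ledger database and force a full re-sync.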

My full config can be found here

… the things we do.
