Participation node catchpoint failing midway

Hi team, I need help. I run a participation node on Ubuntu 18.04 LTS. The Algorand packages install without problems. The catchpoint catchup runs well at first but then fails almost midway.

I observed the following entries in the node.log file that could indicate an issue.

{"file":"bootstrap.go","function":"github.com/algorand/go-algorand/tools/network.ReadFromSRV","level":"info","line":43,"msg":"ReadFromBootstrap: DNS LookupSRV failed when using system resolver: no answer for (_algobootstrap._tcp.mainnet.algorand.network., 33) from DNS servers [127.0.0.53:53]","time":"2021-10-02T11:32:37.516836Z"}

{"event":"Disconnected","file":"wsNetwork.go","function":"github.com/algorand/go-algorand/network.(*WebsocketNetwork).removePeer","level":"info","line":2211,"local":"http:","messageDelay":824321043,"msg":"Peer r-fm.algorand-mainnet.network:4160 disconnected: LeastPerformingPeer","name":"","remote":"r-fm.algorand-mainnet.network:4160","time":"2021-10-02T11:28:35.956508Z"}

{"details":{"Address":"3.113.22.62","HostName":"b4f47fff-d790-4962-9b48-dc5b1f1342fc","Incoming":false,"InstanceName":"/LuSXbCIlNr4FswK","Endpoint":"r-fm.algorand-mainnet.network:4160","MessageDelay":824321043,"Reason":"LeastPerformingPeer"},"file":"telemetry.go","function":"github.com/algorand/go-algorand/logging.(*telemetryState).logTelemetry","instanceName":"2Fe82Qf+0VOaln5A","level":"info","line":259,"msg":"/Network/DisconnectPeer","name":"","session":"","time":"2021-10-02T11:28:35.956543Z"}

{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":313,"msg":"unable to download ledger : context deadline exceeded","name":"","time":"2021-10-02T11:20:21.660196Z"}

{"callee":"github.com/algorand/go-algorand/ledger.(*CatchpointCatchupAccessorImpl).ResetStagingBalances.func1","caller":"github.com/algorand/go-algorand/ledger/catchupaccessor.go:190","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).atomic","level":"warning","line":344,"msg":"dbatomic: tx surpassed expected deadline by 12m13.244978086s","name":"","readonly":false,"time":"2021-10-02T11:32:35.905265Z"}

$ goal node status -w 1000
Last committed block: 26983
Sync Time: 23274.2s
Catchpoint: 16570000#I6A3FR72VGNMBGINWOP5K6AYKTV7ZZUKONYHLBHLWQQSYRN2EPMA
Catchpoint total accounts: 13459624
Catchpoint accounts processed: 261632
Catchpoint accounts verified: 0
Genesis ID: mainnet-v1.0

$ goal version -v -d /var/lib/algorand/
Version: [v1 v2]
GenesisID: mainnet-v1.0
Build: 3.0.1.stable [rel/stable] (commit #b619b940)

Same behavior on Ubuntu 20.04 LTS too.

What kind of disk are you using?

Because of the current TPS, an SSD is required to run a node on Algorand.
Some older SATA SSDs may not be fast enough, and an NVMe SSD may be required.
In any case, a node will not work if your disk is a hard drive or an SD card (even the latest generation of SD cards).
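
If no benchmarking tool is at hand, a rough way to get a feel for small random writes is a sketch like the one below. It is only an illustration, not a replacement for a proper benchmark: the function name `random_write_mb_per_s` and its default sizes are made up for this example, and it relies on `os.pwrite`, which is available on Linux and other Unix systems.

```python
import os
import random
import tempfile
import time


def random_write_mb_per_s(directory, file_mb=64, block=4096, writes=2000):
    """Time synced 4 KiB writes at random offsets and return MB/s.

    Hypothetical helper for illustration; sizes are arbitrary defaults.
    """
    size = file_mb * 1024 * 1024
    payload = os.urandom(block)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, size)  # give the test file its full length
        blocks = size // block
        start = time.perf_counter()
        for _ in range(writes):
            offset = random.randrange(blocks) * block
            os.pwrite(fd, payload, offset)
            os.fsync(fd)  # push each write to the device, not just the cache
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return (writes * block) / elapsed / 1e6


# Example call (assumes the node's data lives under /var/lib/algorand):
# print(f"{random_write_mb_per_s('/var/lib/algorand'):.2f} MB/s")
```

Run it against the same filesystem that holds the node's data directory, since that is the disk that matters. Syncing after every write makes the number pessimistic, so treat it as a relative indicator between disks rather than a spec-sheet figure.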

Thank you, Fabrice. The host has a SATA SSD. However, 6 million accounts are processed flawlessly, then there are no updates for 10 minutes, and eventually the catchup restarts.

Another thing: AlgoExplorer shows a current TPS of 12-13, which doesn’t look very high. Or am I reading the data wrong? Perhaps it’s the sync size causing the issue.

If the first 6 million accounts went fast and it stopped afterward, I would suspect some kind of random communication problem.

Do you see this behaviour repeatedly?

tsachi, yes, it happens every time.
During the freeze I ran connectivity tests to the internet, such as pinging Yahoo and Google, and saw no drops.

CPU is below 2% and memory below 2 GB, which is quite normal. The internet connection is 50 Mbps. The algod process is active too.

Maybe I should run tcpdump too, to see which side is not responding. Have you seen these disconnects or DNS logs before?

The DNS logs are not an issue. The “proof” is that the node was downloading the catchpoint file, which means it was able to find a relay to download it from.
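
To keep info-level entries like the DNS one from hiding the real problems, something along these lines can filter node.log down to the warnings. This is a sketch only: `warnings_only` is a made-up helper, the sample lines are shortened versions of the entries quoted earlier in this thread, and the "error" level is an assumption, since only "info" and "warning" appear in the posted logs.

```python
import json


def warnings_only(lines, levels=("warning", "error")):
    """Yield (time, msg) for JSON log lines whose level is in `levels`.

    Hypothetical helper; "error" is an assumed level name.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("level") in levels:
            yield entry.get("time"), entry.get("msg")


# Shortened versions of two entries from the log excerpt in this thread.
sample = [
    '{"level":"info","msg":"ReadFromBootstrap: DNS LookupSRV failed when using system resolver","time":"2021-10-02T11:32:37.516836Z"}',
    '{"level":"warning","msg":"unable to download ledger : context deadline exceeded","time":"2021-10-02T11:20:21.660196Z"}',
]
for when, msg in warnings_only(sample):
    print(when, msg)
```

On a real node the same function can be fed an open file object for node.log in the data directory instead of the `sample` list.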

The account-writing speed should be (roughly) constant. If it stops, it means that something happened…

How long does it take before “Catchpoint accounts processed” gets to 6 million? It should be on the order of minutes (5-10). If it takes notably longer (i.e. 90+ minutes), then it probably means that your disk is too slow.
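
To put a number on that, the counters printed by `goal node status` can be compared between two snapshots. The sketch below assumes the output format shown earlier in this thread; the helper names are invented for the example, and the second snapshot is hypothetical data, not a real measurement.

```python
import re

# Matches the three catchpoint counter lines of `goal node status`.
COUNTER = re.compile(
    r"Catchpoint (total accounts|accounts processed|accounts verified): (\d+)"
)


def parse_status(text):
    """Extract the catchpoint counters from status output (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = COUNTER.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def accounts_per_second(earlier, later, seconds_apart):
    """Processing rate between two snapshots taken `seconds_apart` apart."""
    delta = later["accounts processed"] - earlier["accounts processed"]
    return delta / seconds_apart


first = parse_status("""\
Catchpoint total accounts: 13459624
Catchpoint accounts processed: 261632
Catchpoint accounts verified: 0
""")
# Hypothetical second snapshot, taken 60 seconds after the first.
second = {**first, "accounts processed": 1261632}

rate = accounts_per_second(first, second, 60)
remaining = first["total accounts"] - second["accounts processed"]
print(f"{rate:.0f} accounts/s, about {remaining / rate / 60:.1f} min remaining")
```

A rate that holds steady and then drops to zero points to a stall, while a rate that is low from the very beginning is more consistent with slow storage.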

Thanks, tsachi. For some reason I get only 110 Mbps sequential throughput on my disk and less than 1 Mbps for 4K RW. I didn’t realize it was so slow.

I’ll fix that first and then try running the node again.