Participation node catchpoint failing midway

somethingnewgood · October 2, 2021, 1:57pm

Hi team, I need help. I run a participation node on Ubuntu 18.04 LTS. The Algorand packages install fine. The catchpoint catchup runs well at first but then fails almost midway.

I observed the following entries in the node.log file that could indicate an issue.

{"file":"bootstrap.go","function":"github.com/algorand/go-algorand/tools/network.ReadFromSRV","level":"info","line":43,"msg":"ReadFromBootstrap: DNS LookupSRV failed when using system resolver: no answer for (_algobootstrap._tcp.mainnet.algorand.network., 33) from DNS servers [127.0.0.53:53]","time":"2021-10-02T11:32:37.516836Z"}

{"event":"Disconnected","file":"wsNetwork.go","function":"github.com/algorand/go-algorand/network.(*WebsocketNetwork).removePeer","level":"info","line":2211,"local":"http:","messageDelay":824321043,"msg":"Peer r-fm.algorand-mainnet.network:4160 disconnected: LeastPerformingPeer","name":"","remote":"r-fm.algorand-mainnet.network:4160","time":"2021-10-02T11:28:35.956508Z"}

{"details":{"Address":"3.113.22.62","HostName":"b4f47fff-d790-4962-9b48-dc5b1f1342fc","Incoming":false,"InstanceName":"/LuSXbCIlNr4FswK","Endpoint":"r-fm.algorand-mainnet.network:4160","MessageDelay":824321043,"Reason":"LeastPerformingPeer"},"file":"telemetry.go","function":"github.com/algorand/go-algorand/logging.(*telemetryState).logTelemetry","instanceName":"2Fe82Qf+0VOaln5A","level":"info","line":259,"msg":"/Network/DisconnectPeer","name":"","session":"","time":"2021-10-02T11:28:35.956543Z"}

{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageLedgerDownload","level":"warning","line":313,"msg":"unable to download ledger : context deadline exceeded","name":"","time":"2021-10-02T11:20:21.660196Z"}

{"callee":"github.com/algorand/go-algorand/ledger.(*CatchpointCatchupAccessorImpl).ResetStagingBalances.func1","caller":"github.com/algorand/go-algorand/ledger/catchupaccessor.go:190","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).atomic","level":"warning","line":344,"msg":"dbatomic: tx surpassed expected deadline by 12m13.244978086s","name":"","readonly":false,"time":"2021-10-02T11:32:35.905265Z"}
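
Since each log entry is one JSON object per line, the warnings can be separated from the routine "info" noise with a few lines of Python. This is only a minimal sketch: the field names `level`, `file`, and `msg` are taken from the entries above, the function name `summarize_warnings` is made up for illustration, and the sample lines are abbreviated copies of the entries above rather than full log records.

```python
import json
from collections import Counter


def summarize_warnings(lines):
    """Count warning/error entries by (file, msg) in JSON-lines log text."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("level") in ("warning", "error"):
            counts[(entry.get("file", "?"), entry.get("msg", ""))] += 1
    return counts


# Abbreviated entries from the log excerpt above (most fields dropped).
sample = [
    '{"file":"catchpointService.go","level":"warning",'
    '"msg":"unable to download ledger : context deadline exceeded"}',
    '{"file":"wsNetwork.go","level":"info",'
    '"msg":"Peer r-fm.algorand-mainnet.network:4160 disconnected: LeastPerformingPeer"}',
    "this line is not JSON and is ignored",
]

for (source, msg), n in summarize_warnings(sample).most_common():
    print(n, source, msg)
```

To run it against a real log, pass an open file object instead of `sample`, for example `open("/var/lib/algorand/node.log")`, assuming node.log sits in the data directory used in the `goal version` command below.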

$ goal node status -w 1000
Last committed block: 26983
Sync Time: 23274.2s
Catchpoint: 16570000#I6A3FR72VGNMBGINWOP5K6AYKTV7ZZUKONYHLBHLWQQSYRN2EPMA
Catchpoint total accounts: 13459624
Catchpoint accounts processed: 261632
Catchpoint accounts verified: 0
Genesis ID: mainnet-v1.0

$ goal version -v -d /var/lib/algorand/
Version: [v1 v2]
GenesisID: mainnet-v1.0
Build: 3.0.1.stable [rel/stable] (commit #b619b940)

somethingnewgood · October 2, 2021, 1:58pm

Same behavior on Ubuntu 20.04 LTS too.

fabrice · October 2, 2021, 3:59pm

What kind of disk are you using?

Because of the current TPS, an SSD is required to run a node on Algorand.
Some older SATA SSDs may not be fast enough, and an NVMe SSD may be required.
In any case, a node will not work if your disk is a hard drive or an SD card (even the latest generation of SD cards).

somethingnewgood · October 2, 2021, 4:32pm

Thank you Fabrice. The host has a SATA SSD. However, 6 million accounts are processed flawlessly, then there are no updates for 10 minutes, and eventually the catchup restarts.

Another thing: AlgoExplorer shows a current TPS of 12-13, which doesn’t look very high. Or am I reading the data wrong? Probably it’s the sync size causing the issue.

tsachi · October 2, 2021, 5:18pm

If the first 6m accounts went fast and it stopped afterward, I would think there was some type of random communication problem.

Do you see this behaviour repeatedly?

somethingnewgood · October 2, 2021, 5:47pm

tsachi, yeah, it happened every time.
I ran connectivity tests to the internet during the freeze, such as pings to Yahoo and Google, but saw no drops.

CPU is under 2% and memory under 2 GB, which is quite normal. Internet is 50 Mbps. The algod process is active too.

Maybe I should run tcpdump too, to see which side is not responding. Have you seen these disconnects or DNS logs before?

tsachi · October 2, 2021, 6:54pm

The DNS logs are not an issue. The “proof” for that is that it was downloading the catchpoint file, which means that it was able to find a relay to download it from.

The accounts writing speed should be (roughly) constant. If it stops it means that something happened…

How long does it take before “Catchpoint accounts processed” gets to 6m? It should be on the order of minutes (5-10). If it takes notably longer (e.g. 90+ minutes), then it probably means that your disk is too slow.
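
To put numbers on that rule of thumb: 6 million accounts in 5-10 minutes works out to roughly 10,000-20,000 accounts per second, while 90 minutes would be about 1,100 per second. A rough sketch of estimating the rate from two readings of “Catchpoint accounts processed” follows. The function name `catchup_rate` is made up for illustration, the first reading and the total come from the status output earlier in the thread, and the second reading (taken 60 seconds later) is a hypothetical value.

```python
def catchup_rate(t0, processed0, t1, processed1, total):
    """Return (accounts_per_second, eta_seconds) from two status samples.

    t0/t1 are sample times in seconds; processed0/processed1 are the
    "Catchpoint accounts processed" values read at those times.
    """
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    rate = (processed1 - processed0) / elapsed
    remaining = total - processed1
    eta = remaining / rate if rate > 0 else float("inf")
    return rate, eta


# Reference rates implied by the rule of thumb above.
healthy_low = 6_000_000 / (10 * 60)   # 6m accounts in 10 minutes
healthy_high = 6_000_000 / (5 * 60)   # 6m accounts in 5 minutes
too_slow = 6_000_000 / (90 * 60)      # 6m accounts in 90 minutes

# First sample is from the status output in this thread; the second
# sample (861632 after 60 s) is hypothetical.
rate, eta = catchup_rate(0, 261_632, 60, 861_632, total=13_459_624)
print(f"{rate:.0f} accounts/s, about {eta / 60:.0f} minutes remaining")
print(f"healthy range: {healthy_low:.0f}-{healthy_high:.0f} accounts/s")
```

A rate that starts inside the healthy range and then drops to zero points to a stall rather than a uniformly slow disk, which is the distinction being drawn above.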

somethingnewgood · October 3, 2021, 1:29pm

Thanks tsachi. For some reason I get only 110 Mbps sequential and less than 1 Mbps 4K RW on my disk. Didn’t realize it was so slow.

I’ll fix it first and then try running the node again.
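
For anyone who wants a quick sanity check of small-block write speed before starting a sync, here is a rough Python sketch that writes 4 KiB blocks at random offsets and syncs after each one. It is only an approximation and not a substitute for a proper benchmarking tool; the function name `random_write_mb_per_s` and its default sizes are made up for illustration, it relies on `os.pwrite` (Unix only), and it reports megabytes per second rather than the Mbps figures quoted above.

```python
import os
import random
import tempfile
import time


def random_write_mb_per_s(directory, file_size=16 * 1024 * 1024,
                          block=4096, writes=256):
    """Write `writes` blocks at random offsets, fsync each, return MB/s."""
    payload = os.urandom(block)
    blocks = file_size // block
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, file_size)  # reserve the file's logical size
        start = time.perf_counter()
        for _ in range(writes):
            offset = random.randrange(blocks) * block
            os.pwrite(fd, payload, offset)
            os.fsync(fd)  # force the block to disk before continuing
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
        os.unlink(path)  # always remove the scratch file
    return (writes * block) / elapsed / 1e6


if __name__ == "__main__":
    # The temp directory is used here only as a stand-in; point this at
    # the disk that holds the node's data directory to test that disk.
    print(f"{random_write_mb_per_s(tempfile.gettempdir()):.2f} MB/s")
```

Syncing after every block is deliberately pessimistic. It is meant to resemble a workload of many small committed writes, which is where slow disks tend to show up, as the `dbatomic` deadline warning in the first post suggests.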
