Peer crash due to error during block commit #fabric #couchdb


Faisal
 

Error:
Got error while committing(read tcp 10.0.9.49:37944->10.0.9.15:5984: read: connection reset by peer

error reading response body

 

 

We have the following environment setup and configurations for an HLF 1.4.2 network.

 

1- Network deployed using docker swarm with service recovery

2- Peer and CouchDB data directories mapped so that state persists across crashes

3- Prometheus deployed for event monitoring and log aggregation

4- Configuration for CouchDB set to:

a. CORE_LEDGER_STATE_COUCHDBCONFIG_REQUESTTIMEOUT=120s

b. CORE_LEDGER_STATE_COUCHDBCONFIG_MAXRETRIES=5

c. CORE_LEDGER_STATE_COUCHDBCONFIG_MAXUPDATEBATCHSIZE=5000

d. CORE_LEDGER_STATE_COUCHDBCONFIG_INTERNALQUERYLIMIT=5000

e. CORE_LEDGER_STATE_COUCHDBCONFIG_TOTALQUERYLIMIT=5000

 

CouchDB and the peer run on the same host. The host specs are given below:

 

RAM:        8 GB

Processor:  Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

Storage:    400 GB

CPU Cores:  2

 

The peer sends a request to CouchDB to commit a block and either gets an invalid response or the connection times out, which causes the peer to crash. The logs can be found in the attached peer.log file.

The CouchDB logs are also attached. They show an error at the same time the peer crashed, but the error does not provide any useful information.

 

As a result, only the peer container crashes, and the swarm creates a new peer container to recover. The CouchDB container does not crash and stays healthy. When we checked the Prometheus dashboard for peer downtime, none was recorded, possibly because the scrape_interval is set to 30 seconds and the service returns to a healthy state before the next scrape.

Please note that after the crash we don’t lose any state or see issues with any of the other services.

We found that a similar issue was opened on Jira, but it was closed without a conclusive solution. Its Fix Version is set to v1.4.5, which is not even a valid HLF version.

https://jira.hyperledger.org/browse/FAB-16611?jql=text%20~%20%22couch%20db%20timeout%22

 

 

Expectation:

We are looking for either a fixed version of CouchDB that does not throw this error, or a fix in the peer so that, if CouchDB returns this error, the peer retries the request or handles it gracefully instead of crashing.
