Faisal
Error: error reading response body
We have the following environment setup and configurations for an HLF 1.4.2 network.
1- Network deployed using Docker Swarm with service recovery
2- Peer and CouchDB directories mapped in order to persist data to maintain state after crashes
3- Prometheus deployed for event monitoring and log aggregation
4- Configuration for CouchDB set to
   a. - CORE_LEDGER_STATE_COUCHDBCONFIG_REQUESTTIMEOUT=120s
   b. - CORE_LEDGER_STATE_COUCHDBCONFIG_MAXRETRIES=5
   c. - CORE_LEDGER_STATE_COUCHDBCONFIG_MAXUPDATEBATCHSIZE=5000
   d. - CORE_LEDGER_STATE_COUCHDBCONFIG_INTERNALQUERYLIMIT=5000
   e. - CORE_LEDGER_STATE_COUCHDBCONFIG_TOTALQUERYLIMIT=5000
CouchDB and the peer run on the same host. The specs for the host are given below:
RAM: 8GB
Processor: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
Storage: 400GB
CPU Cores: 2
The peer sends a request to CouchDB to commit a block and in return either gets an invalid response or the connection times out, which causes the peer to crash. The logs can be found in the attached peer.log file.
As a result, only the peer container crashes, and the swarm creates a new peer container to recover. The CouchDB container doesn't crash and stays healthy. When we checked the Prometheus dashboard for peer downtime, none was recorded, maybe because the scrape_interval is set to 30 seconds and the service returns to its healthy state before the next scrape. Please note that after the crash we don't lose any state or see issues with any of the other services.
https://jira.hyperledger.org/browse/FAB-16611?jql=text%20~%20%22couch%20db%20timeout%22
Expectation: We are looking for either a fixed version of CouchDB that doesn't throw this error, or a fix in the peer so that, if CouchDB returns this error, the peer retries the request or handles it gracefully without crashing.