Missed data/transactions while stress testing #fabric #fabric-questions #network #fabric-orderer #hyperledger-fabric


jefferson.rs@...
 

Hi.
 
My team is running a couple of stress tests on HLF 2.3.0 in a network with 3 orderers and 2 peers (each peer belongs to a different org), sending an array of JSON data through an API developed with Node SDK 2.2 (each transaction is approx. 50 KB).
 
While running a load of some millions of transactions, we observed that a couple of documents were missing from the ledger. At the same time, the API didn't receive any error for these missing transactions.
 
During the test, we noticed some WARN messages in the orderer logs that might be a clue, but they are not returned to the API as errors, so we are not sure whether they are related:
 
2021-02-01 12:10:06.708 UTC [orderer.common.broadcast] ProcessMessage -> WARN 88af9a2 [channel: ch1] Rejecting broadcast of normal message from 57.145.150.41:54444 with SERVICE_UNAVAILABLE: rejected by Order: aborted
 
2021-01-26 14:16:16.530 UTC [orderer.common.cluster.step] sendMessage -> WARN 33245e Stream 7 to orderer1(orderer1:443) was forcibly terminated because timeout (7s) expired
2021-01-26 14:18:22.123 UTC [orderer.consensus.etcdraft] run -> WARN 19e666 WAL sync took 29.466370256 seconds and the network is configured to start elections after 5 seconds. Your disk is too slow and may cause loss of quorum and trigger leadership election. channel=ch1 node=2
 
So we have a couple of questions on which we would like some feedback from the community:
- are there any internal errors (more related to orderers) that might not be returned to the API and cause missing transactions?
- if so, what would be the best way to ensure that this data gets registered in the ledger? Wouldn't the orderer be "smart enough" to notice that an error occurred and replay the transaction itself after some time, since the transaction was already proposed and approved by a peer? Would any listener be able to catch failures like this, so that we could replay the transaction from the API? If so, could someone provide an example, please?
 
Thanks in advance.

Jeff.


David Enyeart
 

When you submit a transaction to ordering, it is not guaranteed to get ordered into a block. If an orderer encounters an issue (as it looks like yours has due to stress), the transaction may not get ordered. In a distributed system the time to commit cannot be reliably predicted, therefore the orderer returns success and then processes the submission asynchronously. Client applications need to listen for transaction events regardless since the transaction may ultimately get invalidated even if it is ordered. Most client applications will listen for transaction events, and then resubmit upon timeout or invalidation.

This is mentioned in a few places in the docs, but we have opened https://jira.hyperledger.org/browse/FAB-18420 to make it clearer, specifically in the Developing Applications topic.
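
The resubmit-on-timeout-or-invalidation pattern described above can be sketched roughly as follows. This is only an illustration, not code from the Fabric docs: `submitWithRetry`, `SubmitFn` and `RetryOptions` are hypothetical names, and the sketch assumes the submit function rejects when the commit is not observed in time or the transaction is invalidated (which is my understanding of how `contract.submitTransaction` behaves in fabric-network 2.2 with a commit event strategy configured).

```typescript
// Hypothetical helper: none of these names are part of the Fabric SDK.
// A SubmitFn is assumed to resolve only once the commit has been observed,
// and to reject on commit timeout or invalidation.
type SubmitFn<T> = () => Promise<T>;

interface RetryOptions {
  maxAttempts: number; // total tries, including the first one
  delayMs: number;     // pause between tries, in milliseconds
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function submitWithRetry<T>(
  submit: SubmitFn<T>,
  opts: RetryOptions
): Promise<T> {
  if (opts.maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      // Success is only reported when the submit function resolves.
      return await submit();
    } catch (err) {
      // Remember the failure and wait before the next try, if any remain.
      lastError = err;
      if (attempt < opts.maxAttempts) {
        await sleep(opts.delayMs);
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

With the Node SDK this might be wired up as `submitWithRetry(() => contract.submitTransaction('createDocument', payload), { maxAttempts: 3, delayMs: 2000 })`, where `createDocument` and `payload` are placeholders. Each retry is a new transaction, so the chaincode should tolerate the case where an earlier attempt is committed late and the same data arrives twice.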


Dave Enyeart

In reply to the message from jefferson.rs@... sent 02/09/2021 02:41 PM.

jefferson.rs@...
 

Hi Dave.
 
Thank you very much for your response and clarification. The documentation update will be more than welcome, as I feel this is a very important point and the docs should cover it.
 
I've been experimenting and looking at the Node and Go SDK documentation lately, and the most complete code example I found was this one: https://github.com/hyperledger/fabric-sdk-node/blob/release-2.2/test/ts-scenario/config/handlers/sample-transaction-event-handler.ts (linked in the docs at https://hyperledger.github.io/fabric-sdk-node/release-2.2/tutorial-transaction-commit-events.html).
 
We've changed our gateway object's eventHandlerOptions strategy to point to this customized event handler and ran some tests locally. The handler appeared to work as expected in these 2 situations: when the transaction completes successfully and when the timeout expires.
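
To make those two outcomes concrete, here is a minimal sketch of the idea as I understand it: a race between the commit notification and a timer. It is not the sample handler's actual code; `waitForCommit` and `CommitOutcome` are hypothetical names, and the `commit` promise stands in for whatever the handler resolves when the expected commit events arrive.

```typescript
// Hypothetical sketch, not part of the Fabric SDK.
type CommitOutcome = { status: "committed" } | { status: "timeout" };

async function waitForCommit(
  commit: Promise<void>, // assumed to resolve when the commit event is seen
  timeoutMs: number
): Promise<CommitOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with "timeout" if no commit notification arrives in time.
  const timeout = new Promise<CommitOutcome>((resolve) => {
    timer = setTimeout(() => resolve({ status: "timeout" }), timeoutMs);
  });

  // Maps a successful commit notification to the "committed" outcome.
  const committed = commit.then(
    (): CommitOutcome => ({ status: "committed" })
  );

  try {
    // Whichever settles first decides the outcome. If `commit` rejects
    // (for example, an invalidated transaction), the rejection propagates.
    return await Promise.race([committed, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive.
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  }
}
```

A caller could treat a `timeout` outcome as the signal to resubmit the transaction.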
 
We haven't yet managed to test this on the network we are stress testing, so we don't know whether the handler will receive any event in that case. From the source code and documentation, my impression is that the event listener is attached to what happens on the peers, whereas in this case it seems the listener would need to pick up events and failures raised by the orderers. But I may be misunderstanding this.
 
Will keep this topic updated after we run our stress tests again with the customized event handler.