SUGGESTED FIX
During reconnection, if the fetching job is in the "STARTED" state, we should not ask it to end: the job is either busy distributing received notifications to listeners, or blocked waiting for the reconnection. For this case we only need to do a special cleanup of the job once the reconnection has finished.
A test is attached to reproduce the problem and to verify the suggested solution.
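The suggested change can be sketched roughly as follows. This is only an illustration under assumptions: the class FetchJobReconnectSketch, its enums and the SPECIAL_CLEAN marker are hypothetical names, and the real cleanup done by the connector is not shown. It only shows that preReconnection() leaves a STARTED job alone and that postReconnection() decides what to clean from the job state.

```java
/** Hypothetical model of the fetching job; not the connector's real class. */
class FetchJobReconnectSketch {
    enum JobState { STOPPED, STARTED, STOPPING }

    /** Placeholder for "which cleanup is needed"; the real cleanup is not modelled. */
    enum Cleanup { NONE, SPECIAL_CLEAN }

    private JobState state;

    FetchJobReconnectSketch(JobState initial) {
        this.state = initial;
    }

    synchronized JobState state() {
        return state;
    }

    /** Called before reconnecting: the job is not asked to end any more. */
    synchronized void preReconnection() {
        // Intentionally no "state = STOPPING" and no waiting for the job to end.
    }

    /** Called once the connection is back: cleanup depends on the job state. */
    synchronized Cleanup postReconnection() {
        if (state == JobState.STARTED) {
            // The job kept running (or waiting) during the reconnection,
            // so only this case needs the special cleanup.
            return Cleanup.SPECIAL_CLEAN;
        }
        return Cleanup.NONE;
    }

    public static void main(String[] args) {
        FetchJobReconnectSketch job = new FetchJobReconnectSketch(JobState.STARTED);
        job.preReconnection();
        System.out.println(job.state() + " -> " + job.postReconnection());
    }
}
```

Since preReconnection() no longer waits for the job, the thread doing the reconnection cannot be blocked by a fetching thread that is itself waiting for the reconnection.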
|
|
|
EVALUATION
Today we end the fetching job because some cleanup is needed after the reconnection. That is why the method "preReconnection()" sets the "STOPPING" flag for the fetching job and then waits for the job to end. This logic causes the deadlock, because the fetching job is itself waiting for the reconnection to finish.
One solution is not to ask the job to end during the reconnection, and instead to do the necessary cleanup in the method "postReconnection". We must be very careful, because the cleanup differs according to the state of the fetching job.
|
|
|
EVALUATION
If my previous evaluation is correct, the solution is:
when the fetching thread finds the state = RE_CONNECTING, instead of only waiting for that state to change, it can also check whether the other state (the one of the fetching job) is equal to STOPPING. If so, it should stop, which frees the thread doing the reconnection.
We need a test to verify the previous evaluation and this solution.
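A minimal sketch of this idea, assuming a single lock guards both states. The class FetcherStopCheckSketch and its method names are hypothetical and are not taken from the connector code; only the state names RE_CONNECTING and STOPPING come from the evaluation above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model: a fetching thread that also checks STOPPING while waiting. */
class FetcherStopCheckSketch {
    enum ConnState { CONNECTED, RE_CONNECTING }
    enum JobState { STARTED, STOPPING, STOPPED }

    private final Object lock = new Object();
    private ConnState connState = ConnState.RE_CONNECTING;
    private JobState jobState = JobState.STARTED;

    /** Fetching thread: returns true to resume fetching, false if it stopped. */
    boolean awaitReconnectionOrStop() throws InterruptedException {
        synchronized (lock) {
            // Wait for the reconnection to finish, but leave the loop on STOPPING too.
            while (connState == ConnState.RE_CONNECTING && jobState != JobState.STOPPING) {
                lock.wait();
            }
            if (jobState == JobState.STOPPING) {
                jobState = JobState.STOPPED;
                lock.notifyAll(); // frees the thread doing the reconnection
                return false;
            }
            return true;
        }
    }

    /** Reconnecting thread: marks the reconnection as finished. */
    void reconnectionFinished() {
        synchronized (lock) {
            connState = ConnState.CONNECTED;
            lock.notifyAll();
        }
    }

    /** Reconnecting thread: asks the job to stop and waits until it has stopped. */
    void stopFetcherAndWait() throws InterruptedException {
        synchronized (lock) {
            jobState = JobState.STOPPING;
            lock.notifyAll();
            while (jobState != JobState.STOPPED) {
                lock.wait();
            }
        }
    }

    /** Runs both roles once; returns true if the fetcher stopped and released the caller. */
    static boolean fetcherStopsDuringReconnection() {
        FetcherStopCheckSketch sketch = new FetcherStopCheckSketch();
        AtomicBoolean resumed = new AtomicBoolean(true);
        Thread fetcher = new Thread(() -> {
            try {
                resumed.set(sketch.awaitReconnectionOrStop());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "fetcher");
        fetcher.start();
        try {
            sketch.stopFetcherAndWait(); // would block forever without the STOPPING check
            fetcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !resumed.get();
    }

    public static void main(String[] args) {
        System.out.println(fetcherStopsDuringReconnection());
    }
}
```

The loop condition is the important part: if it tested only connState, stopFetcherAndWait() would never see STOPPED while the connection state is still RE_CONNECTING.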
|
|
|
EVALUATION
For B), the reason for the blocking is:
One user thread was doing a query (the querying thread) and got an IOException, so this thread was used to run the reconnection process. Another thread, created by the client connector to fetch notifications (the fetching thread), got an IOException too, found that another thread was already doing the reconnection, and so was waiting.
After the connection was reestablished, the user thread doing the reconnection was used to 1) automatically re-add all listeners to the remote client, and 2) set a flag telling the fetching thread to stop, because a new fetching thread would be created once the reconnection was completed.
The problem seems clear now: the user thread was waiting for the fetching thread to die before creating a new fetching thread, but the current fetching thread was waiting for the user thread to finish the reconnection.
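The circular wait described above can be illustrated with the following sketch. It is not the connector code: ReconnectWaitCycleSketch and userThreadStuck are hypothetical names, a CountDownLatch stands in for "reconnection finished", and a timed join is used so that the example terminates instead of really deadlocking.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical model of the wait cycle between the user thread and the fetching thread. */
class ReconnectWaitCycleSketch {
    /**
     * Plays the user thread. Returns true if the fetching thread is still alive
     * after the user thread has waited timeoutMillis for it to die.
     */
    static boolean userThreadStuck(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            // join(0) means "wait forever", which would be the real deadlock.
            throw new IllegalArgumentException("timeoutMillis must be > 0");
        }
        CountDownLatch reconnectionFinished = new CountDownLatch(1);

        // Fetching thread: waits for the reconnection to finish.
        Thread fetcher = new Thread(() -> {
            try {
                reconnectionFinished.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "fetcher");
        fetcher.setDaemon(true);
        fetcher.start();

        try {
            // User thread: waits for the fetching thread to die BEFORE it
            // declares the reconnection finished. Without the timeout this
            // join would never return.
            fetcher.join(timeoutMillis);
            boolean stuck = fetcher.isAlive();

            // Break the cycle so that the sketch can end cleanly.
            reconnectionFinished.countDown();
            fetcher.join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("user thread stuck: " + userThreadStuck(200));
    }
}
```

The fetching thread can only end after countDown(), and countDown() is only reached after the join gives up, so each thread is waiting for the other.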
|
|
|
|