Hello All-
I get the below error from Infinispan during deployments:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 1592914 from keycloak-4,keycloak-2
I am using the Keycloak 16 Bitnami chart in an HA deployment on Kubernetes, with 5 replicas (pods), each running on a separate node.
I am using the default embedded Infinispan cache (running inside the Keycloak pods) with 4 owners per cache, and KUBE_PING for cluster discovery.
cache:
  authOwnersCount: 4
  ownersCount: 4
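For context, in the Keycloak 16 (WildFly) distribution these chart values end up as `owners` attributes on the distributed caches in standalone-ha.xml. A rough sketch of what the rendered config looks like (only the session-related caches shown; exact structure may differ slightly between chart versions):

```xml
<!-- Hypothetical excerpt of the rendered standalone-ha.xml -->
<cache-container name="keycloak">
  <transport lock-timeout="60000"/>
  <distributed-cache name="sessions" owners="4"/>
  <distributed-cache name="authenticationSessions" owners="4"/>
  <distributed-cache name="clientSessions" owners="4"/>
  <!-- offlineSessions, offlineClientSessions, loginFailures,
       actionTokens omitted for brevity -->
  <replicated-cache name="work"/>
</cache-container>
```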
I had maintenance to do on 2 nodes, so I drained them, which evicted 2 of the 5 Keycloak pods. I was hoping for no downtime, since in theory at least 2 of the remaining pods would still hold replicated copies of each cache entry. But I saw the errors above and also lost all user sessions. The admin panel was down and I wasn't able to get in; I ended up restarting the whole cluster to resolve it.
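One thing I'm now considering for future drains (an assumption on my side, not something the chart set up for me; the label selector below is hypothetical and needs to match your actual pod labels): a PodDisruptionBudget so that voluntary evictions can only take out one Keycloak pod at a time, giving Infinispan a chance to rebalance between node drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: keycloak-pdb
spec:
  maxUnavailable: 1   # evictions proceed one pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: keycloak   # hypothetical label; check your chart's pod labels
```

With this in place, `kubectl drain` on the second node would block until the pod evicted from the first node is rescheduled and ready.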
Given my goal is zero downtime and no session loss, could someone answer my concerns below?
If pods go down, isn't there automatic rebalancing in distributed caching so that requests won't keep trying to reach dead pods?
Also, I'm not sure why 4 cache owners were not enough to prevent downtime.
Researching more, it sounds like using a separate Infinispan cluster or offline sessions is a probable solution. Would upgrading to Keycloak 19 (Quarkus-based) resolve any of these issues, so that I don't need a separate Infinispan cluster setup?
+1 We also faced a similar issue, but in our case we have 6 nodes and 6 cache owners. During a rolling upgrade the pods are replaced one by one, and during this process we start to see these ISPN000476 "Timed out waiting for responses" errors, which also caused client authorization failures. The issues happen over a short period of time, and as soon as the rolling deployment finishes, everything starts working fine again. So it seems Keycloak doesn't support smooth rolling upgrades out of the box, and it's worth trying an external Infinispan cluster.
As for losing the sessions: could it be that you updated the number of cache owners on the running cluster and ran a rolling deploy to apply the new settings? Unfortunately, we found out that changing the number of cache owners requires a full cluster stop, as dynamically changing this parameter was not tested and may cause unpredictable behavior, according to replies from the Keycloak developers.
Usually, setting the number of cache owners equal to the number of nodes should prevent losing any cache data. In our case we had issues only during the deployment itself, and sessions were preserved afterwards. We also update pods one by one.
Thanks for sharing your experience @dionis. It wasn't due to updating the number of owners. In the recent deployment where it happened, we actually lost all user sessions; the admin panel showed the count of active sessions drop to zero. I am currently looking into persisting sessions into a JDBC store through the cache config file. We might hit similar issues if an external Infinispan cluster goes down or has problems, so backing sessions up in the DB seems safest for guaranteed zero downtime.
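For anyone curious, this is roughly the direction I'm exploring. A sketch of attaching an Infinispan JDBC string-keyed store to the sessions cache (the datasource JNDI name, table prefix, and column names/types here are my assumptions for a Postgres-backed setup, and I haven't validated this end to end yet):

```xml
<distributed-cache name="sessions" owners="4">
  <persistence passivation="false">
    <string-keyed-jdbc-store xmlns="urn:infinispan:config:store:jdbc:11.0"
                             shared="true" preload="true">
      <!-- assumed datasource name; point at your Keycloak DB -->
      <data-source jndi-url="java:jboss/datasources/KeycloakDS"/>
      <string-keyed-table prefix="ispn_sessions">
        <id-column name="id" type="VARCHAR(255)"/>
        <data-column name="data" type="BYTEA"/>
        <timestamp-column name="ts" type="BIGINT"/>
      </string-keyed-table>
    </string-keyed-jdbc-store>
  </persistence>
</distributed-cache>
```

With `shared="true"` all nodes write to the same table, so a session should survive even if every pod that owned it in memory goes away; the trade-off is extra DB load on every session write.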
Hi @anand-kulk, did you implement session persistence via a JDBC store? If so, is it working better and more stably now? Just trying to get some feedback on whether it's worth implementing the same.