Problem description
A Spring Boot application deployed on one node of a Kubernetes cluster fails to connect to a Redis cluster and reports connect timed out. The full error is:
Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.data.redis.connection.jedis.JedisConnectionFactory]: Factory method 'jedisConnectionFactory' threw exception; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: connect timed out
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:185)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:582)
... 89 common frames omitted
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: connect timed out
at redis.clients.jedis.Connection.connect(Connection.java:207)
at redis.clients.jedis.BinaryClient.connect(BinaryClient.java:93)
at redis.clients.jedis.Connection.sendCommand(Connection.java:126)
at redis.clients.jedis.Connection.sendCommand(Connection.java:117)
at redis.clients.jedis.BinaryClient.auth(BinaryClient.java:564)
at redis.clients.jedis.BinaryJedis.auth(BinaryJedis.java:2138)
at redis.clients.jedis.JedisClusterConnectionHandler.initializeSlotsCache(JedisClusterConnectionHandler.java:36)
at redis.clients.jedis.JedisClusterConnectionHandler.<init>(JedisClusterConnectionHandler.java:17)
at redis.clients.jedis.JedisSlotBasedConnectionHandler.<init>(JedisSlotBasedConnectionHandler.java:24)
at redis.clients.jedis.BinaryJedisCluster.<init>(BinaryJedisCluster.java:54)
at redis.clients.jedis.JedisCluster.<init>(JedisCluster.java:93)
at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.createCluster(JedisConnectionFactory.java:423)
at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.createCluster(JedisConnectionFactory.java:393)
at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.afterPropertiesSet(JedisConnectionFactory.java:350)
at com.lanweihong.hotel.pms.redis.RedisConfig.jedisConnectionFactory(RedisConfig.java:76)
at com.lanweihong.hotel.pms.redis.RedisConfig$$EnhancerBySpringCGLIB$$7294383d.CGLIB$jedisConnectionFactory$1(<generated>)
at com.lanweihong.hotel.pms.redis.RedisConfig$$EnhancerBySpringCGLIB$$7294383d$$FastClassBySpringCGLIB$$bdb4410c.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:365)
at com.lanweihong.hotel.pms.redis.RedisConfig$$EnhancerBySpringCGLIB$$7294383d.jedisConnectionFactory(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
... 90 common frames omitted
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at redis.clients.jedis.Connection.connect(Connection.java:184)
... 114 common frames omitted
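Problem analysis
The error shows that the application cannot reach the Redis cluster, so this is most likely a network problem. Time to troubleshoot.
The same application on other nodes reaches the Redis cluster and runs normally, which rules out the application image;
On the faulty node, the physical network was checked: the Redis cluster hosts answer ping, and telnet to the Redis ports succeeds, which rules out the node's physical network;
With the image and the node's physical network both ruled out, and since the application is deployed on Kubernetes, I suspected a forwarding fault in the cluster network. A search turned up a similar case in the article "iptables snat规则缺失,kubernetes集群问题node上所有容器无法ping通外网" (roughly: "Missing iptables SNAT rule: no container on a Kubernetes node can ping the external network"). The problem may therefore lie in iptables: packets from containers are not SNATed when they leave the node.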
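The ping and telnet checks above were run from the node itself. Since the suspicion is that only container traffic fails, the same kind of probe can be repeated from inside an affected container. The following is a minimal sketch, not the procedure used in this write-up: the function name tcp_probe is made up for illustration, and it assumes bash (for /dev/tcp) and the coreutils timeout command are available wherever it runs.

```shell
# Hypothetical helper: return 0 if a TCP connection to host:port
# succeeds within the given number of seconds (default 3).
# Assumes bash and coreutils `timeout` are installed.
tcp_probe() {
  host="$1"
  port="$2"
  secs="${3:-3}"
  # bash treats /dev/tcp/HOST/PORT as "open a TCP connection";
  # the connection is closed again when the child bash exits.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (placeholder address and port, replace with a real Redis node):
#   tcp_probe 192.0.2.10 6379 && echo reachable || echo unreachable
```

If the probe succeeds on the node but times out inside a container on that node, the fault is more likely in how container traffic is forwarded than in the physical network.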
So I compared the iptables rules on the faulty node with those on a healthy node and found that the faulty node was missing one rule:
-A POSTROUTING -s 10.244.0.0/16 ! -o docker0 -j MASQUERADE
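One way to make such a comparison repeatable is to dump the nat table on each node with iptables-save -t nat and list the rules that appear only in the healthy dump. The sketch below assumes the two dumps have already been saved to files; the function name missing_rules and the file names are hypothetical.

```shell
# Hypothetical helper: print the "-A ..." rule lines that exist in the
# healthy node's dump but not in the faulty node's dump.
# Both files are expected to come from: iptables-save -t nat > FILE
missing_rules() {
  healthy="$1"
  faulty="$2"
  # Keep only rule lines from the healthy dump, then drop every line
  # that appears verbatim (-F fixed string, -x whole line) in the faulty dump.
  grep -e '^-A' "$healthy" | grep -F -x -v -f "$faulty"
}

# Example:
#   missing_rules healthy-nat.rules faulty-nat.rules
```

The comparison is purely textual, so rules that are equivalent but written differently would still be reported as missing.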
Solution
I added the missing rule, after which the problem disappeared and the application ran normally.
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -o docker0 -j MASQUERADE
Here -s specifies the virtual network range assigned to the nodes for their containers; in my case it is 10.244.0.0/16.
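Running the -A command a second time would append a duplicate rule. A possible way to avoid that is to test for the rule first with iptables -C (check) and append only when it is absent. This is a sketch under assumptions: the function name ensure_masquerade and the IPTABLES override variable are introduced here for illustration and are not part of the original fix.

```shell
# IPTABLES may be overridden (for example for a dry run); defaults to the real binary.
IPTABLES="${IPTABLES:-iptables}"

# Hypothetical helper: add the MASQUERADE rule only if it is not already there.
#   $1: source network, e.g. 10.244.0.0/16
#   $2: bridge interface to exclude, e.g. docker0
ensure_masquerade() {
  cidr="$1"
  bridge="$2"
  # -C exits 0 when an identical rule already exists, non-zero otherwise.
  if "$IPTABLES" -t nat -C POSTROUTING -s "$cidr" ! -o "$bridge" -j MASQUERADE 2>/dev/null; then
    echo "rule already present"
  else
    "$IPTABLES" -t nat -A POSTROUTING -s "$cidr" ! -o "$bridge" -j MASQUERADE &&
      echo "rule added"
  fi
}

# Example (needs root):
#   ensure_masquerade 10.244.0.0/16 docker0
```

Rules added this way live only in the running kernel and are lost on reboot unless they are saved and restored by some other mechanism.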
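As for why this iptables rule was missing, I do not know. It may have been deleted by mistake during an earlier cleanup of the flannel network, or there may be some other cause. If anyone has solved a similar problem, I would appreciate any pointers.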