Problem Description

A Spring Boot application deployed on one node of a Kubernetes cluster failed to connect to a Redis cluster with a "connect timed out" error. The full stack trace:

Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.data.redis.connection.jedis.JedisConnectionFactory]: Factory method 'jedisConnectionFactory' threw exception; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: connect timed out
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:185)
    at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:582)
    ... 89 common frames omitted
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: connect timed out
    at redis.clients.jedis.Connection.connect(Connection.java:207)
    at redis.clients.jedis.BinaryClient.connect(BinaryClient.java:93)
    at redis.clients.jedis.Connection.sendCommand(Connection.java:126)
    at redis.clients.jedis.Connection.sendCommand(Connection.java:117)
    at redis.clients.jedis.BinaryClient.auth(BinaryClient.java:564)
    at redis.clients.jedis.BinaryJedis.auth(BinaryJedis.java:2138)
    at redis.clients.jedis.JedisClusterConnectionHandler.initializeSlotsCache(JedisClusterConnectionHandler.java:36)
    at redis.clients.jedis.JedisClusterConnectionHandler.<init>(JedisClusterConnectionHandler.java:17)
    at redis.clients.jedis.JedisSlotBasedConnectionHandler.<init>(JedisSlotBasedConnectionHandler.java:24)
    at redis.clients.jedis.BinaryJedisCluster.<init>(BinaryJedisCluster.java:54)
    at redis.clients.jedis.JedisCluster.<init>(JedisCluster.java:93)
    at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.createCluster(JedisConnectionFactory.java:423)
    at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.createCluster(JedisConnectionFactory.java:393)
    at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.afterPropertiesSet(JedisConnectionFactory.java:350)
    at com.lanweihong.hotel.pms.redis.RedisConfig.jedisConnectionFactory(RedisConfig.java:76)
    at com.lanweihong.hotel.pms.redis.RedisConfig$$EnhancerBySpringCGLIB$$7294383d.CGLIB$jedisConnectionFactory$1(<generated>)
    at com.lanweihong.hotel.pms.redis.RedisConfig$$EnhancerBySpringCGLIB$$7294383d$$FastClassBySpringCGLIB$$bdb4410c.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
    at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:365)
    at com.lanweihong.hotel.pms.redis.RedisConfig$$EnhancerBySpringCGLIB$$7294383d.jedisConnectionFactory(<generated>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
    ... 90 common frames omitted
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at redis.clients.jedis.Connection.connect(Connection.java:184)
    ... 114 common frames omitted

Problem Analysis

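The error shows that the application cannot reach the Redis cluster, most likely because of a network problem, so that is where I started troubleshooting.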

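  1. The same application on other nodes reaches the Redis cluster and runs normally, which rules out the application image;

  2. On the faulty node itself, the Redis cluster hosts respond to ping and telnet to the Redis ports succeeds, which rules out the node's physical network;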
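
The node-level check in step 2 can be sketched as a small shell helper. This is a minimal sketch, not the commands used in the original investigation: tcp_check and check_all are hypothetical helper names, the addresses in the usage comment are placeholders, and it assumes bash and the coreutils timeout command are available on the node.

```shell
# tcp_check HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT can be opened, non-zero otherwise.
# Uses bash's built-in /dev/tcp pseudo-device, so telnet or nc is not required.
tcp_check() {
    local host=$1 port=$2 secs=${3:-3}
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# check_all PORT HOST...
# Prints one status line per host.
check_all() {
    local port=$1 h
    shift
    for h in "$@"; do
        if tcp_check "$h" "$port"; then
            echo "$h:$port reachable"
        else
            echo "$h:$port UNREACHABLE"
        fi
    done
}

# Example (placeholder addresses; 6379 is the default Redis port):
#   check_all 6379 192.0.2.11 192.0.2.12 192.0.2.13
```

A success here only shows that the node itself can reach Redis. Traffic from a container passes through the node's NAT rules on its way out, so the same check run inside the container can still fail.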

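With both the application image and the node's physical network ruled out, and since the application is deployed on Kubernetes, I suspected a packet-forwarding fault in the cluster network. Searching for similar reports, I found an article titled "iptables snat规则缺失,kubernetes集群问题node上所有容器无法ping通外网" (a missing iptables SNAT rule leaves every container on the affected node unable to ping external networks), which describes a similar symptom. That pointed to iptables: packets from the containers were leaving the node without SNAT, so they still carried a container source address that hosts outside the cluster presumably could not route replies to.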

So I compared the iptables rules on the faulty node with those on a healthy node and found that the faulty node was missing one rule:

-A POSTROUTING -s 10.244.0.0/16 ! -o docker0 -j MASQUERADE
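
One way to make this comparison repeatable is to dump the nat table on each node with iptables-save -t nat and list the rules that the healthy node has but the faulty node lacks. The following is a minimal sketch, assuming the two dumps were saved to files beforehand; missing_rules is a hypothetical helper and the file names in the usage comment are placeholders.

```shell
# missing_rules GOOD_DUMP BAD_DUMP [CHAIN]
# Prints every "-A CHAIN ..." rule found in GOOD_DUMP that has no
# identical line in BAD_DUMP. CHAIN defaults to POSTROUTING.
missing_rules() {
    local good=$1 bad=$2 chain=${3:-POSTROUTING} tmp
    tmp=$(mktemp) || return 1
    # Collect the faulty node's rules for this chain (the file may end up empty).
    grep -e "^-A $chain " "$bad" > "$tmp" || true
    # Keep only the healthy node's rules that do not appear verbatim in that list.
    grep -e "^-A $chain " "$good" | grep -Fxv -f "$tmp" || true
    rm -f "$tmp"
}

# Example (run iptables-save as root on each node first):
#   iptables-save -t nat > nat-good.txt    # on the healthy node
#   iptables-save -t nat > nat-bad.txt     # on the faulty node
#   missing_rules nat-good.txt nat-bad.txt
```

Some differences between nodes may be legitimate, for example rules that mention a per-node subnet, so the output still needs to be reviewed by hand.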

Solution

After I added the rule back, the problem disappeared and the application ran normally.

iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -o docker0 -j MASQUERADE

Here -s specifies the virtual network range allocated to the nodes; in my case it is 10.244.0.0/16.
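
Because -A appends a new copy of the rule every time it runs, a guarded version can be handy if the command ends up in a script. The sketch below is one possible approach, not the author's method: it uses iptables -C to test whether the rule already exists and appends it only when the check fails. ensure_masquerade and the IPTABLES override variable are hypothetical names introduced for illustration, and the CIDR and bridge name must match your own cluster.

```shell
# Command used to talk to iptables. Override it for a dry run,
# e.g. IPTABLES=echo, to print the commands instead of applying them.
IPTABLES=${IPTABLES:-iptables}

# ensure_masquerade POD_CIDR [BRIDGE]
# Adds the MASQUERADE rule for POD_CIDR only if it is not already present.
# BRIDGE defaults to docker0, as in the rule above. Requires root.
ensure_masquerade() {
    local cidr=$1 bridge=${2:-docker0}
    if "$IPTABLES" -t nat -C POSTROUTING -s "$cidr" ! -o "$bridge" -j MASQUERADE 2>/dev/null; then
        echo "rule already present"
    else
        "$IPTABLES" -t nat -A POSTROUTING -s "$cidr" ! -o "$bridge" -j MASQUERADE &&
            echo "rule added"
    fi
}

# Example:
#   ensure_masquerade 10.244.0.0/16
```

A rule added this way exists only in the running kernel tables and is lost on reboot unless it is saved or re-applied by whatever component normally manages it.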

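As for why this iptables rule went missing, I do not know. It may have been deleted by accident during an earlier cleanup of the flannel network, or there may be another cause. If anyone has solved a similar problem, I would appreciate any pointers.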
