A service "fake death" (hang) problem

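I wrote a service that calls es for CRUD operations. A few days ago it had a "fake death" incident: the service received no traffic at all, requests could not get in, and many requests piled up in the rpc service queue and timed out. After ten-odd minutes the symptom cleared up on its own.
At the time the rpc service status looked like this: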

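This represents traffic. For a stretch in the middle there was none and requests could not get in, but my service had not died and the process was still there, which is why I call it a fake death.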

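I did not have my computer with me at the time and asked a colleague to take a look, so many things cannot be reproduced. I only have a few data points:
My own timeout log shows that at that moment many requests were very slow (both the took value returned by the es query and the time spent in the actionGet call were long). For a while many operations took tens of seconds, some even one or two minutes.
GC frequency at the time was not high and was almost all young-generation GC. There were old-generation GCs too, but only about ten, and there was no concurrent mode failure. qps was also within the normal range. The log recorded some slow queries and updates, but when I re-ran them they were not actually slow. Perhaps they were only recorded as slow because the requests had timed out, so this clue does not seem reliable.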
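Since the slow calls above were blocking waits on a future, here is a minimal stdlib-only sketch of bounding such a wait and recording how long the caller actually waited. It uses a plain `java.util.concurrent.Future` as a stand-in; `BoundedWaitSketch`, `Timed` and `getWithTimeout` are hypothetical names for illustration. As far as I know, the es `ActionFuture.actionGet` has overloads that accept a timeout, which would play the same role as `future.get(timeout, unit)` here.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {

    /** Outcome of one bounded wait: the value (null on timeout) and the time the caller spent waiting. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;
        public final boolean timedOut;

        Timed(T value, long elapsedMillis, boolean timedOut) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
            this.timedOut = timedOut;
        }
    }

    /** Waits at most timeoutMillis for the future instead of blocking indefinitely. */
    public static <T> Timed<T> getWithTimeout(Future<T> future, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        long start = System.nanoTime();
        try {
            T value = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Timed<>(value, elapsedMillis(start), false);
        } catch (TimeoutException e) {
            // Give up on this call so the calling thread is free to serve other requests.
            future.cancel(true);
            return new Timed<>(null, elapsedMillis(start), true);
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // A single-threaded executor plays the role of a slow backend.
        ExecutorService backend = Executors.newSingleThreadExecutor();
        try {
            Future<String> slow = backend.submit(() -> {
                Thread.sleep(5000);
                return "late";
            });
            Timed<String> result = getWithTimeout(slow, 200);
            System.out.println("timedOut=" + result.timedOut + " waitedMs=" + result.elapsedMillis);
        } finally {
            backend.shutdownNow();
        }
    }
}
```

With a bounded wait, a slow backend shows up as explicit timeouts in the log rather than as rpc worker threads stuck for minutes, which is the pile-up pattern described above.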

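Another possibility: the connections between the service layer and the es cluster were exhausted, so requests to the es cluster blocked for that period? My client is written like this: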

// Elasticsearch 2.x imports. Logger, LogManager, StringUtils, AppConfig and Path
// come from the project's own dependencies and are not shown here.
import java.net.InetAddress;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.plugin.deletebyquery.DeleteByQueryPlugin;

public class ClientManager {

    private static Logger logger = LogManager.getLogger(ClientManager.class.getName());

    private static final String CLUSTER_NAME = "cluster.name";
    private static final String ES_SERVICES = "es.services";

    private Client client;

    // Initialization-on-demand holder: the JVM creates INSTANCE once, lazily and thread-safely.
    private static class ClientManagerHolder {
        private ClientManagerHolder() {
        }

        private static final ClientManager INSTANCE = new ClientManager();
    }

    public static Client getClient() {
        return ClientManagerHolder.INSTANCE.client;
    }

    private ClientManager() {
        if (client == null) {
            createClient();
        }
    }

    private void createClient() {
        // Load the application config before reading the es settings.
        String configPath = Path.getCurrentPath() + "/../config/app.properties";
        logger.info("######## appConfig config file path: " + configPath);
        AppConfig.init(configPath);

        try {
            String clusterName = AppConfig.getProperty(CLUSTER_NAME);
            String services = AppConfig.getProperty(ES_SERVICES);
            logger.debug("es.services:" + services);
            logger.debug("clusterName:" + clusterName);
            Settings settings = Settings.settingsBuilder().put("cluster.name", clusterName)
                    .put("client.transport.sniff", true).put("client.transport.ignore_cluster_name", true)
                    .put("client.transport.ping_timeout", "1s").put("client.transport.nodes_sampler_interval", "1s")
                    .build();
            // add delete-by-query plugin
            TransportClient c = TransportClient.builder().settings(settings).addPlugin(DeleteByQueryPlugin.class)
                    .build();
            if (StringUtils.isNotBlank(services)) {
                // es.services is a comma-separated list of host:port pairs.
                // split() returns String[], not String (fixed from the original).
                String[] servicesArray = services.split(",");
                for (String service : servicesArray) {
                    String[] serviceInfo = service.split(":");
                    if (serviceInfo.length > 1) {
                        c = c.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(serviceInfo[0]),
                                Integer.parseInt(serviceInfo[1])));
                    }
                }
                client = c;
                logger.info("connect to es cluster success.");
            } else {
                logger.error(" has no services info.");
            }
        } catch (Exception e) {
            logger.error("create es client failed.", e);
        }
    }
}

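It is basically a singleton, but I am not sure whether the es client does connection pooling or closes anything per request. In any case I never call close manually, and I wonder whether that kept connection pool resources from being released.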
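On the close question, my understanding (not verified against this exact es version) is that a client like this is meant to be created once, shared across threads, and closed only when the application shuts down, not per request. A minimal sketch of that lifecycle, using a hypothetical `FakeClient` stand-in instead of the real `TransportClient` so that it runs without es on the classpath:

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;

public class SharedClientSketch {

    /** Hypothetical stand-in for a long-lived, thread-safe client. */
    public static final class FakeClient implements Closeable {
        private final AtomicBoolean closed = new AtomicBoolean(false);

        public boolean isClosed() {
            return closed.get();
        }

        @Override
        public void close() {
            // Safe to call more than once.
            closed.set(true);
        }
    }

    // Same holder idiom as ClientManager: created lazily, exactly once.
    private static final class Holder {
        static final FakeClient INSTANCE = create();
    }

    private static FakeClient create() {
        FakeClient client = new FakeClient();
        // Release the client's resources once, when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread(client::close, "client-shutdown"));
        return client;
    }

    public static FakeClient getClient() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        FakeClient a = getClient();
        FakeClient b = getClient();
        System.out.println("same instance: " + (a == b));
        System.out.println("closed while running: " + a.isClosed());
    }
}
```

If this reading is right, never calling close during normal operation would not by itself leak connections. The shutdown hook only covers a clean process exit.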

Here is the pool-related part of my es configuration as well:
threadpool:
  index:
    type: fixed
    size: 24
    queue_size: 500
  bulk:
    type: fixed
    size: 24
    queue_size: 500
action.write_consistency: one
index.store.type: mmapfs
indices.memory.index_buffer_size: 10%
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 500mb
index.translog.flush_threshold_period: 10m
indices.memory.min_translog_buffer_size: 512m
indices.memory.max_translog_buffer_size: 512m
indices.queries.cache.size: 512m
indices.queries.cache.count: 5000
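To make the `fixed` pool settings above concrete: a fixed pool with `size` threads and `queue_size` slots accepts at most size + queue_size outstanding tasks and rejects the rest. The sketch below reproduces that behavior with a plain `ThreadPoolExecutor`. It is an analogy only, not the es implementation, and `FixedPoolSketch` and `countRejections` are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class FixedPoolSketch {

    /** Submits `tasks` jobs that all block, and returns how many the pool rejected. */
    public static int countRejections(int size, int queueSize, int tasks) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                size, size, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy()); // reject when threads and queue are both full
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            // Simulate a slow operation that holds the worker thread.
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        // 2 threads + 3 queue slots hold 5 tasks, so 3 of 8 are rejected.
        System.out.println(countRejections(2, 3, 8)); // prints 3
    }
}
```

By the same arithmetic, size 24 with queue_size 500 means up to 524 index or bulk operations can be outstanding per node before rejections start, so slow operations can sit in the queue for a long time without any error being reported.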

Hoping someone experienced can help analyze this.

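One more question while I am at it: does TransportClient use a connection pool by default? Since I made it a singleton I never close it. Is that a problem?
I also noticed recently that there seems to be an official RestClient as well, and I have heard that the tcp approach is no longer recommended. I am not sure which one people mostly use these days.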

7 replies
