Project Outage Caused by a Solr Cluster Failure, and How We Fixed It

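Problem description

After a k8s node upgrade, our Solr cluster failed. Our xdbsearch service depends on the Solr cluster, so the xdbsearch pod could not run properly. The other pods that depend on xdbsearch failed in turn, which made the whole project unavailable.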

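Background

We use Lens to operate K8s; it sets up port forwarding automatically.
To open the Solr admin panel, go to Network > Services > Solr Cluster Headless. The forwarded port changes periodically.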

Error messages

Error log from the Solr Admin Panel

org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1395)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1239)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1172)
    at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:191)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: No registered leader was found after waiting for 1560000ms , collection: _master_school_bisbbudapest_index slice: shard1 saw state=DocCollection(_master_school_bisbbudapest_index//collections/_master_school_bisbbudapest_index/state.json/38)={
  "pullReplicas":"0",
  "replicationFactor":"3",
  "shards":{"shard1":{
      "range":"80000000-7fffffff",
      "state":"active",
      "replicas":{
        "core_node2":{
          "core":"_master_school_bisbbudapest_index_shard1_replica_n1",
          "base_url":"http://solr-cluster-0.solr-cluster-headless.rome:8983/solr",
          "node_name":"solr-cluster-0.solr-cluster-headless.rome:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node5":{
          "core":"_master_school_bisbbudapest_index_shard1_replica_n3",
          "base_url":"http://solr-cluster-0.solr-cluster-headless.rome:8983/solr",
          "node_name":"solr-cluster-0.solr-cluster-headless.rome:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node6":{
          "core":"_master_school_bisbbudapest_index_shard1_replica_n4",
          "base_url":"http://solr-cluster-2.solr-cluster-headless.rome:8983/solr",
          "node_name":"solr-cluster-2.solr-cluster-headless.rome:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"}}}},
  "router":{"name":"compositeId"},
  "maxShardsPerNode":"1",
  "autoAddReplicas":"false",
  "nrtReplicas":"3",
  "tlogReplicas":"0"} with live_nodes=[solr-cluster-2.solr-cluster-headless.rome:8983_solr, solr-cluster-1.solr-cluster-headless.rome:8983_solr, solr-cluster-0.solr-cluster-headless.rome:8983_solr]
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:981)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderUrl(ZkStateReader.java:926)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1363)
    ... 7 more

In the xdbsearch pod log we saw the following error:

Health check "XConnect SolrCloud cluster live collection health check" completed after 34.1859ms with status Unhealthy and '"SolrCloud Cluster is unhealthy. The response: {
\"responseHeader\":{
\"status\":0,
\"QTime\":31},
\"cluster\":{
\"collections\":{
\"sitecore_xdb_internal\":{
\"pullReplicas\":\"0\",
\"replicationFactor\":\"1\",
\"shards\":{\"shard1\":{
\"range\":\"80000000-7fffffff\",
\"state\":\"active\",
\"replicas\":{\"core_node2\":{
\"core\":\"sitecore_xdb_internal_shard1_replica_n1\",
\"base_url\":\"http://solr-cluster-1.solr-cluster-headless.rome:8983/solr\",
\"node_name\":\"solr-cluster-1.solr-cluster-headless.rome:8983_solr\",
\"state\":\"down\",
\"type\":\"NRT\",
\"force_set_state\":\"false\",
\"leader\":\"true\"}}}},
\"router\":{\"name\":\"compositeId\"},
\"maxShardsPerNode\":\"1\",
\"autoAddReplicas\":\"false\",
\"nrtReplicas\":\"1\",
\"tlogReplicas\":\"0\",
\"znodeVersion\":14,
\"configName\":\"sitecore_xdb_internal_config\"}},
\"aliases\":{
\"sitecore_xdb\":\"sitecore_xdb_internal\",
\"sitecore_xdb_rebuild\":\"sitecore_xdb_rebuild_internal\"},
\"live_nodes\":[\"solr-cluster-2.solr-cluster-headless.rome:8983_solr\",
\"solr-cluster-1.solr-cluster-headless.rome:8983_solr\",
\"solr-cluster-0.solr-cluster-headless.rome:8983_solr\"]}}

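This message shows that the SolrCloud cluster health check failed. Specifically, the only replica of the sitecore_xdb_internal collection (core_node2 on solr-cluster-1) is in state down, even though all three Solr nodes appear in live_nodes.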
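
The check described above can be sketched in Python. This is a minimal illustration, not the health check that xdbsearch actually runs: the function name find_unhealthy_replicas is hypothetical, and the sample dictionary is a trimmed copy of the response in the log, assumed to have the same shape as a CLUSTERSTATUS response.

```python
def find_unhealthy_replicas(cluster_status):
    """Return (collection, shard, replica, state) for every replica whose state is not 'active'."""
    unhealthy = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state", "unknown")
                if state != "active":
                    unhealthy.append((coll_name, shard_name, replica_name, state))
    return unhealthy


# Trimmed copy of the response from the xdbsearch log; fields not needed here are omitted.
sample_status = {
    "cluster": {
        "collections": {
            "sitecore_xdb_internal": {
                "shards": {
                    "shard1": {
                        "state": "active",
                        "replicas": {
                            "core_node2": {
                                "core": "sitecore_xdb_internal_shard1_replica_n1",
                                "node_name": "solr-cluster-1.solr-cluster-headless.rome:8983_solr",
                                "state": "down",
                                "leader": "true",
                            }
                        },
                    }
                }
            }
        }
    }
}

for entry in find_unhealthy_replicas(sample_status):
    print(entry)
```

Note that the shard itself reports state active while its replica is down, so the replica level is the one to inspect.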

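Solution

One possible fix is to check that every Solr instance is alive and that the state stored in ZooKeeper is up to date. If any Solr instance is dead, you can remove its entry with ZooKeeper's zkCli.sh script and then restart Solr.
In our case, we recovered by deleting and recreating the affected collection. However, the alias sitecore_xdb points to the sitecore_xdb_internal collection, so we could not delete the collection directly. We first had to delete the alias sitecore_xdb, and only then delete the sitecore_xdb_internal collection and its config.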
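
To find out which aliases block the deletion of a collection, the aliases map from the cluster status can be inspected. The sketch below assumes the map has the shape shown in the xdbsearch log (alias name mapped to a collection name) and additionally assumes that an alias covering several collections lists them comma-separated. The function name aliases_pointing_to is hypothetical.

```python
def aliases_pointing_to(cluster_status, collection):
    """Return the sorted names of all aliases that reference the given collection."""
    aliases = cluster_status.get("cluster", {}).get("aliases", {})
    matches = []
    for alias, targets in aliases.items():
        # Assumption: multi-collection aliases are stored as "coll_a,coll_b".
        target_names = [name.strip() for name in targets.split(",")]
        if collection in target_names:
            matches.append(alias)
    return sorted(matches)


# The aliases section copied from the xdbsearch log.
alias_sample = {
    "cluster": {
        "aliases": {
            "sitecore_xdb": "sitecore_xdb_internal",
            "sitecore_xdb_rebuild": "sitecore_xdb_rebuild_internal",
        }
    }
}

print(aliases_pointing_to(alias_sample, "sitecore_xdb_internal"))
```

Every alias returned here has to be deleted before the collection, and recreated afterwards.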

Here are the steps we took:

  1. Delete the alias sitecore_xdb:
/solr/admin/collections?action=DELETEALIAS&name=sitecore_xdb
  2. Delete the sitecore_xdb_internal collection and its config ({0} is the forwarded Solr port, {1} is the collection name):
$deleteCollection = "http://localhost:{0}/solr/admin/collections?action=DELETE&name={1}"
$deleteConfig = "http://localhost:{0}/solr/admin/configs?action=DELETE&name={1}_config"
  3. Recreate the sitecore_xdb_internal config and collection (replicationFactor is the number of replicas per shard; we set it to 3 to match our three Solr instances):
$configUrl = "http://localhost:{0}/solr/admin/configs?action=CREATE&name={1}_config&baseConfigSet=sitecore_master_index_config&configSetProp.immutable=false&wt=xml&omitHeader=true"
$collectionUrl = "http://localhost:{0}/solr/admin/collections?action=CREATE&name={1}&collection.configName={1}_config&numShards=1&replicationFactor=3"
  4. Recreate the alias:
/solr/admin/collections?action=CREATEALIAS&name=sitecore_xdb&collections=sitecore_xdb_internal
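
The full sequence can be generated in order with a small Python helper, which makes it harder to skip a step or run one out of order. This is a sketch only: the function name build_recovery_urls is hypothetical, port 8983 in the usage line is a placeholder for whatever port Lens currently forwards, and the code only builds the URLs without sending any request.

```python
from urllib.parse import urlencode


def build_recovery_urls(port, collection, alias,
                        base_config_set="sitecore_master_index_config",
                        replication_factor=3, host="localhost"):
    """Return the six admin URLs from the steps above, in the order they must be called."""
    base = "http://{0}:{1}/solr/admin".format(host, port)
    config = collection + "_config"

    def url(endpoint, params):
        # urlencode keeps the insertion order of the dict, so the URLs are stable.
        return "{0}/{1}?{2}".format(base, endpoint, urlencode(params))

    return [
        # 1. The alias must go first, otherwise the collection cannot be deleted.
        url("collections", {"action": "DELETEALIAS", "name": alias}),
        # 2. Delete the collection, then its config.
        url("collections", {"action": "DELETE", "name": collection}),
        url("configs", {"action": "DELETE", "name": config}),
        # 3. Recreate the config from the base config set, then the collection.
        url("configs", {"action": "CREATE", "name": config,
                        "baseConfigSet": base_config_set,
                        "configSetProp.immutable": "false",
                        "wt": "xml", "omitHeader": "true"}),
        url("collections", {"action": "CREATE", "name": collection,
                            "collection.configName": config,
                            "numShards": 1,
                            "replicationFactor": replication_factor}),
        # 4. Point the alias at the new collection.
        url("collections", {"action": "CREATEALIAS", "name": alias,
                            "collections": collection}),
    ]


# 8983 is a placeholder; use the port that Lens is forwarding at the moment.
recovery_urls = build_recovery_urls(8983, "sitecore_xdb_internal", "sitecore_xdb")
for step_url in recovery_urls:
    print(step_url)
```

Printing the URLs first and calling them one at a time leaves room to check the admin panel between steps.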

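Other notes

If the solr-cluster-headless port stops updating and the panel becomes unreachable, close Lens and reopen it (use File -> Exit; clicking the x in the top-right corner does not seem to work).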

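After the xdbsearch pod recovers, other pods such as cm and cd may hit similar problems. These are also Solr collection errors, and deleting and recreating the affected collections fixes them as well.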
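
To see which other collections are likely to need the same treatment, one option is to look for shards that have no active leader, which is the condition behind the "No registered leader was found" error. The sketch below assumes a cluster status shaped like the ones in the logs, where a leader replica carries "leader":"true". The function name shards_without_active_leader and the collection example_healthy_index are hypothetical.

```python
def shards_without_active_leader(cluster_status):
    """Return (collection, shard) pairs where no replica is both leader and active."""
    missing = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {}).values()
            has_leader = any(
                r.get("leader") == "true" and r.get("state") == "active"
                for r in replicas
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return missing


leader_sample = {
    "cluster": {
        "collections": {
            # Hypothetical healthy collection, added for contrast.
            "example_healthy_index": {
                "shards": {"shard1": {"replicas": {
                    "core_node1": {"state": "active", "leader": "true"},
                }}}
            },
            # Replica states taken from the Solr Admin Panel error log.
            "_master_school_bisbbudapest_index": {
                "shards": {"shard1": {"replicas": {
                    "core_node2": {"state": "down"},
                    "core_node5": {"state": "down"},
                    "core_node6": {"state": "down"},
                }}}
            },
        }
    }
}

print(shards_without_active_leader(leader_sample))
```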

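Conclusion

With the steps above, we resolved the project outage caused by the Solr cluster failure. The lesson for us is to be more careful when upgrading k8s nodes or doing anything else that can affect the Solr cluster, and to make sure such operations do not disrupt it. We also need to understand Solr cluster administration and maintenance well enough to find the cause quickly and fix it effectively when something goes wrong. Hope this post helps!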

Viagle Blog