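After a k8s node upgrade, our Solr cluster failed. Our xdbsearch service depends on the Solr cluster, so the failure left the xdbsearch pod unable to run. The other pods that depend on xdbsearch could not run either, which made the whole project unavailable.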
We use Lens to operate K8s, and it maps ports automatically.
Open the Solr admin panel from Network > Services > Solr Cluster Headless. The mapped port changes periodically.
Error log from the Solr Admin Panel:
org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1395)
at org.apache.solr.cloud.ZkController.register(ZkController.java:1239)
at org.apache.solr.cloud.ZkController.register(ZkController.java:1172)
at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:191)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: No registered leader was found after waiting for 1560000ms , collection: _master_school_bisbbudapest_index slice: shard1 saw state=DocCollection(_master_school_bisbbudapest_index//collections/_master_school_bisbbudapest_index/state.json/38)={
"pullReplicas":"0",
"replicationFactor":"3",
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node2":{
"core":"_master_school_bisbbudapest_index_shard1_replica_n1",
"base_url":"http://solr-cluster-0.solr-cluster-headless.rome:8983/solr",
"node_name":"solr-cluster-0.solr-cluster-headless.rome:8983_solr",
"state":"down",
"type":"NRT",
"force_set_state":"false"},
"core_node5":{
"core":"_master_school_bisbbudapest_index_shard1_replica_n3",
"base_url":"http://solr-cluster-0.solr-cluster-headless.rome:8983/solr",
"node_name":"solr-cluster-0.solr-cluster-headless.rome:8983_solr",
"state":"down",
"type":"NRT",
"force_set_state":"false"},
"core_node6":{
"core":"_master_school_bisbbudapest_index_shard1_replica_n4",
"base_url":"http://solr-cluster-2.solr-cluster-headless.rome:8983/solr",
"node_name":"solr-cluster-2.solr-cluster-headless.rome:8983_solr",
"state":"down",
"type":"NRT",
"force_set_state":"false"}}}},
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"nrtReplicas":"3",
"tlogReplicas":"0"} with live_nodes=[solr-cluster-2.solr-cluster-headless.rome:8983_solr, solr-cluster-1.solr-cluster-headless.rome:8983_solr, solr-cluster-0.solr-cluster-headless.rome:8983_solr]
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:981)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderUrl(ZkStateReader.java:926)
at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1363)
... 7 more
We saw the following error in the xdbsearch pod's log:
Health check "XConnect SolrCloud cluster live collection health check" completed after 34.1859ms with status Unhealthy and '"SolrCloud Cluster is unhealthy. The response: {
\"responseHeader\":{
\"status\":0,
\"QTime\":31},
\"cluster\":{
\"collections\":{
\"sitecore_xdb_internal\":{
\"pullReplicas\":\"0\",
\"replicationFactor\":\"1\",
\"shards\":{\"shard1\":{
\"range\":\"80000000-7fffffff\",
\"state\":\"active\",
\"replicas\":{\"core_node2\":{
\"core\":\"sitecore_xdb_internal_shard1_replica_n1\",
\"base_url\":\"http://solr-cluster-1.solr-cluster-headless.rome:8983/solr\",
\"node_name\":\"solr-cluster-1.solr-cluster-headless.rome:8983_solr\",
\"state\":\"down\",
\"type\":\"NRT\",
\"force_set_state\":\"false\",
\"leader\":\"true\"}}}},
\"router\":{\"name\":\"compositeId\"},
\"maxShardsPerNode\":\"1\",
\"autoAddReplicas\":\"false\",
\"nrtReplicas\":\"1\",
\"tlogReplicas\":\"0\",
\"znodeVersion\":14,
\"configName\":\"sitecore_xdb_internal_config\"}},
\"aliases\":{
\"sitecore_xdb\":\"sitecore_xdb_internal\",
\"sitecore_xdb_rebuild\":\"sitecore_xdb_rebuild_internal\"},
\"live_nodes\":[\"solr-cluster-2.solr-cluster-headless.rome:8983_solr\",
\"solr-cluster-1.solr-cluster-headless.rome:8983_solr\",
\"solr-cluster-0.solr-cluster-headless.rome:8983_solr\"]}}
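This error shows that the SolrCloud cluster health check failed. Specifically, the sitecore_xdb_internal collection has a single replica (core_node2 on solr-cluster-1, which is also marked as the shard leader), and that replica is in state down even though all three Solr nodes appear in live_nodes.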
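To make the same check repeatable, the cluster status response can be scanned for replicas that are not active. The following is a minimal sketch, not part of our original fix. It assumes the response has already been parsed from JSON into a Python dict (for example the output of the Collections API CLUSTERSTATUS action), and the names `UnhealthyReplica` and `find_unhealthy_replicas` are hypothetical. The sample data is a trimmed copy of the response logged above.

```python
from typing import Dict, List, NamedTuple


class UnhealthyReplica(NamedTuple):
    collection: str
    shard: str
    replica: str
    state: str
    node_name: str
    node_is_live: bool  # True when the hosting node is listed in live_nodes


def find_unhealthy_replicas(status: Dict) -> List[UnhealthyReplica]:
    """Return every replica whose state is anything other than "active"."""
    cluster = status.get("cluster", {})
    live_nodes = set(cluster.get("live_nodes", []))
    found = []
    for coll_name, coll in cluster.get("collections", {}).items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state", "unknown")
                if state != "active":
                    node = replica.get("node_name", "")
                    found.append(UnhealthyReplica(
                        coll_name, shard_name, replica_name,
                        state, node, node in live_nodes))
    return found


# Trimmed copy of the response from the xdbsearch health check log.
sample = {
    "cluster": {
        "collections": {
            "sitecore_xdb_internal": {
                "shards": {"shard1": {"state": "active", "replicas": {
                    "core_node2": {
                        "core": "sitecore_xdb_internal_shard1_replica_n1",
                        "node_name": "solr-cluster-1.solr-cluster-headless.rome:8983_solr",
                        "state": "down",
                        "leader": "true",
                    }}}},
            },
        },
        "live_nodes": [
            "solr-cluster-2.solr-cluster-headless.rome:8983_solr",
            "solr-cluster-1.solr-cluster-headless.rome:8983_solr",
            "solr-cluster-0.solr-cluster-headless.rome:8983_solr",
        ],
    }
}

for item in find_unhealthy_replicas(sample):
    print(item.collection, item.shard, item.replica, item.state,
          "node live" if item.node_is_live else "node not live")
```

A replica that is down while its node is live, as here, suggests the recorded state is stale rather than that the Solr instance itself is gone.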
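One possible fix is to check that all Solr instances are live and that the information in ZooKeeper is up to date. If any dead Solr instances are found, their entries can be removed with ZooKeeper's zkCli.sh script, after which Solr is restarted.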
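In our case, however, the alias sitecore_xdb points to the sitecore_xdb_internal collection, so we could not delete the collection directly. We first had to delete the alias sitecore_xdb, and only then delete the sitecore_xdb_internal collection and its config.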
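Before deleting a collection it helps to know which aliases still reference it. Here is a small sketch under the assumption that the `aliases` section of the cluster status maps each alias name to a comma-separated list of collection names, as in the response logged above. The helper name `aliases_pointing_to` is hypothetical.

```python
from typing import Dict, List


def aliases_pointing_to(aliases: Dict[str, str], collection: str) -> List[str]:
    """Return the alias names whose target list includes `collection`."""
    return sorted(
        alias
        for alias, targets in aliases.items()
        if collection in [t.strip() for t in targets.split(",")]
    )


# The "aliases" section copied from the health check response.
aliases = {
    "sitecore_xdb": "sitecore_xdb_internal",
    "sitecore_xdb_rebuild": "sitecore_xdb_rebuild_internal",
}

print(aliases_pointing_to(aliases, "sitecore_xdb_internal"))  # ['sitecore_xdb']
```

Every alias returned here has to be deleted before the collection, and recreated once the collection exists again.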
These are the steps we followed. In the URL templates, {0} is the local port and {1} is the collection name (here sitecore_xdb_internal).
1. Delete the alias sitecore_xdb:
/solr/admin/collections?action=DELETEALIAS&name=sitecore_xdb
2. Delete the sitecore_xdb_internal collection and its config:
$deleteCollection = "http://localhost:{0}/solr/admin/collections?action=DELETE&name={1}"
$deleteConfig = "http://localhost:{0}/solr/admin/configs?action=DELETE&name={1}_config"
3. Recreate the sitecore_xdb_internal collection and its config (replicationFactor is the number of replicas per shard; we set it to 3 to match the number of Solr instances):
$configUrl = "http://localhost:{0}/solr/admin/configs?action=CREATE&name={1}_config&baseConfigSet=sitecore_master_index_config&configSetProp.immutable=false&wt=xml&omitHeader=true"
$collectionUrl = "http://localhost:{0}/solr/admin/collections?action=CREATE&name={1}&collection.configName={1}_config&numShards=1&replicationFactor=3"
4. Recreate the alias sitecore_xdb:
/solr/admin/collections?action=CREATEALIAS&name=sitecore_xdb&collections=sitecore_xdb_internal
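The order of these calls matters, so it can be useful to generate them from one place. The sketch below only builds the URLs in the order described above and prints them; it does not send any request. The function name `recovery_steps` and the port 50123 are assumptions for illustration (Lens picks the real port), and the `wt=xml` and `omitHeader=true` parameters from the original config URL are left out because they only affect the response format.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def recovery_steps(port: int, collection: str, alias: str,
                   base_config_set: str,
                   replication_factor: int) -> List[Tuple[str, str]]:
    """Build (description, url) pairs in the order they should be called."""
    base = f"http://localhost:{port}/solr/admin"
    config = f"{collection}_config"

    def url(api: str, params: Dict[str, object]) -> str:
        return f"{base}/{api}?{urlencode(params)}"

    return [
        ("delete alias", url("collections", {
            "action": "DELETEALIAS", "name": alias})),
        ("delete collection", url("collections", {
            "action": "DELETE", "name": collection})),
        ("delete config", url("configs", {
            "action": "DELETE", "name": config})),
        ("create config", url("configs", {
            "action": "CREATE", "name": config,
            "baseConfigSet": base_config_set,
            "configSetProp.immutable": "false"})),
        ("create collection", url("collections", {
            "action": "CREATE", "name": collection,
            "collection.configName": config,
            "numShards": 1,
            "replicationFactor": replication_factor})),
        ("create alias", url("collections", {
            "action": "CREATEALIAS", "name": alias,
            "collections": collection})),
    ]


steps = recovery_steps(
    port=50123,  # hypothetical local port; use the one Lens currently maps
    collection="sitecore_xdb_internal",
    alias="sitecore_xdb",
    base_config_set="sitecore_master_index_config",
    replication_factor=3,
)
for description, step_url in steps:
    print(f"{description}: {step_url}")
```

Each URL could then be opened in a browser or fetched with `urllib.request.urlopen`, checking the response of one step before moving on to the next.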
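If the solr-cluster-headless port stops updating and the panel becomes unreachable, close Lens and open it again. Use File -> Exit; clicking the x in the top-right corner does not seem to work.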
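After the xdbsearch pod recovers, other pods such as cm and cd may show similar problems. These are also Solr collection errors, and deleting and recreating the affected collections fixes them as well.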
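With these steps we resolved the outage caused by the Solr cluster failure. The lesson for us is to be more careful with k8s node upgrades and any other operation that may affect the Solr cluster, and to make sure such operations do not disrupt it. We also need to know enough about managing and maintaining the Solr cluster to find the cause quickly and act effectively when something goes wrong. Hope this post helps!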