Rules

application alerts

14.029s ago

24.56ms

Rule State Error Last Evaluation Evaluation Time
alert: Server Busy expr: sum by(appName) (join_queue_count) - sum by(appName) (join_queue_count offset 1m) > 1000 for: 30s labels: app: android severity: warning type: application annotations: description: '{{ $labels.appName }} server busy count exceeded 10 within 1 minute, current value {{ $value }}' summary: server busy count ok 14.03s ago 17.57ms
alert: No Available Coordinator expr: sum by(appName, kubernetes_namespace) (game_busy_count{source="no avail coordinator",trigger="user"}) - sum by(appName, kubernetes_namespace) (game_busy_count{source="no avail coordinator",trigger="user"} offset 5m) > 0 for: 5m labels: app: android severity: warning type: application annotations: description: '{{ $labels.kubernetes_namespace }} {{ $labels.appName }} no avail coordinator occurred within 5 minutes, current value {{ $value }}' summary: no avail coordinator count ok 14.013s ago 196.7us
alert: Abnormal prod busy android count expr: sum(android_states_by_cr{kubernetes_namespace="prod",status="busy"}) < 0 for: 3m labels: app: android severity: warning type: application annotations: description: if this alert keeps firing, check the prepay instances summary: abnormal prod busy android count ok 14.012s ago 1.329ms
alert: Prepay instance not ready expr: not_ready_count - (not_ready_count offset 1m) > 0 for: 3m labels: app: android severity: warning type: application annotations: description: check the prepay instances in cluster {{ $labels.clusterName}} summary: prepay instance in not ready state ok 14.011s ago 93.47us
alert: Apiserver is down? expr: absent(apiserver_request_total) == 1 for: 3m labels: app: android severity: warning type: application annotations: description: apiserver may be unhealthy summary: no apiserver_request_total metrics ok 14.012s ago 5.358ms
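
For reference, the first row above would look roughly like this in a rule file. This is a sketch reconstructed from the flattened row: the group name is taken from the heading, the annotation text is given in English, and the surrounding file layout is an assumption because the original rule file is not shown.

```yaml
groups:
  - name: application alerts
    rules:
      - alert: Server Busy
        # Growth of join_queue_count per appName over the last minute.
        expr: sum by(appName) (join_queue_count) - sum by(appName) (join_queue_count offset 1m) > 1000
        for: 30s
        labels:
          app: android
          severity: warning
          type: application
        annotations:
          # The description mentions 10 occurrences while expr compares
          # against 1000; both are kept here as found in the source.
          description: '{{ $labels.appName }} server busy count exceeded 10 within 1 minute, current value {{ $value }}'
          summary: server busy count
```

The trailing columns of each row (`ok`, `14.03s ago`, `17.57ms`) are runtime status reported by the page, not part of the rule definition.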

source alerts

9.071s ago

6.371ms

Rule State Error Last Evaluation Evaluation Time
alert: HighRateOfDiskUsed expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_avail_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100 > 85 for: 5m labels: severity: warning type: source annotations: description: disk usage exceeds the 85% limit, instance:{{$labels.instance}},device:{{$labels.device}},fstype:{{$labels.fstype}} summary: disk usage exceeds the limit ok 9.071s ago 397.9us
alert: Evicted Pod expr: kube_pod_container_status_terminated_reason{reason="Evicted"} == 1 for: 5m labels: severity: warning type: source annotations: description: a pod eviction occurred, instance:{{$labels.instance}},namespace:{{$labels.namespace}},pod:{{$labels.pod}} summary: a pod eviction occurred ok 9.071s ago 44.88us
alert: HighRateOfMemoryUsed-Hour expr: 100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[1h]) + avg_over_time(node_memory_Cached_bytes[1h]) + avg_over_time(node_memory_Buffers_bytes[1h])) / avg_over_time(node_memory_MemTotal_bytes[1h]))) > 90 for: 5m labels: severity: warning type: source annotations: description: memory usage exceeded 90% over 1 hour, instance:{{$labels.instance}},namespace:{{$labels.kubernetes_pod_name}},pod:{{$labels.name}} summary: memory usage exceeded 90% over 1 hour ok 9.071s ago 754.7us
alert: HighRateOfMemoryUsed-Minute expr: 100 * (1 - avg_over_time(node_memory_MemAvailable_bytes[1m]) / avg_over_time(node_memory_MemTotal_bytes[1m])) > 90 for: 1m labels: severity: warning type: source annotations: description: memory usage exceeded 90% over 1 minute, instance:{{$labels.instance}},namespace:{{$labels.kubernetes_pod_name}},pod:{{$labels.name}} summary: memory usage exceeded 90% over 1 minute ok 9.07s ago 191.8us
alert: HighRateOfCpuUsed expr: 100 * (1 - sum by(instance) (increase(node_cpu_seconds_total{mode="idle"}[1h])) / sum by(instance) (increase(node_cpu_seconds_total[1h]))) > 95 for: 5m labels: severity: warning type: source annotations: description: CPU usage exceeded 95% over 1 hour, instance:{{$labels.instance}} summary: CPU usage exceeded 95% over 1 hour ok 9.07s ago 4.968ms

calculation

11.294s ago

358.8us

Rule State Error Last Evaluation Evaluation Time
record: user:online expr: label_join(sum without(instance) (user_online), "computed", ",", "job") ok 11.294s ago 159.4us
record: user:onhook expr: label_join(sum without(instance) (hook_count), "computed", ",", "job") ok 11.294s ago 72.02us
record: user:hookusers expr: label_join(sum without(instance) (hook_count), "computed", ",", "job") ok 11.294s ago 58.2us
record: android:status expr: label_join(sum without(instance) (android_stats), "computed", ",", "job") ok 11.294s ago 59.27us
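
Most recording rules on this page share the same `label_join(sum ... , "computed", ",", <label>)` shape. The sketch below writes the `user:online` row out in rule-file form with comments on what each part does. The file layout is an assumption, as the original rule file is not shown.

```yaml
groups:
  - name: calculation
    rules:
      - record: user:online
        # sum without(instance) drops the instance label and adds up
        # user_online across instances. label_join then copies the value of
        # the "job" label into a new label named "computed"; the ","
        # separator only matters when several source labels are joined.
        expr: label_join(sum without(instance) (user_online), "computed", ",", "job")
```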

data_collect

15.266s ago

4.284s

Rule State Error Last Evaluation Evaluation Time
record: game:busy_total:day expr: label_join(sum by(job) (max_over_time(game_busy_count[1d])) - sum by(job) (min_over_time(game_busy_count[1d])), "computed", ",", "job") err query processing would load too many samples into memory in query execution 15.266s ago 4.284s
record: game:open_total:day expr: label_join(sum by(job) (max_over_time(game_open_count[1d])) - sum by(job) (min_over_time(game_open_count[1d])), "computed", ",", "job") ok 10.982s ago 182.3us
record: game:open expr: label_join(sum by(game) (max_over_time(game_open_count[1d])) - sum by(game) (min_over_time(game_open_count[1d])), "computed", ",", "game") ok 10.982s ago 111.6us
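
The `err` row above means the query hit the sample limit that Prometheus enforces per query (set with the `--query.max-samples` flag). One possible way to load fewer samples, sketched below, is to aggregate first and apply the `[1d]` window to the aggregated series. The helper rule name `game:busy_count:sum` is hypothetical. The result is also not identical to the original: it takes the max and min of the per-job sum rather than summing per-series max and min, so whether it is acceptable depends on how `game_busy_count` behaves.

```yaml
groups:
  - name: data_collect
    rules:
      # Hypothetical helper rule: pre-aggregate so the 1d window below reads
      # one series per job instead of every raw game_busy_count series.
      - record: game:busy_count:sum
        expr: sum by(job) (game_busy_count)
      # Same rule name as in the source, rewritten on top of the helper.
      # Rules within a group are evaluated sequentially, so the helper
      # is evaluated before this rule.
      - record: game:busy_total:day
        expr: label_join(max_over_time(game:busy_count:sum[1d]) - min_over_time(game:busy_count:sum[1d]), "computed", ",", "job")
```

The helper series only exists from the moment the rule is first loaded, so the daily value would be incomplete until a full day of data has been recorded.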

big_screen

4.476s ago

627.2us

Rule State Error Last Evaluation Evaluation Time
record: all:user:online expr: sum without(prometheus, instance) (user:online) ok 4.476s ago 125.4us
record: all:game_open:today expr: sum without(prometheus, instance) (game:open_total:day) ok 4.476s ago 34.53us
record: all:open:today expr: sum by(job) (all:game_open:today) ok 4.476s ago 22.9us
record: all:game_runing_count:today expr: count(all:game_open:today) ok 4.476s ago 28.86us
record: all:game_open_top:10 expr: topk(10, sum by(game) (game:open_total:day)) ok 4.476s ago 53.27us
record: all:ninety_percent_played_game:day expr: topk(10, histogram_quantile(0.1, sum by(game, le) (game:played_time_bucket_total:day))) ok 4.476s ago 54.79us
record: all:node_count expr: count(node:memory_usage_ratio) ok 4.476s ago 139.2us
record: all:server_busy:count expr: sum by(job) (game:busy_total:day) ok 4.476s ago 35.61us
record: all:played_time:day expr: sum by(game) (game:played_time_sum_total:day) ok 4.476s ago 22.39us
record: all:played_count:day expr: sum by(game) (game:played_time_count:day) ok 4.476s ago 23.18us
record: all:avg_played_time_top:10 expr: topk(10, all:played_time:day / all:played_count:day) ok 4.476s ago 49.61us
record: all:black_count:day expr: sum without(prometheus, instance) (game:black_count:day) ok 4.476s ago 22.06us

cluster_namespace_containers

8.598s ago

431.4us

Rule State Error Last Evaluation Evaluation Time
record: cluster:namespace_container_running_count expr: label_join(sum by(namespace) (kube_pod_container_status_running), "computed", ",", "namespace") ok 8.598s ago 144.1us
record: cluster:namespace_container_waiting_count expr: label_join(sum by(namespace) (kube_pod_container_status_waiting), "computed", ",", "namespace") ok 8.598s ago 56.92us
record: cluster:namespace_container_terminated_count expr: label_join(sum by(namespace) (kube_pod_container_status_terminated), "computed", ",", "namespace") ok 8.598s ago 55.39us
record: cluster:namespace_container_restart:30m expr: label_join(sum by(namespace) (delta(kube_pod_container_status_restarts_total[30m])), "computed", ",", "namespace") ok 8.598s ago 78us
record: cluster:namespace_cpu_requests expr: label_join(sum by(namespace) (kube_pod_container_resource_requests_cpu_cores), "computed", ",", "namespace") ok 8.598s ago 41.34us
record: cluster:namespace_memory_requests expr: label_join(sum by(namespace) (kube_pod_container_resource_requests_memory_bytes), "computed", ",", "namespace") ok 8.598s ago 43.95us

cluster_namespace_deployments

5.642s ago

280.5us

Rule State Error Last Evaluation Evaluation Time
record: cluster:namespace_deployments_count expr: label_join(sum by(namespace) (kube_deployment_spec_replicas), "computed", ",", "namespace") ok 5.643s ago 128.6us
record: cluster:namespace_deplouments_update_count expr: label_join(sum by(namespace) (kube_deployment_status_replicas_updated), "computed", ",", "namespace") ok 5.642s ago 62.23us
record: cluster:namespace_deployments_unavailable_count expr: label_join(sum by(namespace) (kube_deployment_status_replicas_unavailable), "computed", ",", "namespace") ok 5.642s ago 38.6us
record: cluster:namespace_deployment_status_replicas expr: label_join(kube_deployment_status_replicas, "computed", ",", "namespace") ok 5.642s ago 43.36us

cluster_namespace_pods

2.592s ago

141.2us

Rule State Error Last Evaluation Evaluation Time
record: cluster:namespace_pods_status_count expr: label_join(sum by(namespace, phase) (kube_pod_status_phase), "computed", ",", "phase") ok 2.592s ago 135.9us

cluster_node

10.423s ago

245.8us

Rule State Error Last Evaluation Evaluation Time
record: node:resource_has_requests expr: label_join(sum by(node, resource) (kube_pod_container_resource_requests), "computed", "-", "node", "resource") ok 10.423s ago 147.5us
record: node:pods_count expr: label_join(sum by(node) (kube_pod_info), "computed", ",", "node") ok 10.423s ago 46.08us
record: node:resource_capacity expr: sum by(resource) (kube_node_status_capacity) ok 10.423s ago 33.95us

cluster_pods

10.031s ago

157us

Rule State Error Last Evaluation Evaluation Time
record: cluster:allocate_pods:count expr: label_join(sum by(job) (kube_node_status_allocatable_pods), "computed", ",", "job") ok 10.031s ago 150.4us

cluster_resources

14.379s ago

474us

Rule State Error Last Evaluation Evaluation Time
record: cluster:cpu_count expr: sum by(prometheus) (node:cpu_count) ok 14.379s ago 188.1us
record: cluster:pods_allocatable_count expr: sum by(prometheus) (kube_node_status_allocatable_pods) ok 14.379s ago 38.76us
record: cluster:pods_count expr: sum by(prometheus) (node:pods_count) ok 14.379s ago 28.68us
record: cluster:pods_usage expr: cluster:pods_count / sum by(prometheus) (cluster:allocate_pods:count) ok 14.379s ago 55.3us
record: cluster:resource_has_requests_ratio expr: sum by(resource) (node:resource_has_requests) / sum by(resource) (kube_node_status_capacity) ok 14.379s ago 47.2us
record: cluster:resource_gpu_used expr: sum(node:gpu_used) ok 14.379s ago 23.27us
record: cluster:resource_gpu_total expr: sum(node:gpu_allocate) ok 14.379s ago 47.1us
record: cluster:resource_gpu_usage expr: cluster:resource_gpu_used / cluster:resource_gpu_total ok 14.379s ago 33.57us

node

5.544s ago

11.71ms

Rule State Error Last Evaluation Evaluation Time
record: node:runtime:seconds expr: label_join(time() - node_boot_time_seconds, "computed", ",", "instance") ok 5.544s ago 271.7us
record: node:cpu_count expr: label_join(count by(instance) (count by(cpu, instance) (node_cpu_seconds_total{mode="system"})), "computed", ",", "instance") ok 5.544s ago 361.3us
record: node:cpu_usage_ratio:1m expr: label_join(1 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))), "computed", ",", "instance") ok 5.544s ago 370.4us
record: node:memory_usage_ratio expr: label_join(((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes)), "computed", ",", "instance") ok 5.543s ago 412.1us
record: node:memory_usage_size expr: label_join(node_memory_MemTotal_bytes - (node_memory_Cached_bytes + node_memory_Buffers_bytes + node_memory_MemFree_bytes), "computed", ",", "instance") ok 5.543s ago 308.3us
record: node:cpu_usage_ratio_system:1m expr: label_join(avg by(instance) (irate(node_cpu_seconds_total{mode="system"}[1m])), "computed", ",", "instance") ok 5.543s ago 381.8us
record: node:cpu_usage_ratio_user:1m expr: label_join(avg by(instance) (irate(node_cpu_seconds_total{mode="user"}[1m])), "computed", ",", "instance") ok 5.543s ago 336.4us
record: node:cpu_usage_ratio_iowait:1m expr: label_join(avg by(instance) (irate(node_cpu_seconds_total{mode="iowait"}[1m])), "computed", ",", "instance") ok 5.542s ago 254.9us
record: node:network_receive:5m expr: label_join(irate(node_network_receive_bytes_total[5m]) * 8, "computed", ",", "instance") ok 5.542s ago 1.929ms
record: node:network_transmit:5m expr: label_join(irate(node_network_transmit_bytes_total[5m]) * 8, "computed", ",", "instance") ok 5.54s ago 1.979ms
record: node:disk_read_iops:1m expr: label_join(irate(node_disk_reads_completed_total[1m]), "computed", ",", "instance") ok 5.538s ago 95.42us
record: node:disk_write_iops:1m expr: label_join(irate(node_disk_writes_completed_total[1m]), "computed", ",", "instance") ok 5.538s ago 53.04us
record: node:disk_read:1m expr: label_join(irate(node_disk_read_bytes_total[1m]), "computed", ",", "instance") ok 5.538s ago 52.48us
record: node:disk_write:1m expr: label_join(irate(node_disk_written_bytes_total[1m]), "computed", ",", "instance") ok 5.538s ago 45.67us
record: node:read_disk_io_time:1m expr: label_join(irate(node_disk_read_time_seconds_total[1m]), "computed", ",", "instance") ok 5.538s ago 47.87us
record: node:write_disk_io_time:1m expr: label_join(irate(node_disk_write_time_seconds_total[1m]), "computed", ",", "instance") ok 5.538s ago 46.61us
record: node:disk_io_time:1m expr: label_join(irate(node_disk_io_time_seconds_total[1m]), "computed", ",", "instance") ok 5.538s ago 44.62us
record: node:tcp_activeopens:1m expr: label_join(irate(node_netstat_Tcp_ActiveOpens[1m]), "computed", ",", "instance") ok 5.538s ago 158.5us
record: node:tcp_passiveopens:1m expr: label_join(irate(node_netstat_Tcp_PassiveOpens[1m]), "computed", ",", "instance") ok 5.538s ago 158.6us
record: node:filesystem_usage_size expr: label_join(node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{fstype=~"ext4|xfs"}, "computed", ",", "instance") ok 5.538s ago 1.311ms
record: node:filesystem_usage_ratio expr: label_join(1 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}), "computed", ",", "instance") ok 5.537s ago 1.233ms
record: node:gpu_used expr: label_join(sum by(instance) (container_accelerator_memory_used_bytes{container_name!="deepomatic-shared-gpu-nvidia-device-plugin-ctr"}) / count by(instance) (container_accelerator_duty_cycle{container_name!="deepomatic-shared-gpu-nvidia-device-plugin-ctr"}), "computed", ",", "instance") ok 5.536s ago 111.3us
record: node:gpu_allocate expr: label_join(sum by(instance) (container_accelerator_memory_total_bytes{container_name!="deepomatic-shared-gpu-nvidia-device-plugin-ctr"}) / count by(instance) (container_accelerator_duty_cycle{container_name!="deepomatic-shared-gpu-nvidia-device-plugin-ctr"}), "computed", ",", "instance") ok 5.536s ago 108.5us
record: node:disk_read_iops:1m expr: label_join(irate(node_disk_reads_completed_total[1m]), "computed", ",", "instance") ok 5.536s ago 60.31us
record: node:cpu_usage_ratio:1m expr: label_join(1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])), "computed", ",", "instance") ok 5.536s ago 338.8us
record: node:load1 expr: label_join(node_load1, "computed", ",", "instance") ok 5.535s ago 170.4us
record: node:load5 expr: label_join(node_load5, "computed", ",", "instance") ok 5.535s ago 152.8us
record: node:load15 expr: label_join(node_load15, "computed", ",", "instance") ok 5.535s ago 144.8us
record: node:mem_total expr: label_join(node_memory_MemTotal_bytes, "computed", ",", "instance") ok 5.535s ago 117.9us
record: node:mem_free expr: label_join(node_memory_MemFree_bytes, "computed", ",", "instance") ok 5.535s ago 134.8us
record: node:mem_available expr: label_join(node_memory_MemAvailable_bytes, "computed", ",", "instance") ok 5.535s ago 129.8us
record: node:mem_cache expr: label_join(node_memory_Cached_bytes, "computed", ",", "instance") ok 5.535s ago 165.2us
record: node:mem_buffers expr: label_join(node_memory_Buffers_bytes, "computed", ",", "instance") ok 5.535s ago 139.5us
record: node:gpu_used expr: label_join(container_accelerator_memory_used_bytes{containenr_name="node-gpu-exporter"}, "computed", ",", "instance") ok 5.534s ago 47.55us
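
The last row filters on `containenr_name`, which looks like a misspelling of `container_name`, the label used by the other GPU rules in this group. A selector on a label that no series carries matches nothing, so as written this rule probably records no samples even though its state is `ok`. A sketch of the presumably intended rule, assuming `container_name` was meant:

```yaml
groups:
  - name: node
    rules:
      # Assumption: container_name is the intended label. Note that this
      # group already records node:gpu_used in an earlier rule, so the two
      # definitions would write to the same metric name.
      - record: node:gpu_used
        expr: label_join(container_accelerator_memory_used_bytes{container_name="node-gpu-exporter"}, "computed", ",", "instance")
```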

pods

1.956s ago

7.097ms

Rule State Error Last Evaluation Evaluation Time
record: pod:run_time:seconds expr: label_join(time() - kube_pod_created, "computed", ",", "pod") ok 1.956s ago 131.2us
record: pod:filesystem_usage:bytes expr: label_join(sum by(pod_name, namespace) (container_fs_usage_bytes{container_name!="POD",image!=""}), "computed", ",", "pod_name") ok 1.956s ago 2.002ms
record: pod:cpu_reuqest expr: label_join(sum by(pod, namespace) (kube_pod_container_resource_requests{resource="cpu"}), "computed", ",", "pod") ok 1.954s ago 106.3us
record: pod:memory_request expr: label_join(sum by(pod, namespace) (kube_pod_container_resource_requests{resource="memory"}), "computed", ",", "pod") ok 1.954s ago 64.07us
record: pod:cpu_usage expr: label_join(sum by(namespace, pod_name) (rate(container_cpu_usage_seconds_total{container_name!="POD",image!=""}[1m])), "computed", ",", "pod_name") ok 1.954s ago 2.772ms
record: pod:memory_usage_of_requests expr: label_join(sum by(pod_name) (container_memory_rss{container_name!="POD",image!=""}), "computed", ",", "pod_name") ok 1.952s ago 1.775ms
record: pod:gpu_smutil expr: label_join(sum without(container) (nvidia_gpu_process_smutil), "computed", ",", "pod_name") ok 1.95s ago 65.96us
record: pod:gpu_memutil expr: label_join(sum without(container) (nvidia_gpu_process_memutil), "computed", ",", "pod_name") ok 1.95s ago 42.61us
record: pod:gpu_decutil expr: label_join(sum without(container) (nvidia_gpu_process_decutil), "computed", ",", "pod_name") ok 1.95s ago 39.47us
record: pod:gpu_encutil expr: label_join(sum without(container) (nvidia_gpu_process_encutil), "computed", ",", "pod_name") ok 1.95s ago 37.11us
record: pod:gpu_graph expr: label_join(sum without(container) (nvidia_gpu_process_graph), "computed", ",", "pod_name") ok 1.95s ago 40.88us