# OPCT Review/Check Rules

The OPCT rules are used by the report command to evaluate the data collected by the OPCT execution. The HTML report links directly to the rule ID on this page.

The rule details can be used as an additional resource during the review process.

The acceptance criteria for the rules are based on multiple CI jobs used as a reference to evaluate the expected results. If you have any questions about the rules, please file an issue in the OPCT repository.
## Rules
### OPCT-001

- Name: Kubernetes Conformance [10-openshift-kube-conformance] must pass 100%
- Description: The Kubernetes Conformance suite (defined as `kubernetes/conformance` in `openshift-tests`) implements the e2e tests required by the Kubernetes Certification. Those tests are baseline tests for an operational Kubernetes cluster. All tests must pass before the OpenShift Conformance suite is reviewed.
- Action: Review the logs for each failed test in the Kubernetes Conformance suite.
- Expected:

```text
- 10-openshift-kube-conformance:
  [...]
  - Failed (Filter SuiteOnly): 0 (0.00%)
  - Failed (Priority)        : 0 (0.00%)
  - Status After Filters     : passed
```

- Troubleshoot: Review the high-priority failures:

```sh
$ ./opct report archive.tar.gz
(..)
=> 10-openshift-kube-conformance: (2 failures, 0 flakes)

 --> Failed tests to Review (without flakes) - Immediate action: [total=2] [sig-apps=1 (50.00%)] [sig-api-machinery=1 (50.00%)]

15  [sig-apps] Deployment deployment should support proportional scaling [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
6   [sig-api-machinery] Aggregator Should be able to support the 1.17 Sample API Server using the current Aggregator [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
```
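To debug a single failure outside of OPCT, the failed test can be re-run with the `openshift-tests` binary. A minimal sketch, assuming your `oc` version supports `oc adm release extract --command` (the test name is taken from the example above):

```sh
# Extract the openshift-tests binary from the release image in use.
$ oc adm release extract --command=openshift-tests --to=. \
    $(oc get clusterversion version -o jsonpath='{.status.desired.image}')

# Re-run a single failed test by its full name.
$ ./openshift-tests run-test "[sig-apps] Deployment deployment should support proportional scaling [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"
```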
### OPCT-002

- Name: Plugin Conformance Upgrade [05-openshift-cluster-upgrade] must pass
- Description: The cluster upgrade plugin must pass (or skip) the execution. The cluster upgrade plugin is responsible for scheduling the upgrade conformance suite, which upgrades the cluster while running the conformance suite to monitor the upgrade. This plugin is enabled when the execution mode is `upgrade`.
- Action: Check the cluster upgrade logs (click on the test name in the job list). If the cause is undefined, re-run the execution or raise a question.
- Expected: The cluster upgrade plugin must pass the execution when the execution mode is `upgrade`.
- Troubleshoot: Review the cluster upgrade logs and check the artifacts generated by the plugin.
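For reference, a minimal sketch of starting an execution in upgrade mode; the `--mode` and `--upgrade-to-image` flags follow the OPCT user guide and should be verified with `./opct run --help` for your version:

```sh
# Start the validation in upgrade mode (replace 4.Y.Z with the target release).
$ ./opct run --mode=upgrade \
    --upgrade-to-image=$(oc adm release info 4.Y.Z -o jsonpath={.image})
```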
### OPCT-003

- Name: Plugin Collector [99-openshift-artifacts-collector] must pass
- Description: The collector plugin is responsible for retrieving information from the cluster, including must-gather, parsed etcd logs, and the e2e test lists for the conformance suites. The expected value of the state is `passed`; otherwise, the review flow will be impacted.
- Action: Check the artifacts collector logs (click on the test name in the job list). If the cause is undefined, re-run the execution.
- Expected: The artifacts collector plugin must pass the execution.
- Troubleshoot: Review the artifacts collector logs and check the artifacts generated by the plugin:
    - Check the failed tests:

```sh
$ ./opct results -p 99-openshift-artifacts-collector archive.tar.gz
```

    - Check the plugin logs:

```sh
$ grep -B 5 'Creating failed JUnit' \
    podlogs/openshift-provider-certification/sonobuoy-99-*/logs/plugin.txt
```
### OPCT-004

- Name: OpenShift Conformance [20-openshift-conformance-validated]: Pass ratio must be >=98.5%
- Description: The OpenShift Conformance suite must not report a high number of failures in the base execution. Ideally, lower is better, but the e2e tests are frequently updated/improved to fix bugs, and the tested release could eventually be impacted by those issues. The 1.5% error budget is a baseline derived from several executions on known platforms. A higher failure ratio could be related to errors in the tested environment, cluster configuration, and/or infrastructure issues. Check the test logs to isolate the issues. When applying for cluster validation with Red Hat teams, this check must be reviewed immediately before submitting the results, as it indicates a potential problem in the infrastructure or a misconfiguration. Review the OpenShift documentation for installing on agnostic platforms.
- Action: Check the failures section `Test failures [high priority]` and review the logs for each failed test.
- Expected: Failed tests must stay under the 1.5% error budget.
- Troubleshoot: Load the HTML report and navigate to the failures:
    1. Generate the HTML report.
    2. Review the logs for each failed test.

```sh
$ ./opct report --save-to ./results archive.tar.gz
$ firefox http://localhost:8000
```
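The commands above assume the report command serves the saved assets on port 8000. To browse a previously saved `./results` directory later, any static file server works; a sketch using Python's standard library (an assumption, not part of OPCT):

```sh
# Serve the saved report directory on port 8000 and open it in a browser.
$ python3 -m http.server 8000 --directory ./results
$ firefox http://localhost:8000
```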
### OPCT-005

- Name: OpenShift Conformance Validation [20]: Filter Priority Requirement >= 99.5%
- Description: The OpenShift Conformance suite must not report a high number of failures after applying filters. Ideally, lower is better, but the e2e tests are frequently updated/improved to fix bugs, and the tested release could eventually be impacted by those issues. An error budget higher than 0.5% could indicate issues in the tested environment. Check the test logs for the OpenShift Conformance suite, Priority section, to isolate the issues.
- Action:
    - Check the failures section `Test failures [high priority]`.
    - Review the logs for each failed test.
    - The remaining failures must be reviewed individually to achieve a successful installation. The root cause of individual failures must be identified.
- Expected: Error budget within the acceptance criteria. Errors within the budget must be reviewed and their root cause identified.
### OPCT-005B

- Name: OpenShift Conformance Validation [20]: Required to Pass After Filters
- Description: The OpenShift Conformance suite must report passing after applying filters that remove common/well-known issues.
- Action: Check the failures section `Test failures [high priority]`. Dependencies must be passing prior to this check.
- Dependencies: OPCT-004, OPCT-005
### OPCT-010

- Name: The cluster logs generate an accepted error budget
- Description: The cluster logs (must-gather event logs) should generate few errors. The error budget is a metric that helps to isolate the health of the cluster. The error counters are relative values based on the values observed in CI executions on tested providers/platforms.
- Action: Check the errors section in the report, and explore the logs for each service in must-gather using tools like omc, omg, grep, etc. (must-gather readers/explorers).
- Expected: The error events in must-gather are relative values based on the values observed on known platforms.
- Troubleshoot: Open the error events section in the report and review the rank of failed keywords, then check the rank by namespace and services for each failure. Error budgets help to focus on specific services that may contribute to the cluster failures.

To check the error counters by e2e test using the HTML report, navigate to `Workload Errors` in the left menu. The table `Error Counters by Namespace` shows the namespaces reporting a high number of errors, ranked by the highest, so you can start exploring the logs in those namespaces. The table `Error Counters by Pod and Pattern` in the `Workload Errors` menu also reports the pods; you can use that information as well to isolate any issue in your environment.

To explore the logs, you can extract the must-gather collected by the plugin `99-openshift-artifacts-collector`:

```sh
# extract must-gather from the results
tar xfz artifact.tar.gz \
    plugins/99-openshift-artifacts-collector/results/global/artifacts_must-gather.tar.xz

# extract must-gather
mkdir must-gather && \
    tar xfJ plugins/99-openshift-artifacts-collector/results/global/artifacts_must-gather.tar.xz \
    -C must-gather

# check workload logs with 'omc' (example: etcd)
omc use must-gather
omc logs -n openshift-etcd etcd-control-plane-0 -c etcd
```
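To approximate the report's error ranking directly from the extracted must-gather, a rough sketch with grep (the keyword and directory layout are assumptions; adjust to your environment):

```sh
# Rank namespaces by the number of 'error' matches in their pod logs.
# The layout (*/namespaces/<ns>/pods/...) is the usual must-gather
# structure; adjust the keyword and glob if yours differs.
for ns in must-gather/*/namespaces/*/; do
  count=$(grep -ri 'error' "${ns}pods" 2>/dev/null | wc -l)
  echo "${count} $(basename "${ns}")"
done | sort -rn | head -10
```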
### OPCT-010A

- Name: etcd logs: slow requests: average should be under 500ms
- Description: The etcd logs must report an average of slow requests lower than 500 milliseconds. Slow requests are a metric that helps to understand the health of etcd. They are relative values based on the values observed on known and tested cloud providers/platforms.
- Action: Review whether the storage volume for the control plane nodes, or the dedicated volume for etcd, has the performance required to run etcd in a production environment.
- Expected: The slow requests in the etcd logs are relative values based on the values observed on known platforms.
- Troubleshoot:
    1. Review the documentation for the required storage for etcd:
        - A) Product Documentation
        - B) Red Hat Article: Understanding etcd and the tunables/conditions affecting performance
        - C) Red Hat Article: How to Use 'fio' to Check Etcd Disk Performance in OCP
        - D) etcd-operator: baseline speed for standard hardware
    2. Check the performance described in article (B); see the fio sketch after the parser example below.
    3. Review the processed values from your environment.

Requirement: it is required to run the conformance validation on a new cluster. The validation parses the etcd logs from the entire cluster, including historical data; if you changed the storage and didn't recreate the cluster, the results will include slow requests from the old storage, impacting the current view.

Run the report with the debug flag `--loglevel=debug`:

```text
(...)
DEBU[2023-09-25T12:52:05-03:00] Check OPCT-010 Failed Acceptance criteria: want=[<500] got=[690.412]
DEBU[2023-09-25T12:52:05-03:00] Check OPCT-011 Failed Acceptance criteria: want=[<1000] got=[3091.49]
```

Extract the information from the logs using the parser utility:

```sh
# Export the path of the extracted must-gather. Example:
export MUST_GATHER_PATH=${PWD}/must-gather.local.2905984348081335046

# Run the utility
cat ${MUST_GATHER_PATH}/*/namespaces/openshift-etcd/pods/*/etcd/etcd/logs/current.log \
    | opct adm parse-etcd-logs --aggregator hour

# Or, use the must-gather path
opct adm parse-etcd-logs --aggregator hour --path ${MUST_GATHER_PATH}
```
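A sketch of the fio check described in articles (B) and (C); the job parameters mirror the benchmark commonly cited in the references below and are not OPCT-specific. Review the fdatasync duration percentiles in the output (the referenced guidance expects the 99th percentile under ~10ms for production etcd):

```sh
# Run on the same volume that backs etcd (e.g. /var/lib/etcd on a control
# plane node). Writes ~22MiB in 2300-byte blocks, fsyncing every write.
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-perf
rm -rf /var/lib/etcd/fio-test
```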
References:
- etcd: Hardware recommendations
- OpenShift Docs: Planning your environment according to object maximums
- OpenShift KCS: Backend Performance Requirements for OpenShift etcd
- IBM: Using Fio to Tell Whether Your Storage is Fast Enough for Etcd
### OPCT-010B

- Name: etcd logs: slow requests: maximum should be under 1000ms
- Description: The etcd logs must report a maximum of slow requests lower than 1000 milliseconds. One or more requests with high latency could impact the cluster performance. Slow requests are a metric that helps to understand the health of etcd; they are relative values based on the values observed on known platforms. The maximum value is the highest slow-request value reported in the etcd logs; it must not be higher than 1 second.
- Action: Review whether the storage volume for the control plane nodes, or the dedicated volume for etcd, has the performance required to run etcd in a production environment.
- Expected: The slow requests in the etcd logs are relative values based on the values observed on known platforms.
- Troubleshoot: Review the Troubleshoot section of OPCT-010A.
- Dependencies: OPCT-010A
### OPCT-011

- Name: The test suite generates an accepted error budget
- Description: The test suite must generate an accepted error budget. The error budget is a metric that helps to understand the health of the test suite: the total number of errors that the test suite can generate before it is considered unreliable. The error budget is a relative value based on the values observed on known platforms. To check the error counters by e2e test using the HTML report, navigate to `Suite Errors` in the left menu and the table `Tests by Error Pattern`. To check the logs, navigate to the Plugin menu and check the `failure` and `systemOut` logs.
- Action: Check the errors section in the report and resolve the log failures for the failed test.
- Expected: The error budget is a relative value based on the values observed on known platforms.
- Troubleshoot: Open the error budget section in the report and review the logs for each failed test.
### OPCT-020

- Name: All nodes must be healthy
- Description: All nodes must be healthy. Node health is a metric that helps to understand the health of the cluster.
- Action: Check the node health section in the report and review the logs for each node.
- Expected: All nodes must be healthy.
- Troubleshoot: One or more nodes were detected as unhealthy when the aggregator server collected the cluster state (at the end of the job). Unhealthy nodes can cause test failures, so this check can be used as a helper while investigating test failures. This check can be skipped if it is not causing failures in the conformance tests. Check the unhealthy nodes in the cluster by reviewing the nodes and their events:

```sh
$ omc get nodes
$ omc describe node <node_name>
```
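To list only the nodes that are not Ready, a small sketch (assuming `omc` prints the same columns as `oc get nodes`):

```sh
# List nodes whose STATUS column is not exactly 'Ready'
# (e.g. NotReady, Ready,SchedulingDisabled).
$ omc get nodes | awk 'NR>1 && $2 != "Ready"'
```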
### OPCT-021

- Name: Pods Healthy must report higher than 98%
- Description: Pods Healthy must report higher than 98%. Pod health is a metric that helps to understand the health of some components; high pod health is a good indicator of cluster health. The 2% error budget is a baseline derived from several executions on known platforms.
- Action: Check the failing pods, and isolate whether they are related to the environment and/or the validation tests.
- Expected: Pods Healthy must report higher than 98%.
- Troubleshoot: One or more pods were detected as unhealthy when the aggregator server collected the cluster state (at the end of the job). Run the CLI command `opct results archive.tar.gz` to review the failed pods, then explore the logs for each pod in the must-gather available in the collector plugin.

Check the unhealthy pods:

```sh
$ ./opct report archive.tar.gz
(...)
 Health summary:      [A=True/P=True/D=True]
 - Cluster Operators : [33/0/0]
 - Node health       : 6/6  (100.00%)
 - Pods health       : 246/247 (99.00%)
 Failed pods:
  Namespace/PodName                                              Healthy  Ready  Reason     Message
  openshift-kube-controller-manager/installer-6-control-plane-1  false    False  PodFailed
(...)
```

Explore the pods:

```sh
$ omc get pods -A | egrep -v '(Running|Completed)'
```
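To dig into a specific failed pod from the must-gather, a sketch reusing the pod reported in the summary above (names are illustrative):

```sh
# Inspect the failed pod reported in the health summary.
$ omc describe pod -n openshift-kube-controller-manager installer-6-control-plane-1
$ omc logs -n openshift-kube-controller-manager installer-6-control-plane-1
```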
### OPCT-022

- Name: Detected one or more plugin(s) with potential invalid result
- Description: The plugin(s) must pass the execution, or generate valid results. The plugins are responsible for executing the conformance test suites and generating the report.
- Action: Check the plugin logs (click on the test name in the job list). If the cause is undefined, re-run the execution.
- Expected: The plugin(s) must pass the execution, or generate valid results.
- Troubleshoot: Review the plugin logs and check the artifacts generated by the plugin. Possible causes of failed plugins:
    - The plugin is not able to execute the tests: check the plugin logs for errors in the directory `plugins` in the report archive.
    - The plugin total counter is equal to the failed counter: check the output of `opct report` indicating the failed plugins.
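A sketch for locating the plugin artifacts and logs inside the result archive; the paths follow the layout shown in OPCT-003 and may vary between versions:

```sh
# List the plugin result directories inside the archive.
$ tar tf archive.tar.gz | grep '^plugins/' | head

# Extract and inspect a plugin's pod log (path layout from OPCT-003;
# adjust the glob to the plugin under review).
$ tar xf archive.tar.gz podlogs
$ less podlogs/openshift-provider-certification/sonobuoy-20-*/logs/plugin.txt
```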
### OPCT-023A

- Name: Sanity [10-openshift-kube-conformance]: potential missing tests in suite
- Description: The Kubernetes Conformance suite must have an acceptable number of tests to be considered a valid execution.
- Action: This is unexpected for a regular cluster validation. Check the plugin logs and the artifacts generated by `opct report` to verify whether the job for the Kubernetes Conformance suite has been completed.
- Expected: The Kubernetes Conformance suite must have at least 300 tests to be valid. This number is based on the Kubernetes Conformance suite across different releases. This is a sanity check to ensure that the plugin is running correctly.
- Troubleshoot: Review the plugin logs and check the artifacts generated by the plugin.
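To roughly count the tests reported by the plugin, a sketch assuming `opct results -p` prints one test per line, as in the OPCT-003 example (header lines make the count approximate):

```sh
# Approximate count of tests executed by the kube-conformance plugin.
$ ./opct results -p 10-openshift-kube-conformance archive.tar.gz | wc -l
```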
### OPCT-023B

- Name: Sanity [20-openshift-conformance-validated]: potential missing tests in suite
- Description: The OpenShift Conformance suite must have an acceptable number of tests to be considered a valid execution.
- Action: Review the plugin logs and check the artifacts generated by the plugin.
- Expected: The OpenShift Conformance suite must have at least 3000 tests to be valid. This number is based on the OpenShift Conformance suite across different releases. This is a sanity check to ensure that the plugin is running correctly.
### OPCT-030
- Name: Node Topology: ControlPlaneTopology HighlyAvailable must use multi-zone
- Description: The control plane nodes must be distributed across multiple zones to ensure high availability.
- Action: Check the control plane nodes and ensure that the nodes are distributed across multiple zones.
- Expected: The control plane nodes must be distributed across multiple zones to ensure high availability.
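To verify the zone distribution of the control plane nodes, a sketch using the upstream well-known topology label (not OPCT-specific):

```sh
# Show the zone label of each control plane node.
$ oc get nodes -l node-role.kubernetes.io/master \
    -L topology.kubernetes.io/zone
```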
### Platform Type must be supported by OPCT

- Name: Platform Type must be supported by OPCT
- Description: The platform type must be supported by the OPCT tool to generate valid and tested reports. You can run the conformance tests on different platforms, but the OPCT results are tested with specific platforms, and the report is built and calibrated based on the tested platforms.
- Action: Check the platform type in the report and ensure that the platform is supported by the OPCT tool.
- Expected: The platform type must be supported by the OPCT tool to generate valid and tested reports.
- Troubleshoot: Review the platform type in the report and check the artifacts generated by the plugin: `oc get infrastructure`
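To read the platform type directly from the cluster, a sketch using the standard Infrastructure API:

```sh
# Print the platform type reported by the cluster.
$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}{"\n"}'
```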
### Cluster Version Operator must be Available

- Name: Cluster Version Operator must be Available
- Description: The Cluster Version Operator must be available to ensure that the cluster is in a healthy state.
- Action: Check the Cluster Version Operator logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The Cluster Version Operator must be available to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the Cluster Version Operator logs and check the artifacts generated by the plugin.
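To inspect the ClusterVersion conditions evaluated by this check and the three that follow (Available, Failing, Progressing, ReleaseAccepted), a sketch using the standard ClusterVersion API:

```sh
# Expect Available=True, Failing=False, Progressing=False, ReleaseAccepted=True.
$ oc get clusterversion version \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```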
### Cluster condition Failing must be False

- Name: Cluster condition Failing must be False
- Description: The cluster condition `Failing` must be False to ensure that the cluster is in a healthy state.
- Action: Check the cluster condition logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The cluster condition `Failing` must be False to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the cluster condition logs and check the artifacts generated by the plugin.

### Cluster upgrade must not be Progressing

- Name: Cluster upgrade must not be Progressing
- Description: The cluster condition `Progressing` must be False (no upgrade in progress) to ensure that the cluster is in a healthy state.
- Action: Check the cluster upgrade logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The cluster upgrade must not be `Progressing` to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the cluster upgrade logs and check the artifacts generated by the plugin.

### Cluster ReleaseAccepted must be True

- Name: Cluster ReleaseAccepted must be True
- Description: The cluster condition `ReleaseAccepted` must be True to ensure that the cluster is in a healthy state.
- Action: Check the `ReleaseAccepted` condition logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The cluster condition `ReleaseAccepted` must be True to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the `ReleaseAccepted` condition logs and check the artifacts generated by the plugin.
### Infrastructure status must have Topology=HighlyAvailable

- Name: Infrastructure status must have Topology=HighlyAvailable
- Description: The infrastructure status must have `Topology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Action: Check the infrastructure status (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The infrastructure status must have `Topology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the infrastructure status and check the artifacts generated by the plugin.

### Infrastructure status must have ControlPlaneTopology=HighlyAvailable

- Name: Infrastructure status must have ControlPlaneTopology=HighlyAvailable
- Description: The infrastructure status must have `ControlPlaneTopology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Action: Check the infrastructure status (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The infrastructure status must have `ControlPlaneTopology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the infrastructure status and check the artifacts generated by the plugin.
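To read both topology fields evaluated by these two checks, a sketch using the standard Infrastructure API (the status fields are `infrastructureTopology` and `controlPlaneTopology`):

```sh
# Both values are expected to be HighlyAvailable.
$ oc get infrastructure cluster -o \
    jsonpath='{.status.infrastructureTopology} {.status.controlPlaneTopology}{"\n"}'
```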
Page generated automatically by `opct adm generate checks-docs`