# OPCT Review/Check Rules

The OPCT rules are used by the report command to evaluate the data collected by the OPCT execution. The HTML report links directly to the rule ID on this page.

The rule details can be used as an additional resource during the review process.

The acceptance criteria for the rules are based on multiple CI jobs used as a reference to evaluate the expected results. If you have any questions about the rules, please file an issue in the OPCT repository.
## Rules
### OPCT-001

- Name: Kubernetes Conformance [10-openshift-kube-conformance] must pass 100%
- Description: The Kubernetes Conformance suite (defined as `kubernetes/conformance` in `openshift-tests`) implements the e2e tests required by the Kubernetes Certification. Those tests are baseline tests for an operational Kubernetes cluster. All tests must pass before the OpenShift Conformance suite is reviewed.
- Action: Review the logs for each failed test in the Kubernetes Conformance suite.
- Expected:

```text
- 10-openshift-kube-conformance:
  [...]
  - Failed (Filter SuiteOnly): 0 (0.00%)
  - Failed (Priority)        : 0 (0.00%)
  - Status After Filters     : passed
```

- Troubleshoot: Review the high-priority failures:

```sh
$ ./opct report archive.tar.gz
(..)
=> 10-openshift-kube-conformance: (2 failures, 0 flakes)

 --> Failed tests to Review (without flakes) - Immediate action: [total=2] [sig-apps=1 (50.00%)] [sig-api-machinery=1 (50.00%)]

15  [sig-apps] Deployment deployment should support proportional scaling [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
6   [sig-api-machinery] Aggregator Should be able to support the 1.17 Sample API Server using the current Aggregator [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
```
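To debug a single failure outside of OPCT, the failed test can be re-run with the `openshift-tests` binary. A minimal sketch, assuming your `oc` version supports `oc adm release extract --command` (the test name is taken from the example above):

```sh
# Extract the openshift-tests binary from the release image in use.
$ oc adm release extract --command=openshift-tests --to=. \
    $(oc get clusterversion version -o jsonpath='{.status.desired.image}')

# Re-run a single failed test by its full name.
$ ./openshift-tests run-test "[sig-apps] Deployment deployment should support proportional scaling [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"
```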
### OPCT-002

- Name: Plugin Conformance Upgrade [05-openshift-cluster-upgrade] must pass
- Description: The cluster upgrade plugin must pass (or skip) the execution. The cluster upgrade plugin is responsible for scheduling the upgrade conformance suite, which upgrades the cluster while running the conformance suite to monitor the upgrade. This plugin is enabled when the execution mode is `upgrade`.
- Action: Check the cluster upgrade logs (click on the test name in the job list). If the cause is undefined, re-run the execution or raise a question.
- Expected: The cluster upgrade plugin must pass the execution when the execution mode is `upgrade`.
- Troubleshoot: Review the cluster upgrade logs and check the artifacts generated by the plugin.
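For reference, a minimal sketch of starting an execution in upgrade mode; the `--mode` and `--upgrade-to-image` flags follow the OPCT user guide and should be verified with `./opct run --help` for your version:

```sh
# Start the validation in upgrade mode (replace 4.Y.Z with the target release).
$ ./opct run --mode=upgrade \
    --upgrade-to-image=$(oc adm release info 4.Y.Z -o jsonpath={.image})
```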
### OPCT-003

- Name: Plugin Collector [99-openshift-artifacts-collector] must pass
- Description: The collector plugin is responsible for retrieving information from the cluster, including must-gather, parsed etcd logs, and the e2e test lists for the conformance suites. The expected value of the state is `passed`; otherwise, the review flow will be impacted.
- Action: Check the artifacts collector logs (click on the test name in the job list). If the cause is undefined, re-run the execution.
- Expected: The artifacts collector plugin must pass the execution.
- Troubleshoot: Review the artifacts collector logs and check the artifacts generated by the plugin:
    - Check the failed tests:

```sh
$ ./opct results -p 99-openshift-artifacts-collector archive.tar.gz
```

    - Check the plugin logs:

```sh
$ grep -B 5 'Creating failed JUnit' \
    podlogs/openshift-provider-certification/sonobuoy-99-*/logs/plugin.txt
```
### OPCT-004

- Name: OpenShift Conformance [20-openshift-conformance-validated]: Pass ratio must be >=98.5%
- Description: The OpenShift Conformance suite must not report a high number of failures in the base execution. Ideally, lower is better, but the e2e tests are frequently updated/improved to fix bugs, and the tested release could eventually be impacted by those issues. The 1.5% error budget is a baseline derived from several executions on known platforms. A higher failure ratio could be related to errors in the tested environment, cluster configuration, and/or infrastructure issues. Check the test logs to isolate the issues. When applying for cluster validation with Red Hat teams, this check must be reviewed immediately before submitting the results, as it indicates a potential problem in the infrastructure or a misconfiguration. Review the OpenShift documentation for installing on agnostic platforms.
- Action: Check the failures section `Test failures [high priority]` and review the logs for each failed test.
- Expected: Failed tests must stay under the 1.5% error budget.
- Troubleshoot: Load the HTML report and navigate to the failures:
    1. Generate the HTML report.
    2. Review the logs for each failed test.

```sh
$ ./opct report --save-to ./results archive.tar.gz
$ firefox http://localhost:8000
```
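The commands above assume the report command serves the saved assets on port 8000. To browse a previously saved `./results` directory later, any static file server works; a sketch using Python's standard library (an assumption, not part of OPCT):

```sh
# Serve the saved report directory on port 8000 and open it in a browser.
$ python3 -m http.server 8000 --directory ./results
$ firefox http://localhost:8000
```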
### OPCT-005

- Name: OpenShift Conformance Validation [20]: Filter Priority Requirement >= 99.5%
- Description: The OpenShift Conformance suite must not report a high number of failures after applying filters. Ideally, lower is better, but the e2e tests are frequently updated/improved to fix bugs, and the tested release could eventually be impacted by those issues. An error budget higher than 0.5% could indicate issues in the tested environment. Check the test logs for the OpenShift Conformance suite, Priority section, to isolate the issues.
- Action:
    - Check the failures section `Test failures [high priority]`.
    - Review the logs for each failed test.
    - The remaining failures must be reviewed individually to achieve a successful installation. The root cause of individual failures must be identified.
- Expected: Error budget within the acceptance criteria. Errors within the budget must be reviewed and their root cause identified.
### OPCT-005B

- Name: OpenShift Conformance Validation [20]: Required to Pass After Filters
- Description: The OpenShift Conformance suite must report passing after applying filters that remove common/well-known issues.
- Action: Check the failures section `Test failures [high priority]`. Dependencies must be passing prior to this check.
- Dependencies: OPCT-004, OPCT-005
### OPCT-010

- Name: The cluster logs generate an accepted error budget
- Description: The cluster logs (must-gather event logs) should generate few errors. The error budget is a metric that helps to isolate the health of the cluster. The error counters are relative values based on the values observed in CI executions on tested providers/platforms.
- Action: Check the errors section in the report, and explore the logs for each service in must-gather using tools like omc, omg, grep, etc. (must-gather readers/explorers).
- Expected: The error events in must-gather are relative values based on the values observed on known platforms.
- Troubleshoot: Open the error events section in the report and review the rank of failed keywords, then check the rank by namespace and services for each failure. Error budgets help to focus on specific services that may contribute to the cluster failures.

To check the error counters by e2e test using the HTML report, navigate to `Workload Errors` in the left menu. The table `Error Counters by Namespace` shows the namespaces reporting a high number of errors, ranked by the highest, so you can start exploring the logs in those namespaces. The table `Error Counters by Pod and Pattern` in the `Workload Errors` menu also reports the pods; you can use that information as well to isolate any issue in your environment.

To explore the logs, you can extract the must-gather collected by the plugin `99-openshift-artifacts-collector`:

```sh
# extract must-gather from the results
tar xfz artifact.tar.gz \
    plugins/99-openshift-artifacts-collector/results/global/artifacts_must-gather.tar.xz

# extract must-gather
mkdir must-gather && \
    tar xfJ plugins/99-openshift-artifacts-collector/results/global/artifacts_must-gather.tar.xz \
    -C must-gather

# check workload logs with 'omc' (example: etcd)
omc use must-gather
omc logs -n openshift-etcd etcd-control-plane-0 -c etcd
```
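To approximate the report's error ranking directly from the extracted must-gather, a rough sketch with grep (the keyword and directory layout are assumptions; adjust to your environment):

```sh
# Rank namespaces by the number of 'error' matches in their pod logs.
# The layout (*/namespaces/<ns>/pods/...) is the usual must-gather
# structure; adjust the keyword and glob if yours differs.
for ns in must-gather/*/namespaces/*/; do
  count=$(grep -ri 'error' "${ns}pods" 2>/dev/null | wc -l)
  echo "${count} $(basename "${ns}")"
done | sort -rn | head -10
```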
### OPCT-010A

- Name: etcd logs: slow requests: average should be under 500ms
- Description: The etcd logs must report an average of slow requests lower than 500 milliseconds. Slow requests are a metric that helps to understand the health of etcd. They are relative values based on the values observed on known and tested cloud providers/platforms.
- Action: Review whether the storage volume for the control plane nodes, or the dedicated volume for etcd, has the performance required to run etcd in a production environment.
- Expected: The slow requests in the etcd logs are relative values based on the values observed on known platforms.
- Troubleshoot:
    1. Review the documentation for the required storage for etcd:
        - A) Product Documentation
        - B) Red Hat Article: Understanding etcd and the tunables/conditions affecting performance
        - C) Red Hat Article: How to Use 'fio' to Check Etcd Disk Performance in OCP
        - D) etcd-operator: baseline speed for standard hardware
    2. Check the performance described in article (B); see the fio sketch after the parser example below.
    3. Review the processed values from your environment.

Requirement: it is required to run the conformance validation on a new cluster. The validation parses the etcd logs from the entire cluster, including historical data; if you changed the storage and didn't recreate the cluster, the results will include slow requests from the old storage, impacting the current view.

Run the report with the debug flag `--loglevel=debug`:

```text
(...)
DEBU[2023-09-25T12:52:05-03:00] Check OPCT-010 Failed Acceptance criteria: want=[<500] got=[690.412]
DEBU[2023-09-25T12:52:05-03:00] Check OPCT-011 Failed Acceptance criteria: want=[<1000] got=[3091.49]
```

Extract the information from the logs using the parser utility:

```sh
# Export the path of the extracted must-gather. Example:
export MUST_GATHER_PATH=${PWD}/must-gather.local.2905984348081335046

# Run the utility
cat ${MUST_GATHER_PATH}/*/namespaces/openshift-etcd/pods/*/etcd/etcd/logs/current.log \
    | opct adm parse-etcd-logs --aggregator hour

# Or, use the must-gather path
opct adm parse-etcd-logs --aggregator hour --path ${MUST_GATHER_PATH}
```
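A sketch of the fio check described in articles (B) and (C); the job parameters mirror the benchmark commonly cited in the references below and are not OPCT-specific. Review the fdatasync duration percentiles in the output (the referenced guidance expects the 99th percentile under ~10ms for production etcd):

```sh
# Run on the same volume that backs etcd (e.g. /var/lib/etcd on a control
# plane node). Writes ~22MiB in 2300-byte blocks, fsyncing every write.
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-perf
rm -rf /var/lib/etcd/fio-test
```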
References:
- etcd: Hardware recommendations
- OpenShift Docs: Planning your environment according to object maximums
- OpenShift KCS: Backend Performance Requirements for OpenShift etcd
- IBM: Using Fio to Tell Whether Your Storage is Fast Enough for Etcd
### OPCT-010B

- Name: etcd logs: slow requests: maximum should be under 1000ms
- Description: The etcd logs must report a maximum of slow requests lower than 1000 milliseconds. One or more requests with high latency could impact the cluster performance. Slow requests are a metric that helps to understand the health of etcd; they are relative values based on the values observed on known platforms. The maximum value is the highest slow-request value reported in the etcd logs; it must not be higher than 1 second.
- Action: Review whether the storage volume for the control plane nodes, or the dedicated volume for etcd, has the performance required to run etcd in a production environment.
- Expected: The slow requests in the etcd logs are relative values based on the values observed on known platforms.
- Troubleshoot: Review the Troubleshoot section of OPCT-010A.
- Dependencies: OPCT-010A
### OPCT-011

- Name: The test suite generates an accepted error budget
- Description: The test suite must generate an accepted error budget. The error budget is a metric that helps to understand the health of the test suite: the total number of errors that the test suite can generate before it is considered unreliable. The error budget is a relative value based on the values observed on known platforms. To check the error counters by e2e test using the HTML report, navigate to `Suite Errors` in the left menu and the table `Tests by Error Pattern`. To check the logs, navigate to the Plugin menu and check the `failure` and `systemOut` logs.
- Action: Check the errors section in the report and resolve the log failures for the failed test.
- Expected: The error budget is a relative value based on the values observed on known platforms.
- Troubleshoot: Open the error budget section in the report and review the logs for each failed test.
### OPCT-020

- Name: All nodes must be healthy
- Description: All nodes must be healthy. Node health is a metric that helps to understand the health of the cluster.
- Action: Check the node health section in the report and review the logs for each node.
- Expected: All nodes must be healthy.
- Troubleshoot: One or more nodes were detected as unhealthy when the aggregator server collected the cluster state (at the end of the job). Unhealthy nodes can cause test failures, so this check can be used as a helper while investigating test failures. This check can be skipped if it is not causing failures in the conformance tests. Check the unhealthy nodes in the cluster by reviewing the nodes and their events:

```sh
$ omc get nodes
$ omc describe node <node_name>
```
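To list only the nodes that are not Ready, a small sketch (assuming `omc` prints the same columns as `oc get nodes`):

```sh
# List nodes whose STATUS column is not exactly 'Ready'
# (e.g. NotReady, Ready,SchedulingDisabled).
$ omc get nodes | awk 'NR>1 && $2 != "Ready"'
```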
### OPCT-021

- Name: Pods Healthy must report higher than 98%
- Description: Pods Healthy must report higher than 98%. Pod health is a metric that helps to understand the health of some components; high pod health is a good indicator of cluster health. The 2% error budget is a baseline derived from several executions on known platforms.
- Action: Check the failing pods, and isolate whether they are related to the environment and/or the validation tests.
- Expected: Pods Healthy must report higher than 98%.
- Troubleshoot: One or more pods were detected as unhealthy when the aggregator server collected the cluster state (at the end of the job). Run the CLI command `opct results archive.tar.gz` to review the failed pods, then explore the logs for each pod in the must-gather available in the collector plugin.

Check the unhealthy pods:

```sh
$ ./opct report archive.tar.gz
(...)
 Health summary:      [A=True/P=True/D=True]
 - Cluster Operators : [33/0/0]
 - Node health       : 6/6  (100.00%)
 - Pods health       : 246/247 (99.00%)
 Failed pods:
  Namespace/PodName                                              Healthy  Ready  Reason     Message
  openshift-kube-controller-manager/installer-6-control-plane-1  false    False  PodFailed
(...)
```

Explore the pods:

```sh
$ omc get pods -A | egrep -v '(Running|Completed)'
```
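To dig into a specific failed pod from the must-gather, a sketch reusing the pod reported in the summary above (names are illustrative):

```sh
# Inspect the failed pod reported in the health summary.
$ omc describe pod -n openshift-kube-controller-manager installer-6-control-plane-1
$ omc logs -n openshift-kube-controller-manager installer-6-control-plane-1
```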
### OPCT-022

- Name: Detected one or more plugin(s) with potential invalid result
- Description: The plugin(s) must pass the execution, or generate valid results. The plugins are responsible for executing the conformance test suites and generating the report.
- Action: Check the plugin logs (click on the test name in the job list). If the cause is undefined, re-run the execution.
- Expected: The plugin(s) must pass the execution, or generate valid results.
- Troubleshoot: Review the plugin logs and check the artifacts generated by the plugin. Possible causes of failed plugins:
    - The plugin is not able to execute the tests: check the plugin logs for errors in the directory `plugins` in the report archive.
    - The plugin total counter is equal to the failed counter: check the output of `opct report` indicating the failed plugins.
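A sketch for locating the plugin artifacts and logs inside the result archive; the paths follow the layout shown in OPCT-003 and may vary between versions:

```sh
# List the plugin result directories inside the archive.
$ tar tf archive.tar.gz | grep '^plugins/' | head

# Extract and inspect a plugin's pod log (path layout from OPCT-003;
# adjust the glob to the plugin under review).
$ tar xf archive.tar.gz podlogs
$ less podlogs/openshift-provider-certification/sonobuoy-20-*/logs/plugin.txt
```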
### OPCT-023A

- Name: Sanity [10-openshift-kube-conformance]: potential missing tests in suite
- Description: The Kubernetes Conformance suite must have an acceptable number of tests to be considered a valid execution.
- Action: This is unexpected for a regular cluster validation. Check the plugin logs and the artifacts generated by `opct report` to verify whether the job for the Kubernetes Conformance suite has been completed.
- Expected: The Kubernetes Conformance suite must have at least 300 tests to be valid. This number is based on the Kubernetes Conformance suite across different releases. This is a sanity check to ensure that the plugin is running correctly.
- Troubleshoot: Review the plugin logs and check the artifacts generated by the plugin.
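To roughly count the tests reported by the plugin, a sketch assuming `opct results -p` prints one test per line, as in the OPCT-003 example (header lines make the count approximate):

```sh
# Approximate count of tests executed by the kube-conformance plugin.
$ ./opct results -p 10-openshift-kube-conformance archive.tar.gz | wc -l
```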
### OPCT-023B

- Name: Sanity [20-openshift-conformance-validated]: potential missing tests in suite
- Description: The OpenShift Conformance suite must have an acceptable number of tests to be considered a valid execution.
- Action: Review the plugin logs and check the artifacts generated by the plugin.
- Expected: The OpenShift Conformance suite must have at least 3000 tests to be valid. This number is based on the OpenShift Conformance suite across different releases. This is a sanity check to ensure that the plugin is running correctly.
### OPCT-030
- Name: Node Topology: ControlPlaneTopology HighlyAvailable must use multi-zone
- Description: The control plane nodes must be distributed across multiple zones to ensure high availability.
- Action: Check the control plane nodes and ensure that the nodes are distributed across multiple zones.
- Expected: The control plane nodes must be distributed across multiple zones to ensure high availability.
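To verify the zone distribution of the control plane nodes, a sketch using the upstream well-known topology label (not OPCT-specific):

```sh
# Show the zone label of each control plane node.
$ oc get nodes -l node-role.kubernetes.io/master \
    -L topology.kubernetes.io/zone
```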
### Platform Type must be supported by OPCT

- Name: Platform Type must be supported by OPCT
- Description: The platform type must be supported by the OPCT tool to generate valid and tested reports. You can run the conformance tests on different platforms, but the OPCT results are tested with specific platforms, and the report is built and calibrated based on the tested platforms.
- Action: Check the platform type in the report and ensure that the platform is supported by the OPCT tool.
- Expected: The platform type must be supported by the OPCT tool to generate valid and tested reports.
- Troubleshoot: Review the platform type in the report and check the artifacts generated by the plugin: `oc get infrastructure`
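To read the platform type directly from the cluster, a sketch using the standard Infrastructure API:

```sh
# Print the platform type reported by the cluster.
$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}{"\n"}'
```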
### Cluster Version Operator must be Available

- Name: Cluster Version Operator must be Available
- Description: The Cluster Version Operator must be available to ensure that the cluster is in a healthy state.
- Action: Check the Cluster Version Operator logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The Cluster Version Operator must be available to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the Cluster Version Operator logs and check the artifacts generated by the plugin.
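To inspect the ClusterVersion conditions evaluated by this check and the three that follow (Available, Failing, Progressing, ReleaseAccepted), a sketch using the standard ClusterVersion API:

```sh
# Expect Available=True, Failing=False, Progressing=False, ReleaseAccepted=True.
$ oc get clusterversion version \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```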
### Cluster condition Failing must be False

- Name: Cluster condition Failing must be False
- Description: The cluster condition `Failing` must be False to ensure that the cluster is in a healthy state.
- Action: Check the cluster condition logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The cluster condition `Failing` must be False to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the cluster condition logs and check the artifacts generated by the plugin.

### Cluster upgrade must not be Progressing

- Name: Cluster upgrade must not be Progressing
- Description: The cluster condition `Progressing` must be False (no upgrade in progress) to ensure that the cluster is in a healthy state.
- Action: Check the cluster upgrade logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The cluster upgrade must not be `Progressing` to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the cluster upgrade logs and check the artifacts generated by the plugin.

### Cluster ReleaseAccepted must be True

- Name: Cluster ReleaseAccepted must be True
- Description: The cluster condition `ReleaseAccepted` must be True to ensure that the cluster is in a healthy state.
- Action: Check the `ReleaseAccepted` condition logs (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The cluster condition `ReleaseAccepted` must be True to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the `ReleaseAccepted` condition logs and check the artifacts generated by the plugin.
### Infrastructure status must have Topology=HighlyAvailable

- Name: Infrastructure status must have Topology=HighlyAvailable
- Description: The infrastructure status must have `Topology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Action: Check the infrastructure status (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The infrastructure status must have `Topology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the infrastructure status and check the artifacts generated by the plugin.

### Infrastructure status must have ControlPlaneTopology=HighlyAvailable

- Name: Infrastructure status must have ControlPlaneTopology=HighlyAvailable
- Description: The infrastructure status must have `ControlPlaneTopology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Action: Check the infrastructure status (click on the test name in the job list). If the cause is undefined, re-run the execution and check the logs for errors.
- Expected: The infrastructure status must have `ControlPlaneTopology=HighlyAvailable` to ensure that the cluster is in a healthy state.
- Troubleshoot: Review the infrastructure status and check the artifacts generated by the plugin.
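To read both topology fields evaluated by these two checks, a sketch using the standard Infrastructure API (the status fields are `infrastructureTopology` and `controlPlaneTopology`):

```sh
# Both values are expected to be HighlyAvailable.
$ oc get infrastructure cluster -o \
    jsonpath='{.status.infrastructureTopology} {.status.controlPlaneTopology}{"\n"}'
```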
Page generated automatically by `opct adm generate checks-docs`