Installation Review
Note: this document is constantly updated and provides guidance to review the installed environment. You are always encouraged to review the product documentation first: docs.openshift.com.
This document complements the official page of "Installing a cluster on any platform" to review specific configurations and components after the cluster has been installed.
This document is also a helper for the "OPCT - Installation Checklist" user document.
Compute
- Minimum requirements for Compute nodes: User Documentation -> Pre-requisites
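To compare the installed nodes against those minimums, a quick review with oc could look like the sketch below (assuming cluster access; adjust the output columns as needed):
# List nodes with their roles, then show per-node CPU and memory capacity.
oc get nodes -o wide
oc get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory'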
Load Balancers
Review the Load Balancer requirements: Load balancing requirements for user-provisioned infrastructure
Review the Load Balancer size
The Load Balancer used by the API must support a throughput higher than 100 Mbps.
Review the private Load Balancer
A basic OpenShift installation with support for external Load Balancers deploys three Load Balancers: a public and a private one for the control plane services (Kubernetes API and Machine Config Server), and a public one for the ingress.
The private Load Balancer must be resolvable through the DNS record api-int.<cluster>.<domain>, which is used by internal services.
Reference: User-provisioned DNS requirements
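To verify those records, a quick lookup from a host inside the cluster network could look like this sketch (cluster and domain are placeholders; api-int must resolve to the private Load Balancer address):
# Both records should resolve; api-int.<cluster>.<domain> must point to the private Load Balancer.
dig +short api.<cluster>.<domain>
dig +short api-int.<cluster>.<domain>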
Review Health Check configurations
The kube-apiserver has a graceful termination engine that requires the Load Balancer health check to probe the HTTP path (/readyz), rather than only the TCP port.
Service | Protocol | Port | Path | Threshold | Interval (s) | Timeout (s) |
---|---|---|---|---|---|---|
Kubernetes API Server | HTTPS* | 6443 | /readyz | 2 | 10 | 10 |
Machine Config Server | HTTPS* | 22623 | /healthz | 2 | 10 | 10 |
Ingress | TCP | 80 | - | 2 | 10 | 10 |
Ingress | TCP | 443 | - | 2 | 10 | 10 |
Reminder for the API Load Balancer Health Check:
"The load balancer must be configured to take a maximum of 30 seconds from the time the API server turns off the /readyz endpoint to the removal of the API server instance from the pool. Within the time frame after /readyz returns an error or becomes healthy, the endpoint must have been removed or added. Probing every 5 or 10 seconds, with two successful requests to become healthy and three to become unhealthy, are well-tested values." Load balancing requirements for user-provisioned infrastructure
Review Hairpin Traffic
Hairpin traffic is when a backend node's traffic is load-balanced to itself. If this type of network traffic is dropped because your load balancer does not allow hairpin traffic, you need to provide a solution.
On the integrated clouds that do not support hairpin traffic, OpenShift provides a static pod to redirect traffic destined for the load balancer VIP back to the kube-apiserver on the node.
For reference (this is not a recommendation; any solution you provide will not be supported by Red Hat):
- Static pods to redirect hairpin traffic for Azure
- Static pods to redirect hairpin traffic for AlibabaCloud
Steps to reproduce hairpin traffic to a node (see the sketch after this list):
- deploy one sample pod
- add one service with a node port
- create the load balancer with a listener on any port (for example, 80)
- create the backend/target group pointing to the node port
- add the node on which the pod is running to the load balancer backend/target group
- from the pod, try to reach the load balancer IP/DNS
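The sketch below walks through the cluster-side steps with oc; the project, pod, and image names are examples, and the load balancer itself must be created in your cloud provider:
# Deploy one sample pod serving HTTP on port 8080.
oc new-project hairpin-test
oc run hello --image=registry.access.redhat.com/ubi9/httpd-24 --port=8080

# Expose it with a NodePort service and note the allocated node port.
oc expose pod hello --type=NodePort --port=8080
oc get svc hello -o jsonpath='{.spec.ports[0].nodePort}{"\n"}'

# Create the load balancer, listener (for example, port 80), and backend/target group
# pointing to that node port on the node where the pod runs (provider-specific step).

# From inside the pod, try to reach the load balancer address.
oc rsh hello curl -sv http://<load-balancer-ip-or-dns>/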
Components
etcd
Review etcd's disk speed requirements:
- etcd: Hardware recommendations
- OpenShift Docs: Planning your environment according to object maximums
- OpenShift KCS: Backend Performance Requirements for OpenShift etcd
- IBM: Using Fio to Tell Whether Your Storage is Fast Enough for Etcd
Review disk performance with etcd-fio
The KCS "How to Use 'fio' to Check Etcd Disk Performance in OCP" is a guide to check if the disk used by etcd has the expected performance on OpenShift.
Review etcd logs: etcd slow requests
This section provides a guide to check the etcd slow requests from the logs of the etcd pods, to understand how etcd is performing while running the e2e tests.
The steps below use a utility, insights-ocp-etcd-logs (extracted below as ocp-etcd-log-filters), to parse the logs, aggregate the requests into buckets of 100ms from 200ms to 1s, and report the result on stdout.
This utility helps you troubleshoot slow requests in your cluster and supports decisions such as changing the type of block device used by the control plane, increasing IOPS, or changing the instance flavor.
There is no magic or target number but, as a reference based on observations from integrated platforms, no more than 30-40% of requests should be above 500ms while running the conformance tests.
TODO: provide guidance on how to get the errors from the etcd pods, and parse it into buckets of latency to understand the performance of the etcd while running the validated environment.
- Export the location where the must-gather has been extracted:
export MUST_GATHER_PATH=${PWD}/must-gather.local.2905984348081335046
- Extract the utility from the tools repository (the binary will be available once this card is completed: https://issues.redhat.com/browse/SPLAT-857):
oc image extract quay.io/ocp-cert/tools:latest --file="/usr/bin/ocp-etcd-log-filters"
chmod u+x ocp-etcd-log-filters
- Overall report:
Note: this report may not be useful depending on how old the logs are. We recommend looking at the next report, which aggregates by hour, so you can check the time frame in which the validation environment was executed.
$ cat ${MUST_GATHER_PATH}/*/namespaces/openshift-etcd/pods/*/etcd/etcd/logs/current.log \
| ./ocp-etcd-log-filters
> Filter Name: ApplyTookTooLong
> Group by: all
>>> Summary <<<
all 16949
>500ms 1485 (8.762 %)
---
>>> Buckets <<<
low-200 0 (0.000 %)
200-300 9340 (55.106 %)
300-400 4169 (24.597 %)
400-500 1853 (10.933 %)
500-600 716 (4.224 %)
600-700 223 (1.316 %)
700-800 185 (1.092 %)
800-900 139 (0.820 %)
900-1s 79 (0.466 %)
1s-inf 143 (0.844 %)
unkw 102 (0.602 %)
- Report aggregated by hour:
$ cat ${MUST_GATHER_PATH}/*/namespaces/openshift-etcd/pods/*/etcd/etcd/logs/current.log \
| ./ocp-etcd-log-filters -aggregator hour
> Filter Name: ApplyTookTooLong
> Group by: hour
>> 2023-03-01T17
>>> Summary <<<
all 558
>500ms 54 (9.677 %)
---
>>> Buckets <<<
low-200 0 (0.000 %)
200-300 385 (68.996 %)
300-400 90 (16.129 %)
400-500 28 (5.018 %)
500-600 9 (1.613 %)
600-700 10 (1.792 %)
700-800 7 (1.254 %)
800-900 9 (1.613 %)
900-1s 16 (2.867 %)
1s-inf 3 (0.538 %)
unkw 1 (0.179 %)
(...)
>> 2023-03-01T16
>>> Summary <<<
all 8651
>500ms 812 (9.386 %)
---
>>> Buckets <<<
low-200 0 (0.000 %)
200-300 4833 (55.866 %)
300-400 1972 (22.795 %)
400-500 983 (11.363 %)
500-600 328 (3.791 %)
600-700 135 (1.561 %)
700-800 111 (1.283 %)
800-900 75 (0.867 %)
900-1s 48 (0.555 %)
1s-inf 115 (1.329 %)
unkw 51 (0.590 %)
The values in the outputs above are a reference for expected results: most of the slow requests reported in the logs (>=200ms) should be under 500ms while the tests are executing.
Mount /var/lib/etcd in a separate disk
One way to improve etcd performance is to use a dedicated block device.
You can mount /var/lib/etcd on a dedicated disk by following the documentation:
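Independent of the documented mount procedure, a quick verification sketch from a debug shell shows whether /var/lib/etcd is already backed by its own device (the node name is a placeholder):
# Show the filesystem and source device backing /var/lib/etcd on a control plane node.
# If it reports the same mount as the root filesystem, etcd still shares the root disk.
oc debug node/<control-plane-node> -- chroot /host findmnt -T /var/lib/etcd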
Image Registry
You should be able to access the registry and make sure you can push and pull images to it; otherwise, the e2e tests will be reported as failed.
Please check the OpenShift documentation to validate it:
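As a hedged sketch of one way to exercise push and pull from outside the cluster, assuming the registry's default route has been exposed (spec.defaultRoute=true in the Image Registry Operator configuration); the project and image names below are placeholders:
# Discover the exposed registry route and log in with your OpenShift token.
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry -o jsonpath='{.spec.host}')
podman login -u "$(oc whoami)" -p "$(oc whoami -t)" --tls-verify=false "${REGISTRY_HOST}"

# Pull a small public image, retag it, and push it into an existing project.
podman pull registry.access.redhat.com/ubi9/ubi-minimal
podman tag registry.access.redhat.com/ubi9/ubi-minimal "${REGISTRY_HOST}/<project>/ubi-minimal:test"
podman push --tls-verify=false "${REGISTRY_HOST}/<project>/ubi-minimal:test"

# Confirm the pushed image shows up as an image stream tag in the project.
oc get imagestreamtags -n <project>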
You can also explore the OpenShift sample projects that create PVCs and BuildConfigs (which result in images being built and pushed to the image registry). For example:
oc new-app nodejs-postgresql-persistent
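To confirm the sample actually builds and pushes an image to the internal registry, you can follow the build and then list the resulting image stream tags (a sketch; the resource names come from the template and may vary by OpenShift version):
# Follow the build triggered by the template until it completes.
oc logs -f buildconfig/nodejs-postgresql-persistent
# Check build status and the image stream tags created in the current project.
oc get builds
oc get imagestreamtags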