OCPBUGS-83786:OCPSTRAT-3082:TLS propagation test#30801
OCPBUGS-83786:OCPSTRAT-3082:TLS propagation test#30801openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
/pipeline required |
|
Scheduling required tests: |
|
/retest |
|
/test e2e-metal-ipi-ovn-ipv6 |
|
Scheduling required tests: |
|
/retest-required |
|
Job Failure Risk Analysis for sha: 21e1845
|
|
Scheduling required tests: |
|
/retest-required |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a new end-to-end TLS ObservedConfig test suite: registers the suite, imports the TLS extended tests, adds OWNERS, and implements per-target tests validating operator observedConfig, deployment TLS env-vars, and wire-level TLS including a TLS profile change/restore flow. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Suite
participant DynClient as Dynamic Client
participant ApiServer as Cluster APIServer
participant Deployment as Target Deployment
participant ClusterOp as ClusterOperator
Test->>DynClient: Fetch operator config (observedConfig)
DynClient-->>Test: observedConfig (minTLSVersion, cipherSuites)
Test->>Deployment: Inspect container env vars
Deployment-->>Test: Env var values
Test->>Deployment: Port-forward to service and perform TLS handshake checks
Deployment-->>Test: TLS handshake results
Test->>ApiServer: Update TLS profile (Modern) [config-change test]
ApiServer-->>Test: Profile updated
Test->>ClusterOp: Wait for related ClusterOperators to stabilize
ClusterOp-->>Test: Stable
Test->>ApiServer: Restore original TLS profile
ApiServer-->>Test: Profile restored
Test->>ClusterOp: Wait for stabilization
ClusterOp-->>Test: Stable
sequenceDiagram
participant Test as Test Suite
participant Validator as Validation Routines
participant RolloutMgr as Rollout Watcher
participant ClusterOp as ClusterOperators
Test->>Validator: Trigger per-target validations (observedConfig, env, wire)
Validator-->>Test: Validation results
Test->>RolloutMgr: Initiate rollout waits after profile change
RolloutMgr->>ClusterOp: Poll cluster operator statuses
ClusterOp-->>RolloutMgr: Report progress/completion
RolloutMgr-->>Test: Relevant operators stabilized
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/tls/tls_observed_config.go`:
- Around line 702-712: The TLS probes use tls.Dial (in the block handling
host/shouldWork and host/shouldNotWork) which can hang; replace calls to
tls.Dial with tls.DialWithDialer using a net.Dialer that sets a reasonable
Timeout (e.g., a few seconds) so both the successful probe and the
rejected-version probe use a dialer with explicit handshake timeout, preserve
the same error handling around conn, ConnectionState(), and conn.Close(), and
ensure the dialer timeout variable is reused for both DialWithDialer invocations
to avoid duplicated magic values.
- Around line 507-510: The namespace lookup currently treats any error as "not
found" and skips the test; change the error handling after
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}) to import k8s.io/apimachinery/pkg/api/errors as kapierrors
and only call g.Skip when kapierrors.IsNotFound(err) is true, otherwise fail the
test (e.g., call the test failure helper you use in this file) so
auth/transport/transient errors are not masked; apply the same change to the
second occurrence of the same oc.AdminKubeClient().CoreV1().Namespaces().Get
call in this file.
- Around line 654-676: Replace the unreliable stdout check after starting
port-forward (where e2e.StartCmdAndStreamOutput is used and
readPartialFrom(stdout, 1024) is called) with an active TCP readiness probe:
repeatedly attempt to open a TCP connection to 127.0.0.1:localPort until success
or a short deadline elapses, then proceed to call toExecute(localPort). Use the
existing ctx/cancel or a separate short timeout for the probe, keep defers for
stdout/stderr and TryKill(cmd), and ensure probe retries/backoff are short so
the test fails fast if the tunnel never becomes reachable.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (4)
pkg/testsuites/standard_suites.gotest/extended/include.gotest/extended/tls/OWNERStest/extended/tls/tls_observed_config.go
|
Scheduling required tests: |
|
/testwith openshift/origin/main/e2e-metal-ipi-serial-ovn-ipv6 openshift/cluster-openshift-apiserver-operator#657 openshift/cluster-image-registry-operator#1297 openshift/cluster-version-operator#1322 |
|
@gangwgr, |
There was a problem hiding this comment.
♻️ Duplicate comments (4)
test/extended/tls/tls_observed_config.go (4)
698-755:⚠️ Potential issue | 🟠 MajorAdd explicit handshake timeouts for TLS dial operations.
tls.Dialwithout deadlines can hang indefinitely under network edge cases, blocking test progression. This was flagged in a previous review. Usetls.DialWithDialerwith a timeout dialer.🔧 Proposed fix
func checkTLSConnection(localPort int, shouldWork, shouldNotWork *tls.Config, t tlsTarget) error { // Try both IPv4 and IPv6 localhost addresses for dual-stack support. hosts := []string{ fmt.Sprintf("127.0.0.1:%d", localPort), // IPv4 fmt.Sprintf("[::1]:%d", localPort), // IPv6 } var host string var conn *tls.Conn var err error + dialer := &net.Dialer{Timeout: 5 * time.Second} // Find a working host (try IPv4 first, then IPv6). for _, h := range hosts { - conn, err = tls.Dial("tcp", h, shouldWork) + conn, err = tls.DialWithDialer(dialer, "tcp", h, shouldWork) if err == nil { host = h break } // ... error handling ... } // ... // Test that the version that should not work is rejected. - conn, err = tls.Dial("tcp", host, shouldNotWork) + conn, err = tls.DialWithDialer(dialer, "tcp", host, shouldNotWork)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/tls/tls_observed_config.go` around lines 698 - 755, The TLS dials in checkTLSConnection can hang; change both tls.Dial calls to use tls.DialWithDialer with a net.Dialer that sets a timeout (e.g. Dialer{Timeout: 10*time.Second}) so the initial probe (inside the hosts loop using shouldWork) and the subsequent negative probe (using shouldNotWork) will fail fast instead of blocking indefinitely; locate tls.Dial invocations in checkTLSConnection and replace them with tls.DialWithDialer(&net.Dialer{Timeout: <reasonable duration>}, "tcp", hostOrH, tlsConfig).
506-510:⚠️ Potential issue | 🟠 MajorSkip only on NotFound errors; fail on other namespace lookup failures.
The current code treats any API error as "namespace doesn't exist" and skips, which masks real failures like auth errors or transient issues. This was flagged in a previous review.
🐛 Proposed fix
g.By(fmt.Sprintf("verifying namespace %s exists", t.namespace)) _, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{}) - if err != nil { + if kapierrors.IsNotFound(err) { g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace)) } + o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("failed to query namespace %s", t.namespace))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/tls/tls_observed_config.go` around lines 506 - 510, The code currently skips on any error from oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{}); change it to only skip when the error is a NotFound (use k8s api errors.IsNotFound(err) from "k8s.io/apimachinery/pkg/api/errors") and for any other non-nil error fail the test (use g.Fail or a test fatal with the error) so real API/auth/transient failures aren't masked; keep the existing g.Skip(fmt.Sprintf("Namespace %s does not exist...", t.namespace)) branch for IsNotFound and report the actual err in the failing branch.
648-683:⚠️ Potential issue | 🟠 MajorAdd active readiness probe for port-forward tunnel establishment.
The code starts
oc port-forwardand reads a chunk from stdout, but this doesn't guarantee the tunnel accepts connections. The 10-second context timeout also covers both tunnel setup and TLS test execution, which is too tight. This was flagged in a previous review.🔧 Proposed fix with TCP readiness probe and extended timeout
func forwardPortAndExecute(serviceName, namespace, remotePort string, toExecute func(localPort int) error) error { var err error for i := 0; i < 3; i++ { if err = func() error { - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) defer cancel() - localPort := rand.Intn(65534-1025) + 1025 + localPort := rand.Intn(64510) + 1025 // [1025, 65534] args := []string{ "port-forward", fmt.Sprintf("svc/%s", serviceName), fmt.Sprintf("%d:%s", localPort, remotePort), "-n", namespace, } cmd := exec.CommandContext(ctx, "oc", args...) stdout, stderr, err := e2e.StartCmdAndStreamOutput(cmd) if err != nil { return fmt.Errorf("failed to start port-forward: %v", err) } defer stdout.Close() defer stderr.Close() defer e2e.TryKill(cmd) - // Give port-forward a moment to establish. - e2e.Logf("oc port-forward output: %s", readPartialFrom(stdout, 1024)) + // Wait until forwarded local port is reachable via active TCP probe. + deadline := time.Now().Add(5 * time.Second) + for { + conn, dialErr := net.DialTimeout("tcp", fmt.Sprintf("127.0.0.1:%d", localPort), 500*time.Millisecond) + if dialErr == nil { + _ = conn.Close() + break + } + if time.Now().After(deadline) { + return fmt.Errorf("port-forward did not become ready for svc/%s in %s: %w", serviceName, namespace, dialErr) + } + time.Sleep(200 * time.Millisecond) + } + e2e.Logf("oc port-forward established to svc/%s in %s on local port %d", serviceName, namespace, localPort) return toExecute(localPort) }(); err == nil {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/tls/tls_observed_config.go` around lines 648 - 683, forwardPortAndExecute currently starts oc port-forward with a single 10s context and only reads stdout (via readPartialFrom/StartCmdAndStreamOutput), which doesn't guarantee the tunnel accepts connections; change it to create a dedicated, longer context for the port-forward command (e.g., 30s) and perform an active TCP readiness probe against localhost:localPort (using net.Dial or net.DialTimeout) with several short retries and backoff to confirm the tunnel is accepting connections before calling toExecute; ensure you still defer stdout.Close(), stderr.Close(), and e2e.TryKill(cmd) on failure, and separate the port-forward lifecycle from the test execution (i.e., do not have the port-forward context encompass the entire toExecute run).
579-583:⚠️ Potential issue | 🟠 MajorSame issue: skip only on NotFound errors.
This has the same error handling issue as
testDeploymentTLSEnvVars— any error causes a skip rather than only NotFound errors.🐛 Proposed fix
g.By("verifying namespace exists") _, err = oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{}) - if err != nil { + if kapierrors.IsNotFound(err) { g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace)) } + o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("failed to query namespace %s", t.namespace))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/tls/tls_observed_config.go` around lines 579 - 583, The namespace existence check currently skips the test for any Get error; change it to only skip when the error is a NotFound error by using k8s.io/apimachinery/pkg/api/errors.IsNotFound(err) against the result of oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{}), i.e. if apierrors.IsNotFound(err) then g.Skip(...) else if err != nil then fail the test (return or call the test failure helper used in these tests) so genuine API errors surface instead of being treated as missing namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/extended/tls/tls_observed_config.go`:
- Around line 698-755: The TLS dials in checkTLSConnection can hang; change both
tls.Dial calls to use tls.DialWithDialer with a net.Dialer that sets a timeout
(e.g. Dialer{Timeout: 10*time.Second}) so the initial probe (inside the hosts
loop using shouldWork) and the subsequent negative probe (using shouldNotWork)
will fail fast instead of blocking indefinitely; locate tls.Dial invocations in
checkTLSConnection and replace them with tls.DialWithDialer(&net.Dialer{Timeout:
<reasonable duration>}, "tcp", hostOrH, tlsConfig).
- Around line 506-510: The code currently skips on any error from
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}); change it to only skip when the error is a NotFound (use
k8s api errors.IsNotFound(err) from "k8s.io/apimachinery/pkg/api/errors") and
for any other non-nil error fail the test (use g.Fail or a test fatal with the
error) so real API/auth/transient failures aren't masked; keep the existing
g.Skip(fmt.Sprintf("Namespace %s does not exist...", t.namespace)) branch for
IsNotFound and report the actual err in the failing branch.
- Around line 648-683: forwardPortAndExecute currently starts oc port-forward
with a single 10s context and only reads stdout (via
readPartialFrom/StartCmdAndStreamOutput), which doesn't guarantee the tunnel
accepts connections; change it to create a dedicated, longer context for the
port-forward command (e.g., 30s) and perform an active TCP readiness probe
against localhost:localPort (using net.Dial or net.DialTimeout) with several
short retries and backoff to confirm the tunnel is accepting connections before
calling toExecute; ensure you still defer stdout.Close(), stderr.Close(), and
e2e.TryKill(cmd) on failure, and separate the port-forward lifecycle from the
test execution (i.e., do not have the port-forward context encompass the entire
toExecute run).
- Around line 579-583: The namespace existence check currently skips the test
for any Get error; change it to only skip when the error is a NotFound error by
using k8s.io/apimachinery/pkg/api/errors.IsNotFound(err) against the result of
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}), i.e. if apierrors.IsNotFound(err) then g.Skip(...) else if
err != nil then fail the test (return or call the test failure helper used in
these tests) so genuine API errors surface instead of being treated as missing
namespace.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (4)
pkg/testsuites/standard_suites.gotest/extended/include.gotest/extended/tls/OWNERStest/extended/tls/tls_observed_config.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/extended/include.go
- test/extended/tls/OWNERS
|
Scheduling required tests: |
|
/testwith openshift/origin/main/e2e-metal-ipi-serial-ovn-ipv6 openshift/cluster-openshift-apiserver-operator#657 openshift/cluster-image-registry-operator#1297 openshift/cluster-version-operator#1322 |
|
/testwith openshift/origin/master/e2e-metal-ipi-serial-ovn-ipv6 openshift/cluster-openshift-apiserver-operator#657 openshift/cluster-image-registry-operator#1297 openshift/cluster-version-operator#1322 |
|
@gangwgr, |
|
@gangwgr, |
|
Scheduling required tests: |
|
Scheduling required tests: |
|
@gangwgr: This pull request references OCPSTRAT-3082 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the initiative to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test periodic-ci-openshift-tls-scanner-main-periodic-tls13-conformance |
|
/lgtm |
|
/assign @dgoodwin |
|
/verified by @gangwgr |
|
@gangwgr: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bertinatto, gangwgr, ricardomaraschini The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Scheduling required tests: |
dcd145e
into
openshift:main
|
/cherry-pick release-4.22 |
|
@gangwgr: new pull request created: #31032 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@gangwgr: Jira Issue Verification Checks: Jira Issue OCPBUGS-83786 Jira Issue OCPBUGS-83786 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.22 |
|
@gangwgr: new pull request created: #31038 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Test run -https://privatebin.corp.redhat.com/?b0f760b57d709f07#F2u5YDsSTXrxCVWyKze9Jj4bP424HE2ZfhR6HhEFRJ2Y
Summary by CodeRabbit
Tests
Chores