feat(pipeline) migrate tidb nextgen job to gcp #4471

Merged
ti-chi-bot[bot] merged 7 commits into PingCAP-QE:main from dillon-zheng:nextgen-migrate
Apr 7, 2026

Conversation

Contributor

@dillon-zheng dillon-zheng commented Apr 7, 2026

Replay results for the 6 jobs:

pull_build_next_gen https://prow.tidb.net/jenkins/job/pingcap/job/tidb/job/pull_build_next_gen/5/ SUCCESS 2026-04-06 22:08:45
pull_integration_e2e_test_next_gen https://prow.tidb.net/jenkins/job/pingcap/job/tidb/job/pull_integration_e2e_test_next_gen/7/ SUCCESS 2026-04-07 11:21:49
pull_integration_realcluster_test_next_gen https://prow.tidb.net/jenkins/job/pingcap/job/tidb/job/pull_integration_realcluster_test_next_gen/4/ SUCCESS 2026-04-06 21:51:57
pull_mysql_client_test_next_gen https://prow.tidb.net/jenkins/job/pingcap/job/tidb/job/pull_mysql_client_test_next_gen/2/ SUCCESS 2026-04-07 11:16:47
pull_unit_test_next_gen https://prow.tidb.net/jenkins/job/pingcap/job/tidb/job/pull_unit_test_next_gen/11/ SUCCESS 2026-04-07 09:05:52
pull_mysql_test_next_gen https://prow.tidb.net/jenkins/job/pingcap/job/tidb/job/pull_mysql_test_next_gen/11/ FAILURE (optional: true, same err to IDC) 2026-04-07 14:08:32


@ti-chi-bot ti-chi-bot bot left a comment


I have done a preliminary review to help you improve this PR.

Summary

This PR migrates the NextGen CI pipeline jobs for TiDB to Google Cloud Platform (GCP). The migration involves updating pipeline scripts, pod configurations, and Jenkins job definitions to use GCP-specific settings. The approach includes replacing legacy URLs and volume definitions with GCP-compatible alternatives, and adding temporary fixes for cache issues and flaky tests. While the migration is functional, the PR contains areas for improvement in maintainability, error handling, and best practices.


Critical Issues

  • Hardcoded Token Handling in OCI Authentication

    • File: pull_integration_e2e_test_next_gen/pipeline.groovy, lines 51-59.
    • Issue: The code retrieves an OAuth token from GCP metadata and directly pipes it into the oras login command. If the token retrieval fails or is invalid, this could cause unexpected behavior without proper error handling.
    • Suggested Fix:
      token=$(curl -fsSL -H 'Metadata-Flavor: Google' \
          'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token' | jq -r '.access_token') || {
          echo "Failed to retrieve OAuth token"; exit 1;
      }
      if [[ -n "$token" && "$token" != "null" ]]; then
          echo "$token" | oras login "$registry_host" -u oauth2accesstoken --password-stdin
      else
          echo "Invalid token retrieved"; exit 1;
      fi
    • Impact: Prevents silent failures during OCI authentication.
  • Temporary Hotfix Logic

    • Files: Various pipeline scripts (e.g., pull_build_next_gen/pipeline.groovy, pull_integration_realcluster_test_next_gen/pipeline.groovy).
    • Issue: Temporary fixes (e.g., sed commands for modifying cache URLs and Bazel configurations) are scattered across multiple pipeline stages. These lack clear expiry mechanisms or documentation to indicate when they should be removed.
    • Suggested Fix: Encapsulate these hotfixes in a centralized script with clear comments and a defined expiration mechanism. For example:
      # hotfix_cleanup.sh
      # Temporary fixes for GCP migration (to be removed by MM/YYYY).
      sed -i -E '/legacy-cache-urls/d' WORKSPACE DEPS.bzl
      sed -i 's/^check: check-bazel-prepare /check: /' Makefile || true
    • Impact: Improves maintainability and reduces redundancy.

Code Improvements

  • Volume Definitions in Pod YAML

    • Files: Multiple pod YAML files (e.g., pull_build_next_gen/pod.yaml, pull_unit_test_next_gen/pod.yaml).
    • Issue: Persistent volumes are replaced with ephemeral volumes without justification or fallback mechanisms if ephemeral storage is insufficient.
    • Suggested Fix: Add comments explaining the rationale for ephemeral storage and consider fallback mechanisms:
      volumes:
        - name: bazel-out-merged
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes:
                  - ReadWriteOnce
                resources:
                  requests:
                    storage: 150Gi
                storageClassName: hyperdisk-rwo
      # Fallback mechanism: Consider persistent volumes if ephemeral storage fails.
    • Impact: Ensures clarity and robustness of storage configurations.
  • Resource Allocation

    • Files: Pod YAML files (e.g., pull_unit_test_next_gen/pod.yaml).
    • Issue: Memory and CPU limits appear arbitrary and inconsistently scaled across jobs.
    • Suggested Fix: Conduct benchmarks to determine the optimal resource allocation and document findings in comments.
    • Impact: Prevents over-provisioning and potential resource waste in GCP.
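As a sketch of the documentation convention this suggests, the benchmark rationale could live next to each setting; all numbers below are placeholders, not values from this PR or from real measurements:

```yaml
# Illustrative only: replace each figure with benchmarked values.
resources:
  requests:
    memory: 16Gi   # TODO: set to observed peak RSS across replay runs + headroom
    cpu: "8"       # TODO: set to observed peak CPU utilization
  limits:
    memory: 32Gi
    cpu: "16"
```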

Best Practices

  • Testing Coverage for Migration Logic

    • Files: Pipeline scripts (e.g., pull_mysql_client_test_next_gen/pipeline.groovy).
    • Issue: No evidence of automated tests for the migration-specific changes (e.g., cache URL updates, OCI authentication).
    • Suggested Fix: Add unit tests or validation steps to ensure migration logic works as intended. For example:
      stage('Validate Migration Changes') {
          steps {
              sh '''
              grep -E 'https://us-go.pkg.dev/pingcap-testing-account/go-proxy-remote' WORKSPACE || {
                  echo "Migration validation failed"; exit 1;
              }
              '''
          }
      }
    • Impact: Ensures reliability of migration changes.
  • Documentation

    • Files: Multiple pipeline files.
    • Issue: Many changes lack comments explaining why they are necessary (e.g., cache URL updates, resource adjustments).
    • Suggested Fix: Add comments to clarify the purpose of changes, particularly temporary fixes and resource definitions.
    • Example:
      // Temporary fix to clean legacy cache URLs for GCP migration.
      sed -i -E '/legacy-cache-urls/d' WORKSPACE DEPS.bzl
  • Consistency in Naming Conventions

    • Files: Example pull_mysql_test_next_gen/pipeline.groovy.
    • Issue: Environment variables like OCI_ARTIFACT_HOST are inconsistently defined across pipeline files.
    • Suggested Fix: Standardize environment variable naming and initialization across all jobs.

Summary of Changes

This PR is functional but requires improvements in error handling, maintainability, and documentation. Addressing critical issues with authentication and temporary fixes should be prioritized.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request migrates several TiDB CI pipelines to GCP-hosted infrastructure. Key changes include updating container images to GCP registries, configuring GOPROXY, and optimizing resource allocations. A temporary "Hotfix" stage was added to various pipelines to clean up legacy Bazel cache URLs and adjust build settings. Additionally, volume configurations were simplified, transitioning bazel-rc from secrets to ConfigMaps and adopting ephemeral storage for Bazel outputs in some jobs. Feedback suggests ensuring the existence of specific subdirectories after mount point changes and extending the use of high-performance ephemeral storage (hyperdisk-rwo) to more pipelines for consistency and performance.

Comment on lines +46 to +57
sh '''#!/usr/bin/env bash
set -euxo pipefail

# Clean legacy cache/mirror URLs that are unstable on GCP workers.
for f in WORKSPACE DEPS.bzl; do
  [ -f "$f" ] || continue
  sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
done

# Avoid "check" targets re-writing legacy cache settings during migration replay.
sed -i 's/^check: check-bazel-prepare /check: /' Makefile || true
'''
Contributor


Severity: high

The mount point for bazel-out-merged was changed from /home/jenkins/.tidb/tmp to /home/jenkins/.tidb in the corresponding pod.yaml. Since the volume starts empty, the tmp subdirectory will not exist, which may cause build failures if the process expects it. It is recommended to create this directory in the hotfix stage, as seen in the pull_integration_realcluster_test_next_gen pipeline. This follows defensive coding practices for CI/CD scripts.

                    sh '''#!/usr/bin/env bash
                        set -euxo pipefail

                        # Clean legacy cache/mirror URLs that are unstable on GCP workers.
                        for f in WORKSPACE DEPS.bzl; do
                          [ -f "$f" ] || continue
                          sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
                        done

                        # Avoid "check" targets re-writing legacy cache settings during migration replay.
                        sed -i 's/^check: check-bazel-prepare /check: /' Makefile || true

                        # Ensure the expected directory exists after mount point change in pod.yaml
                        mkdir -p /home/jenkins/.tidb/tmp
                    '''
References
  1. In CI/CD shell scripts, prioritize defensive coding with fallback protection over minor optimizations that might increase risk.

- name: bazel-out-overlay
  emptyDir: {}
- name: bazel-out-merged
  emptyDir: {}
Contributor


Severity: medium

For consistency with the pull_build_next_gen job and to ensure high-performance I/O for Bazel operations on GCP, consider using an ephemeral volume with hyperdisk-rwo instead of a standard emptyDir. This also avoids potential DiskPressure issues on the node's root disk.

      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 150Gi
            storageClassName: hyperdisk-rwo

Contributor Author


No

- name: bazel-out-overlay
  emptyDir: {}
- name: bazel-out-merged
  emptyDir: {}
Contributor


Severity: medium

Given the heavy Bazel usage in unit tests, using a standard emptyDir for bazel-out-merged may lead to performance bottlenecks or node disk pressure. It is recommended to use an ephemeral volume with hyperdisk-rwo for better performance and isolation, consistent with the pull_build_next_gen job.

    - name: bazel-out-merged
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 200Gi
            storageClassName: hyperdisk-rwo


@ti-chi-bot ti-chi-bot bot left a comment


I have done a preliminary review to help you improve this PR.

Summary

This PR migrates multiple CI/CD jobs for the TiDB NextGen pipeline from an internal infrastructure to GCP. The migration involves updating resource configurations, adapting pod specifications for Kubernetes, and modifying the pipeline scripts to account for the new environment (e.g., replacing legacy cache URLs and introducing hotfixes for Bazel dependencies). The approach is thorough, with extensive updates across pipeline scripts and pod configurations. However, several issues related to error handling, maintainability, and best practices need to be addressed to ensure the reliability and scalability of the pipeline.


Critical Issues

  1. Error Handling in Hotfix Scripts

    • File: pipeline.groovy across multiple jobs
    • Lines: Multiple (see sections with sh commands modifying Bazel configurations)
    • Issue: The Bash scripts modifying Bazel configurations assume the presence of certain files (WORKSPACE, DEPS.bzl, .bazelrc) and do not handle missing files gracefully. This could lead to pipeline failures during execution.
    • Suggested Solution: Add explicit checks for file existence and fail gracefully with informative error messages if critical files are missing. Example:
      if [ ! -f "$f" ]; then
          echo "Error: $f not found. Ensure the file exists before proceeding."
          exit 1
      fi
  2. Potential Security Issue with sed Usage

    • File: Multiple pipeline scripts
    • Lines: Sections modifying Bazel configurations (e.g., .bazelrc, Makefile)
    • Issue: The sed commands are modifying critical configuration files without verifying the correctness of input parameters or changes. This could inadvertently introduce incorrect configurations.
    • Suggested Solution: Validate inputs before making changes and log the modifications for traceability. Additionally, consider using tools like git diff to verify changes after the script execution.
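A minimal sketch of the validate-and-log pattern suggested here (the WORKSPACE content below is a synthetic fixture, not taken from the repository):

```shell
set -euo pipefail
workdir=$(mktemp -d)

# Synthetic fixture standing in for the real WORKSPACE file.
printf 'url = "https://bazel-cache.pingcap.net:8080/x"\nkeep_this_line = true\n' > "$workdir/WORKSPACE"

# Keep a copy, apply the edit, then log exactly what changed for traceability.
cp "$workdir/WORKSPACE" "$workdir/WORKSPACE.orig"
sed -i -E '/bazel-cache[.]pingcap[.]net:8080/d' "$workdir/WORKSPACE"
diff -u "$workdir/WORKSPACE.orig" "$workdir/WORKSPACE" || true  # diff exits 1 when files differ
```

In a git checkout the same effect can be had with `git diff` after the script runs, as the review notes.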

Code Improvements

  1. Hardcoded Resource Limits in Pod Configurations

    • File: Multiple pod.yaml files
    • Lines: Resource limits and requests (e.g., memory, CPU, ephemeral storage configurations)
    • Issue: Resource configurations are hardcoded and may not adapt well to varying workloads or future scaling needs.
    • Suggested Solution: Use environment variables or configuration files to dynamically set resource limits based on workload requirements.
  2. Unnecessary Code Duplication in Hotfix Logic

    • File: Multiple pipeline scripts
    • Lines: Hotfix sections for Bazel dependencies across jobs
    • Issue: The same logic for cleaning legacy cache URLs and disabling remote cache usage is repeated across multiple jobs.
    • Suggested Solution: Extract common hotfix logic into a shared script or function and call it from the pipeline scripts. Example:
      # shared_hotfix.sh
      clean_bazel_cache() {
          # Hotfix logic
      }
  3. Lack of Logging for Hotfixes

    • File: pipeline.groovy across multiple jobs
    • Lines: Sections modifying Bazel configurations
    • Issue: The hotfix scripts do not log their actions, making debugging difficult in case of failures.
    • Suggested Solution: Add logging statements to capture the actions performed, e.g., echo "Modified WORKSPACE file: removed cache URLs".
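A sketch of such logging, using an invented `log` helper that is not part of the existing pipelines; the Makefile content is a fixture:

```shell
set -euo pipefail
workdir=$(mktemp -d)
log() { echo "[hotfix] $*" >&2; }   # hypothetical helper; the prefix is arbitrary

# Fixture standing in for the repo's Makefile.
printf 'check: check-bazel-prepare unit-test\n' > "$workdir/Makefile"

log "patching Makefile: dropping check-bazel-prepare from the check target"
sed -i 's/^check: check-bazel-prepare /check: /' "$workdir/Makefile" \
  || log "Makefile missing or already patched; continuing"
log "check target is now: $(grep '^check:' "$workdir/Makefile")"
```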

Best Practices

  1. Environment Variable Management

    • File: Multiple pod.yaml files
    • Lines: Sections introducing GOPROXY and other environment variables
    • Issue: Hardcoded environment variables make it difficult to manage different environments (e.g., staging vs. production).
    • Suggested Solution: Use Kubernetes ConfigMaps or Secrets to manage environment variables dynamically.
  2. Testing Coverage for Hotfixes

    • File: Multiple pipeline scripts
    • Lines: Sections introducing hotfix stages
    • Issue: There is no mention of automated testing for the hotfix logic.
    • Suggested Solution: Add unit tests or integration tests to validate the hotfix scripts in isolation before they are applied to the pipeline.
  3. Documentation for Migration Changes

    • File: N/A (general feedback)
    • Issue: The PR description is sparse and does not explain the rationale behind major changes (e.g., why certain cache URLs were removed or why ephemeral storage was increased).
    • Suggested Solution: Update the PR description to provide detailed explanations for significant changes, including their impact and expected benefits.
  4. Naming Conventions for Pod Labels

    • File: Multiple pod.yaml files
    • Lines: Sections adding pod labels
    • Issue: Labels like master: "1" are unclear and may not provide meaningful context.
    • Suggested Solution: Use descriptive labels that clarify their purpose, e.g., role: "master-pipeline".
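Returning to the hotfix-testing point (item 2 above), a standalone smoke test could exercise the cleanup sed against a fixture rather than a real checkout; the fixture lines below are invented:

```shell
set -euo pipefail
fixture=$(mktemp)

# Fixture: one legacy cache URL that must be removed, one real URL that must survive.
printf '%s\n' \
  'http_archive(url = "https://bazel-cache.pingcap.net:8080/a.tar.gz")' \
  'http_archive(url = "https://github.com/real/dep/a.tar.gz")' > "$fixture"

sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$fixture"

if grep -q 'bazel-cache' "$fixture"; then
  echo "hotfix regression: legacy cache URL survived" >&2
  exit 1
fi
echo "hotfix smoke test passed"
```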

Conclusion

While the migration to GCP is comprehensive, the PR can benefit from improved error handling, reduction of code duplication, and adherence to best practices for maintainability and scalability. Addressing these issues will enhance the robustness and reliability of the pipeline in the new environment.


@ti-chi-bot ti-chi-bot bot left a comment


I have done a preliminary review to help you improve this PR.

Summary:

This PR migrates the TiDB "next-gen" pipeline jobs to Google Cloud Platform (GCP). It introduces changes to the pipeline configurations, pod templates, and associated scripts to update cache URLs, adjust resource allocation, and configure authentication for Docker and Bazel. Overall, the changes are substantial but coherent, demonstrating a clear intention to adapt the pipeline to the new environment. However, some areas need improvement for better maintainability, robustness, and adherence to best practices.


Critical Issues:

  1. Hardcoded Cache URL Cleanup in Multiple Scripts:

    • File: pipeline.groovy across several jobs (e.g., pull_build_next_gen, pull_unit_test_next_gen, etc.)
    • Issue: URLs for cleaning legacy cache settings are hardcoded, making them prone to errors and difficult to manage if changes are needed across multiple files.
    • Suggested Solution: Extract these URLs into a configuration file or centralized environment variables to avoid duplication and simplify maintenance.
      def legacyCacheUrls = [
          'bazel-cache[.]pingcap[.]net:8080',
          'ats[.]apps[.]svc',
          'cache[.]hawkingrei[.]com',
          'mirror[.]bazel[.]build'
      ]
      // Dots are bracket-escaped so they match literally in the sed regex.
      sh "for f in WORKSPACE DEPS.bzl; do [ -f \"\$f\" ] && sed -i -E '/${legacyCacheUrls.join('|')}/d' \"\$f\" || true; done"
  2. Ephemeral Volumes Misconfigured:

    • File: pod.yaml across several jobs (e.g., pull_mysql_test_next_gen, pull_unit_test_next_gen)
    • Issue: Ephemeral volumes are used in place of persistent storage for Bazel data and repository cache, which might lead to data loss between job runs, increasing build times and inefficiency.
    • Suggested Solution: Use persistent volumes for Bazel data to ensure cache data persists across job runs.
      volumes:
        - name: bazel-data
          persistentVolumeClaim:
            claimName: bazel-data-pvc

Code Improvements:

  1. Unnecessary Bash Script Logic Duplication:

    • File: Various pipeline.groovy files (e.g., pull_build_next_gen, pull_unit_test_next_gen)
    • Issue: Temporary hotfix scripts for Bazel cache settings are repeated across multiple files.
    • Suggested Solution: Extract common logic into a shared library function or script, and call it from each pipeline.
      def cleanBazelCache(repoDir) {
          dir(repoDir) {
              // Triple-single quotes keep Groovy from interpolating the shell's $f.
              sh '''#!/usr/bin/env bash
              set -euxo pipefail
              for f in WORKSPACE DEPS.bzl; do
                [ -f "$f" ] || continue
                sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
              done
              '''
          }
      }
      cleanBazelCache(REFS.repo)
  2. Resource Allocation Adjustments:

    • File: Pod templates across various jobs (e.g., pull_unit_test_next_gen, pull_mysql_client_test_next_gen)
    • Issue: Resource requests and limits are unevenly configured across jobs, potentially leading to inefficiencies or over-allocation in GCP environments.
    • Suggested Solution: Standardize resource allocation across jobs based on observed usage patterns or benchmarks.
      resources:
        requests:
          memory: 8Gi
          cpu: "4"
        limits:
          memory: 16Gi
          cpu: "8"

Best Practices:

  1. Missing Documentation for Temporary Hotfix Logic:

    • File: Scripts and pipeline.groovy files
    • Issue: Temporary migration scripts (e.g., cache URL cleanup) lack comments explaining their purpose and timeline for removal.
    • Suggested Solution: Add comments to clarify the temporary nature of these changes and plan for their removal.
      # Temporary cleanup of legacy cache URLs for GCP migration.
      # TODO: Remove this script after full migration is complete.
      sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
  2. Testing Coverage for Migration Changes:

    • Issue: The PR description lists job success/failure but lacks details about testing coverage for edge cases (e.g., handling missing cache files, ephemeral storage constraints).
    • Suggested Solution: Add dedicated tests or validation scripts to verify the migration is robust against edge cases, such as missing dependencies or storage issues.
  3. Environment Variable Naming Consistency:

    • File: Pod templates (pod.yaml)
    • Issue: Environment variable names like GOPROXY are consistent, but the values should be centralized across jobs to avoid misalignment.
    • Suggested Solution: Define shared environment variables in a global configuration and reference them in pod templates.
      env:
        - name: GOPROXY
          valueFrom:
            configMapKeyRef:
              name: global-config
              key: go-proxy

Summary of Suggested Changes:

  • Refactor hardcoded URLs and duplicated logic into centralized scripts or configuration.
  • Adjust resource allocation for efficiency and maintain consistency across jobs.
  • Use persistent storage for Bazel cache volumes to improve build reliability.
  • Add documentation and tests for temporary changes and edge cases.
  • Centralize environment variable management to avoid misalignment across job configurations.


@ti-chi-bot ti-chi-bot bot left a comment


I have done a preliminary review to help you improve this PR.

Summary

This PR migrates the TiDB next-gen job pipelines to Google Cloud Platform (GCP), replacing legacy configurations with GCP-compatible settings. Changes include updating container images, modifying pipeline scripts to remove unstable cache URLs, and adapting volumes and resources for GCP infrastructure. The approach is largely straightforward, but introduces temporary hotfixes for migration replay, adjusts Kubernetes pod definitions, and updates CI pipeline configurations. Overall, the migration is functional but contains areas that could benefit from cleanup and optimization.


Critical Issues

  1. Temporary Hotfixes Persist in Production Code

    • Files: pipeline.groovy across multiple pipelines (e.g., pull_build_next_gen, pull_integration_realcluster_test_next_gen, etc.)
    • Issue: Temporary hotfixes, such as modifying Makefile and .bazelrc files, are embedded directly in the pipelines. This creates long-term maintenance risks and makes it harder to clean up later.
    • Suggestion:
      Extract hotfix logic into clearly versioned scripts or ensure these changes are gated with feature flags. Provide clear documentation for cleanup timelines.
      if (MIGRATION_REPLAY) {
          sh './scripts/hotfix_bazel_cache.sh'
      }
  2. Potential Resource Waste in Ephemeral Volumes

    • Files: pod.yaml across multiple pipelines
    • Issue: Ephemeral volumes are configured with large storageClassName: hyperdisk-rwo sizes (e.g., 150Gi, 200Gi), but it's unclear whether this is necessary for all jobs.
    • Suggestion: Audit actual disk usage during pipeline runs and scale down volume sizes based on real needs. Over-provisioning may increase costs unnecessarily.

Code Improvements

  1. Code Duplication Across Pipeline Scripts

    • Files: All pipeline.groovy files (e.g., pull_build_next_gen, pull_unit_test_next_gen, etc.)
    • Issue: Similar logic for hotfixes and preparation steps is duplicated across multiple scripts.
    • Suggestion: Refactor common pipeline logic into reusable shared functions or libraries to improve maintainability.
      def applyHotfixes(repoDir) {
          dir(repoDir) {
              sh '''
                  # Hotfix logic here
              '''
          }
      }
  2. Hardcoded Resource Requests and Limits

    • Files: pod.yaml across pipelines
    • Issue: Resource requests and limits (e.g., cpu, memory) are hardcoded and may not adapt well to future workloads or environments.
    • Suggestion: Parameterize resource configurations using environment variables or pipeline inputs to simplify tuning.
      resources:
        requests:
          # Expanded by a shell templating step before kubectl apply;
          # Kubernetes does not substitute these defaults itself.
          memory: "${MEMORY_REQUEST:-24Gi}"
          cpu: "${CPU_REQUEST:-6}"

Best Practices

  1. Missing Documentation for Hotfixes

    • Files: Pipeline scripts and CI configurations (pipeline.groovy, pod.yaml)
    • Issue: Temporary hotfixes lack inline comments or documentation explaining their purpose and removal timelines.
    • Suggestion: Add comments with clear context and timeline for cleanup.
      // TODO: Remove this hotfix after migration is stable (expected Q3 2026).
  2. Testing Coverage for Migration Replay

    • Files: CI pipeline configurations (latest-presubmits-next-gen.yaml)
    • Issue: The pull_mysql_test_next_gen job has optional: true, which may make it harder to detect issues during migration replay.
    • Suggestion: Temporarily make critical replay jobs mandatory or add additional validation steps to ensure migration correctness.
  3. Style Inconsistencies in Shell Scripts

    • Files: Embedded shell scripts in pipeline.groovy
    • Issue: Some scripts use inconsistent formatting, such as mixing # comments with inline comments or inconsistent error handling (set -euxo pipefail vs. || true).
    • Suggestion: Standardize shell script formatting and error handling for readability and robustness.
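One possible standard prologue, sketched with an invented `may_fail` wrapper as a replacement for bare `|| true`:

```shell
#!/usr/bin/env bash
# Standard prologue for embedded CI scripts: fail fast, trace every command.
set -euxo pipefail

# Convention (hypothetical): commands that are *allowed* to fail are wrapped,
# so a bare `|| true` never silently hides an unexpected error.
may_fail() { "$@" || echo "non-fatal: '$*' failed" >&2; }

may_fail false                      # tolerated failure, visibly logged
echo "script continues after tolerated failure"
```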

Conclusion

While the migration is functional, there are structural and resource optimization opportunities to improve maintainability, scalability, and alignment with best practices. Addressing the critical issues and refactoring common logic will reduce technical debt and ensure a smoother transition to GCP.


@ti-chi-bot ti-chi-bot bot left a comment


I have done a preliminary review to help you improve this PR.

Summary

This pull request migrates several CI pipeline jobs for the TiDB project to Google Cloud Platform (GCP). The approach involves updating Kubernetes pod specifications, pipeline scripts, and dependencies to align with GCP infrastructure requirements. While the changes are comprehensive and mostly functional, there are notable issues with maintainability, error handling, and adherence to best practices. Additional refinement is needed to improve long-term sustainability and robustness.


Critical Issues

  1. Hardcoded Hotfix Commands:

    • Files: pipeline.groovy across multiple jobs (e.g., pull_build_next_gen, pull_integration_realcluster_test_next_gen, etc.).
    • Lines: Various sed commands modifying WORKSPACE, DEPS.bzl, .bazelrc, and Makefile.
    • Issue: These hotfixes are hardcoded and temporary. If the underlying dependencies or files change, these commands may break or fail silently.
    • Solution: Abstract these commands into reusable functions or scripts stored in version control. Validate changes using tests to ensure the integrity of modifications.
    steps {
        sh '''
            ./scripts/fix_bazel_deps.sh
        '''
    }

    This method centralizes the logic and makes it easier to update or debug.

  2. Volume Management:

    • Files: Various pod.yaml files (e.g., pull_build_next_gen/pod.yaml, pull_integration_realcluster_test_next_gen/pod.yaml).
    • Issue: Persistent volumes are replaced with ephemeral volumes. While this simplifies the setup, it may lead to data loss during container restarts and debugging complexities.
    • Solution: Evaluate whether ephemeral volumes are appropriate for all use cases. Consider using persistent volumes with proper clean-up mechanisms for critical data.

Code Improvements

  1. Environment Variable Configuration:

    • Files: Multiple pod.yaml files.
    • Issue: The GOPROXY environment variable is set to a specific proxy. This configuration assumes all environments will have access to the specified proxy.
    • Solution: Add fallback values and document this dependency. For example:
      env:
        - name: GOPROXY
          # Note: this shell-style default is expanded by a templating step,
          # not by Kubernetes; a plain manifest needs a literal value.
          value: "${GOPROXY:-https://proxy.golang.org,direct}"
  2. Resource Requests and Limits:

    • Files: Multiple pod.yaml files.
    • Issue: Resource allocations (CPU, memory, storage) are significantly increased without justification. This may cause inefficiencies in resource utilization.
    • Solution: Provide documentation or comments explaining why these values were chosen. Consider profiling the pipeline jobs to optimize resource requests.
  3. Redundant Code:

    • Files: Multiple pipeline.groovy files (e.g., pull_integration_realcluster_test_next_gen, pull_unit_test_next_gen).
    • Issue: Repeated logic for modifying Bazel dependencies and configurations.
    • Solution: Refactor common logic into shared scripts or Groovy libraries. For example:
      def fixBazelDeps(repoDir) {
          sh """
              cd ${repoDir}
              ./scripts/fix_bazel_deps.sh
          """
      }

Best Practices

  1. Testing Coverage:

    • Issue: No evidence of tests specifically validating the migration to GCP.
    • Solution: Add integration tests to ensure pipelines work as expected in GCP. Simulate GCP environments during test runs to verify compatibility.
  2. Documentation:

    • Issue: The PR lacks detailed documentation on the reason for migration and the implications of changes (e.g., storage class updates, resource limits).
    • Solution: Update the PR description with detailed explanations. For example:
      • Why are ephemeral volumes used?
      • Why are resource requests increased?
      • Why are certain dependencies removed from Bazel configurations?
  3. Style Consistency:

    • Issue: Inconsistent commenting and inline explanations across scripts.
    • Solution: Ensure all scripts follow a consistent style guide. Use comments to explain commands that modify critical files, especially sed commands.
  4. Naming Conventions:

    • Files: Groovy and YAML files.
    • Issue: Variable names like REFS.repo and POD_TEMPLATE_FILE are unclear without context.
    • Solution: Use more descriptive names, e.g., REPO_DIRECTORY or KUBERNETES_POD_TEMPLATE.

Suggested Follow-Up

  1. Refactor Scripts:
    Consolidate repeated logic into shared utilities or libraries to improve maintainability.

  2. Validation Pipeline:
    Create a dedicated pipeline to validate changes in the migration process, especially for Bazel dependencies.

  3. Performance Benchmarking:
    Profile pipelines in GCP to verify whether resource allocations are optimal.

  4. Final Cleanup:
    Plan for removing temporary hotfixes once the migration stabilizes and ensure long-term compatibility.


@ti-chi-bot ti-chi-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary

This pull request migrates several TiDB NextGen CI jobs to Google Cloud Platform (GCP), introducing updates to pipeline configurations, container specifications, and temporary fixes for legacy dependencies in Bazel files. The approach primarily involves modifying pipeline Groovy scripts, updating Kubernetes pod configurations, and addressing compatibility issues with external cache/mirror URLs. While the migration addresses key compatibility and resource allocation concerns, some areas require clarification, refactoring, and adherence to best practices.


Critical Issues

  1. Bazel Cache Configuration Hardcoding

    • File: pipelines/pingcap/tidb/latest/pull_build_next_gen/pipeline.groovy and multiple pipeline scripts.
    • Issue: Temporary fixes like hardcoded removal of cache URLs (sed commands) and disabling remote cache in .bazelrc are fragile and may lead to long-term maintainability issues.
    • Suggested Solution: Introduce a centralized configuration management system or environment variable to handle cache/mirror URLs dynamically. This ensures scalability and reduces hardcoding across multiple scripts.
    // Dots are pre-escaped as [.] so sed treats them literally in the regex.
    def bazelCacheURLs = [
        "bazel-cache[.]pingcap[.]net:8080",
        "ats[.]apps[.]svc",
        "cache[.]hawkingrei[.]com",
        "mirror[.]bazel[.]build"
    ]
    sh """
        for f in WORKSPACE DEPS.bzl; do
          if [ -f "\$f" ]; then
            sed -i -E '/${bazelCacheURLs.join('|')}/d' "\$f"
          fi
        done
    """
  2. Ephemeral Volume Overprovisioning Risks

    • File: Multiple pod YAML files (pod.yaml).
    • Issue: Volumes such as bazel-out-merged are allocated with large ephemeral storage (200Gi), which might strain GCP resources and increase costs.
    • Suggested Solution: Perform a detailed analysis of actual storage needs and apply quotas based on job requirements. Consider monitoring usage to adjust limits dynamically.
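The monitoring suggestion above can be sketched as a small probe run inside the build container. This is illustrative only: the default cache path and the 200Gi figure are assumptions taken from the review text, not from the actual pod specs.

```shell
#!/usr/bin/env bash
# Sketch: measure how much of an ephemeral-storage allocation a cache
# directory actually uses, so limits can be tuned from real numbers
# instead of guesses. Default path and limit are illustrative assumptions.
set -euo pipefail

probe_cache_usage() {
  local cache_dir="${1:-${TMPDIR:-/tmp}/cache-probe}"
  local limit_gi="${2:-200}"
  local used_kib limit_kib pct
  mkdir -p "$cache_dir"
  used_kib=$(du -sk "$cache_dir" | awk '{print $1}')
  limit_kib=$((limit_gi * 1024 * 1024))
  pct=$((used_kib * 100 / limit_kib))
  echo "cache usage: ${used_kib}KiB of ${limit_gi}Gi (${pct}%)"
  if [ "$pct" -ge 80 ]; then
    echo "WARNING: above 80% of the ephemeral-storage limit" >&2
  fi
}

probe_cache_usage   # with no args, probes the illustrative default path
```

Logging this once per job over a few weeks would give the historical data the review asks for before shrinking the 200Gi requests.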

Code Improvements

  1. Repeated Temporary Fixes Across Pipelines

    • File: Multiple pipeline scripts (pull_build_next_gen/pipeline.groovy, pull_integration_realcluster_test_next_gen/pipeline.groovy, etc.).
    • Issue: Hotfixes for Bazel cache URLs and Makefile modifications are duplicated across several scripts.
    • Suggested Solution: Encapsulate common fixes into a reusable function or shared library, reducing duplication. Example:
    def applyTemporaryBazelFixes(repoPath) {
        // repoPath is interpolated by Groovy; \$f stays a shell variable.
        sh """
            for f in WORKSPACE DEPS.bzl; do
              if [ -f "${repoPath}/\$f" ]; then
                sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "${repoPath}/\$f"
              fi
            done
            sed -i 's/^check: check-bazel-prepare /check: /' "${repoPath}/Makefile" || true
        """
    }
  2. Resource Requests and Limits

    • File: Pod YAML files (pull_mysql_client_test_next_gen/pod.yaml, pull_unit_test_next_gen/pod.yaml, etc.).
    • Issue: Resource requests and limits (e.g., CPU and memory) vary widely without clear justification.
    • Suggested Solution: Align resource allocation with job requirements using benchmarks and adjust dynamically based on historical data. Avoid over-allocation by defining reasonable baselines.

Best Practices

  1. Testing and Validation of Bazel Fixes

    • File: Multiple pipeline scripts.
    • Issue: Temporary Bazel fixes are applied without automated validation to ensure they don't inadvertently break functionality.
    • Suggested Solution: Add unit tests or validation scripts to verify the integrity of Bazel files after modifications.
    # Example validation script (run after the cleanup step)
    if grep -qE 'bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build' WORKSPACE DEPS.bzl 2>/dev/null; then
        echo "Error: Legacy URLs still exist in Bazel files." >&2
        exit 1
    fi
  2. Documentation for Temporary Fixes

    • File: Pipeline scripts (pipeline.groovy).
    • Issue: Temporary fixes lack inline comments or documentation explaining their rationale and timeframe for removal.
    • Suggested Solution: Add comments specifying why these fixes are temporary and when they should be revisited.
    // Temporary Bazel hotfix: Remove unstable cache URLs (to be removed after migration completes).
    sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' WORKSPACE
  3. Consistent Naming and Labels

    • File: prow-jobs/pingcap/tidb/latest-presubmits-next-gen.yaml
    • Issue: New labels (e.g., master: "1") are introduced without clear explanation or consistency across jobs.
    • Suggested Solution: Standardize label usage and document their purpose in YAML files or related documentation.

By addressing these issues, the migration can achieve better maintainability, scalability, and alignment with best practices.

Review comment on a pod.yaml hunk:

# value: /share/.go
# - name: GOCACHE
# value: /share/.cache/go-build
ephemeral-storage: 150Gi
Contributor

Will it make the pod unschedulable?

@ti-chi-bot ti-chi-bot bot left a comment

Summary

This PR migrates multiple CI/CD jobs for the TiDB project to run on Google Cloud Platform (GCP). It includes updates to pipelines, pod configurations, and environment settings for compatibility with GCP infrastructure. The approach primarily involves replacing legacy configurations, updating resource allocations, and cleaning up cache-related paths. The changes are generally well-organized, but there are areas for improvement, particularly in error handling, performance considerations, and adherence to best practices.


Critical Issues

  1. Hardcoded Cache Cleanup Without Validation (Multiple Groovy Files)

    • Files: pull_build_next_gen/pipeline.groovy, pull_integration_realcluster_test_next_gen/pipeline.groovy, pull_unit_test_ddlv1.groovy, pull_unit_test_next_gen/pipeline.groovy
    • Issue: The sed commands for cleaning up legacy cache URLs and modifying Makefiles assume all files exist and are writable. If these assumptions are incorrect, the pipeline could silently fail or produce unexpected behavior.
    • Suggestion: Add existence checks before attempting file modifications and log errors explicitly. Example:
      sh '''#!/usr/bin/env bash
      for file in WORKSPACE DEPS.bzl Makefile; do
          if [ -f "$file" ]; then
              sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$file"
          else
              echo "Warning: $file not found or inaccessible" >&2
          fi
      done
      '''
  2. Resource Allocation Inconsistencies (Pod Configurations)

    • Files: Multiple pod.yaml files, e.g., pull_build_next_gen/pod.yaml, pull_integration_realcluster_test_next_gen/pod.yaml
    • Issue: Resource requests and limits (e.g., memory, cpu) are inconsistent across jobs without justification. For instance, some pods request 24Gi memory while others request up to 48Gi without clear reasoning. Overprovisioning could lead to inefficiencies.
    • Suggestion: Standardize resource allocation and document any deviations. If different jobs have varying requirements, add comments explaining the rationale.
  3. Ephemeral Volumes Without Backup Strategy

    • Files: Multiple pod.yaml files, e.g., pull_build_next_gen/pod.yaml, pull_unit_test_next_gen/pod.yaml
    • Issue: Ephemeral volumes (emptyDir) are used for critical caching purposes without fallback mechanisms. If pod restarts occur, cached data will be lost, potentially slowing down builds/tests.
    • Suggestion: Use persistent volumes or implement fallback logic to rehydrate caches after pod restarts.
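A minimal fallback sketch for the last point, assuming the snapshot is a tarball staged somewhere durable beforehand (on GCP that would likely be an object fetched from a GCS bucket; the paths and names here are illustrative assumptions):

```shell
#!/usr/bin/env bash
# Sketch: rehydrate an empty ephemeral cache directory from a snapshot
# tarball after a pod restart. The snapshot location is an assumption; in
# GCP it could be downloaded from a bucket before calling this function.
set -euo pipefail

rehydrate_cache() {
  local cache_dir="$1" snapshot="$2"
  mkdir -p "$cache_dir"
  if [ -n "$(ls -A "$cache_dir")" ]; then
    echo "cache already warm: $cache_dir"
  elif [ -f "$snapshot" ]; then
    tar -xzf "$snapshot" -C "$cache_dir"
    echo "cache rehydrated from $snapshot"
  else
    echo "cold start: no snapshot at $snapshot" >&2
  fi
}

# Demo: cold start against a snapshot that does not exist.
rehydrate_cache "$(mktemp -d)/cache" "/nonexistent/snapshot.tgz"
```

The cold-start branch deliberately succeeds so a missing snapshot degrades to a slow build rather than a failed one.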

Code Improvements

  1. Temporary Hotfix Logic Should Be Modularized

    • Files: Multiple Groovy files (e.g., pull_integration_realcluster_test_next_gen/pipeline.groovy).
    • Issue: The hotfix logic for cleaning cache URLs and modifying Makefiles is duplicated across several pipeline files. This leads to code duplication and maintenance challenges.
    • Suggestion: Refactor the hotfix logic into a reusable shared library or function. Example:
      void cleanLegacyCache(String repoDir) {
          dir(repoDir) {
              sh '''#!/usr/bin/env bash
              for file in WORKSPACE DEPS.bzl Makefile; do
                  if [ -f "$file" ]; then
                      sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$file"
                  fi
              done
              '''
          }
      }
  2. Redundant Volume Definitions

    • Files: Multiple pod.yaml files (e.g., pull_mysql_test_next_gen/pod.yaml).
    • Issue: Persistent volumes (persistentVolumeClaim) are replaced by ephemeral volumes without clear reasoning in some cases. This creates redundancy and inconsistency.
    • Suggestion: Consolidate volume definitions and document why ephemeral volumes are preferred. If persistent volumes are retired, update related comments.

Best Practices

  1. Missing Error Handling for Bash Commands

    • Files: Multiple Groovy files (e.g., pull_unit_test_ddlv1.groovy, pull_integration_e2e_test_next_gen/pipeline.groovy).
    • Issue: Bash scripts (sh) lack robust error handling for commands that could fail silently.
    • Suggestion: Use set -o pipefail and add explicit error messages where applicable.
  2. Testing Coverage for Hotfix Logic

    • Files: All affected pipelines
    • Issue: The temporary hotfix logic is applied without mechanisms to verify its correctness or impact on the jobs.
    • Suggestion: Add a validation step or assertions after applying the hotfix logic to ensure changes are effective.
  3. Unclear Comments and Documentation

    • Files: Multiple Groovy files and pod.yaml files.
    • Issue: Comments explaining changes (e.g., resource allocation, cache modifications) are sparse or non-existent. This makes it harder for future developers to understand the rationale behind decisions.
    • Suggestion: Add comments wherever significant changes are made, especially for resource configurations and temporary hotfixes.
  4. Labels in Prow Job Configuration

    • Files: latest-presubmits-next-gen.yaml
    • Issue: Labels like master: "1" are added without explanation. Their purpose is unclear and should be documented.
    • Suggestion: Add comments describing the significance of these labels.
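The pipefail point raised in item 1 above can be demonstrated with a minimal sketch: without it, a failing producer in a pipeline is masked by a succeeding consumer, which is exactly how a `sed`/`grep` step inside a Jenkins `sh` block can fail silently.

```shell
#!/usr/bin/env bash
# Sketch: why `set -o pipefail` matters in Jenkins sh steps. Without it,
# a pipeline's exit status is the last command's, so a failing producer
# is silently masked by a succeeding consumer.

false | cat
echo "without pipefail: exit status $?"   # prints 0 — the failure is hidden

set -o pipefail
false | cat || status=$?
echo "with pipefail: exit status ${status}"   # prints 1
```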

Conclusion

While the migration is comprehensive, improvements in error handling, resource standardization, code modularity, and documentation will enhance maintainability and reliability. Addressing the critical issues around cache handling and resource allocation should be prioritized.


ti-chi-bot bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wuhuizuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the lgtm label Apr 7, 2026

ti-chi-bot bot commented Apr 7, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-07 10:46:36.80916578 +0000 UTC m=+866802.014525837: ☑️ agreed by wuhuizuo.
  • 2026-04-07 10:56:57.246211107 +0000 UTC m=+867422.451571154: ✖️🔁 reset by dillon-zheng.


ti-chi-bot bot commented Apr 7, 2026

New changes are detected. LGTM label has been removed.


ti-chi-bot bot commented Apr 7, 2026

@dillon-zheng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-replay-jenkins-pipelines ae138a8 link false /test pull-replay-jenkins-pipelines

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot left a comment

Summary

This PR migrates various CI jobs for the TiDB "next-gen" pipeline to Google Cloud Platform (GCP) by updating configuration files, pod specifications, and pipeline scripts. The approach includes replacing outdated cache URLs, modifying resource requests/limits, transitioning to GCP-specific container registries, and adding ephemeral volumes. While the migration is well-structured, there are several opportunities to improve maintainability, consistency, and error handling.


Critical Issues

  1. Unconditional Cache Removal

    • File: Multiple pipeline .groovy files (e.g., pull_build_next_gen/pipeline.groovy, pull_unit_test_ddlv1.groovy)
    • Line(s): Examples: 40, 51, 80 (across multiple files)
    • Issue: The script unconditionally removes legacy URLs (e.g., bazel-cache.pingcap.net) without checking if other valid URLs are present. This could cause unintended disruptions if GCP mirrors fail or are misconfigured.
    • Solution: Add validation before removal. Example:
      if grep -q 'bazel-cache[.]pingcap[.]net' "$file"; then
          sed -i -E '/bazel-cache[.]pingcap[.]net/d' "$file"
      fi
      Also, consider logging the URLs removed for debugging purposes.
  2. Hardcoded GOPROXY Value

    • File: Multiple pod yaml files (e.g., pull_build_next_gen/pod.yaml, pull_unit_test_next_gen/pod.yaml)
    • Line(s): Examples: Lines including GOPROXY env variable
    • Issue: The hardcoded GOPROXY URL does not fallback to default or other mirrors if unavailable. This could block builds in case of network issues.
    • Solution: Add a fallback mechanism, e.g., value: "https://us-go.pkg.dev/...|https://proxy.golang.org|https://goproxy.cn,direct" (in GOPROXY, "|" falls back to the next entry on any error, while "," falls back only on 404/410 responses, so "direct" should generally come last).
  3. Missing Error Handling for Ephemeral Volume Creation

    • File: Pod specification files (e.g., pod.yaml for pull_unit_test_next_gen)
    • Line(s): Examples: volumeClaimTemplate sections
    • Issue: If ephemeral volumes fail to initialize (e.g., incorrect storageClassName), there is no retry or fallback mechanism.
    • Solution: Add validation or retries for volume provisioning during initialization by monitoring pod readiness in the pipeline.
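The validate-and-log idea from point 1 above can be folded into a single helper. The URL pattern is taken verbatim from the review; GNU sed's in-place flag is assumed, which holds for the Linux CI images:

```shell
#!/usr/bin/env bash
# Sketch: log exactly which legacy cache URL lines are removed from a
# file, leaving an audit trail in the build log. Assumes GNU sed (-i).
set -euo pipefail

clean_legacy_urls() {
  local file="$1"
  local pattern='bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build'
  if [ ! -f "$file" ]; then
    echo "skip: $file not found" >&2
    return 0
  fi
  if grep -En "$pattern" "$file"; then   # prints the removed lines first
    sed -i -E "/$pattern/d" "$file"
    echo "removed legacy cache URL lines from $file"
  else
    echo "no legacy cache URLs in $file"
  fi
}
```

Because `grep -En` echoes the matching lines with line numbers before the deletion, a build log records precisely what was stripped from WORKSPACE or DEPS.bzl.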

Code Improvements

  1. Resource Allocation Optimization

    • File: Pod YAML files (e.g., pull_build_next_gen/pod.yaml, pull_unit_test_next_gen/pod.yaml)
    • Line(s): Examples: Memory and CPU requests/limits
    • Issue: CPU and memory limits are inconsistently allocated across jobs (e.g., 16 CPUs for golang in one job vs 6 CPUs in another). This may lead to inefficiency or underutilized resources.
    • Solution: Define consistent resource profiles based on expected workloads across jobs. For example:
      resources:
        requests:
          memory: 16Gi
          cpu: "8"
        limits:
          memory: 32Gi
          cpu: "16"
  2. Temporary Hotfix Logic

    • File: Multiple pipeline .groovy files (e.g., pull_integration_realcluster_test_next_gen/pipeline.groovy)
    • Line(s): Examples: Stages named Hotfix bazel deps/cache
    • Issue: Temporary hotfix logic (e.g., disabling remote cache) introduces technical debt and could disrupt long-term stability if not properly tracked.
    • Solution: Document the rationale for these hotfixes in comments and set up an automated cleanup task post-migration.
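One lightweight way to keep such hotfixes tracked, sketched below with a placeholder removal date: make the hotfix stage fail loudly once a chosen date has passed, so it cannot silently outlive the migration.

```shell
#!/usr/bin/env bash
# Sketch: a self-expiring guard for temporary hotfix stages. Once the
# removal date passes, the stage fails loudly instead of lingering.
# The date here is a placeholder assumption.
set -euo pipefail

check_hotfix_deadline() {
  local today="$1" remove_by="$2"
  # ISO dates (YYYY-MM-DD) compare correctly as strings.
  if [[ "$today" > "$remove_by" ]]; then
    echo "Error: temporary bazel hotfix past its removal date ($remove_by)" >&2
    return 1
  fi
  echo "hotfix still within its window (until $remove_by)"
}

check_hotfix_deadline "$(date +%F)" "2099-01-01"
```

Calling this at the top of the hotfix stage turns an untracked workaround into one that the pipeline itself nags about.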

Best Practices

  1. Documentation Needs

    • File: All pipeline .groovy files
    • Issue: The Hotfix bazel deps/cache and similar stages lack comments explaining why the changes are necessary. This makes it harder for future maintainers to understand the context.
    • Solution: Add comments:
      // Temporary workaround for instability in legacy cache URLs during GCP migration.
      // Remove this stage after migration completion.
  2. Testing Coverage for Migration

    • File: Multiple pipeline .groovy files
    • Issue: There is no evidence of specific tests for validating the GCP migration (e.g., verifying ephemeral volumes, cache handling, or resource limits).
    • Solution: Add integration tests to verify resource provisioning and cache integrity:
      stage('Validation') {
          steps {
              sh '''
              kubectl get pod -n $K8S_NAMESPACE | grep <pod-name>
              ls -l /home/jenkins/.tidb/tmp
              '''
          }
      }
  3. Consistent Naming Conventions

    • File: latest-presubmits-next-gen.yaml
    • Line(s): Examples: Labels like master: "1"
    • Issue: Labels such as master: "1" are ambiguous and lack standardization.
    • Solution: Use descriptive labels, e.g., migration-phase: "completed".
  4. Redundant Code

    • File: Multiple pipeline .groovy files
    • Line(s): Examples: Script blocks for cache removal duplicated across files
    • Issue: Duplicated scripts for cache removal increase maintenance burden.
    • Solution: Refactor into a reusable function or shared script file.

By addressing these issues, the PR can improve its robustness, maintainability, and adherence to best practices.

@ti-chi-bot ti-chi-bot bot merged commit de4688c into PingCAP-QE:main Apr 7, 2026
6 of 7 checks passed

Projects

Status: ✅ Done


2 participants