feat(pipeline): migrate tidb nextgen job to gcp #4471
ti-chi-bot[bot] merged 7 commits into PingCAP-QE:main from
Conversation
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR migrates the NextGen CI pipeline jobs for TiDB to Google Cloud Platform (GCP). The migration involves updating pipeline scripts, pod configurations, and Jenkins job definitions to use GCP-specific settings. The approach includes replacing legacy URLs and volume definitions with GCP-compatible alternatives, and adding temporary fixes for cache issues and flaky tests. While the migration is functional, the PR contains areas for improvement in maintainability, error handling, and best practices.
Critical Issues
- **Hardcoded Token Handling in OCI Authentication**
  - File: `pull_integration_e2e_test_next_gen/pipeline.groovy`, lines 51-59.
  - Issue: The code retrieves an OAuth token from GCP metadata and pipes it directly into the `oras login` command. If token retrieval fails or returns an invalid value, authentication can misbehave without proper error handling.
  - Suggested Fix:
    ```bash
    token=$(curl -fsSL -H 'Metadata-Flavor: Google' \
      'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token' \
      | jq -r '.access_token') || { echo "Failed to retrieve OAuth token"; exit 1; }
    if [[ -n "$token" && "$token" != "null" ]]; then
      echo "$token" | oras login "$registry_host" -u oauth2accesstoken --password-stdin
    else
      echo "Invalid token retrieved"; exit 1
    fi
    ```
  - Impact: Prevents silent failures during OCI authentication.
- **Temporary Hotfix Logic**
  - Files: Various pipeline scripts (e.g., `pull_build_next_gen/pipeline.groovy`, `pull_integration_realcluster_test_next_gen/pipeline.groovy`).
  - Issue: Temporary fixes (e.g., `sed` commands modifying cache URLs and Bazel configurations) are scattered across multiple pipeline stages. They lack clear expiry mechanisms or documentation indicating when they should be removed.
  - Suggested Fix: Encapsulate these hotfixes in a centralized script with clear comments and a defined expiration mechanism. For example:
    ```bash
    # hotfix_cleanup.sh
    # Temporary fixes for GCP migration (to be removed by MM/YYYY).
    sed -i -E '/legacy-cache-urls/d' WORKSPACE DEPS.bzl
    sed -i 's/^check: check-bazel-prepare /check: /' Makefile || true
    ```
  - Impact: Improves maintainability and reduces redundancy.
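To make an expiry mechanism concrete, here is a minimal sketch (the deadline and messages are hypothetical, not from this PR) of a guard that makes a temporary hotfix fail loudly once its planned removal date has passed:

```shell
#!/usr/bin/env bash
# Hypothetical expiry guard for a temporary migration hotfix.
# ISO dates compare correctly as strings, so a lexicographic test suffices.
hotfix_expired() {
  local expiry="$1"            # planned removal date, YYYY-MM-DD
  local today
  today=$(date -u +%F)
  [[ "$today" > "$expiry" ]]
}

# Placeholder deadline; pick a real one when adopting this pattern.
if hotfix_expired "2099-12-31"; then
  echo "ERROR: GCP-migration hotfix is past its removal date; delete this stage." >&2
  exit 1
fi
echo "hotfix still active, applying temporary fixes"
```

Running the guard first in the hotfix stage means the pipeline itself reminds the team to clean up, instead of the fix lingering indefinitely.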
Code Improvements
- **Volume Definitions in Pod YAML**
  - Files: Multiple pod YAML files (e.g., `pull_build_next_gen/pod.yaml`, `pull_unit_test_next_gen/pod.yaml`).
  - Issue: Persistent volumes are replaced with ephemeral volumes without justification or a fallback if ephemeral storage is insufficient.
  - Suggested Fix: Add comments explaining the rationale for ephemeral storage and consider fallback mechanisms:
    ```yaml
    volumes:
      - name: bazel-out-merged
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 150Gi
              storageClassName: hyperdisk-rwo
    # Fallback mechanism: consider persistent volumes if ephemeral storage fails.
    ```
  - Impact: Ensures clarity and robustness of storage configurations.
- **Resource Allocation**
  - Files: Pod YAML files (e.g., `pull_unit_test_next_gen/pod.yaml`).
  - Issue: Memory and CPU limits appear arbitrary and are inconsistently scaled across jobs.
  - Suggested Fix: Benchmark the jobs to determine optimal resource allocation and document the findings in comments.
  - Impact: Prevents over-provisioning and resource waste in GCP.
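As one way to turn benchmark numbers into configuration, the sketch below converts an observed peak memory figure into request/limit values; the 20%/50% headroom factors are illustrative assumptions, not values from this PR:

```shell
#!/usr/bin/env bash
# Illustrative sizing helper: derive memory request/limit from a measured peak.
# The 20%/50% headroom factors are assumptions, not project policy.
suggest_memory() {
  local peak_mi="$1"                        # observed peak RSS in MiB
  local request=$(( peak_mi * 120 / 100 ))  # request = peak + 20% headroom
  local limit=$(( peak_mi * 150 / 100 ))    # limit = peak + 50% headroom
  echo "requests.memory=${request}Mi limits.memory=${limit}Mi"
}

suggest_memory 8192   # e.g. a unit-test job that peaked at 8 GiB
```

Recording the measured peak alongside the chosen values in a pod.yaml comment makes later re-tuning straightforward.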
Best Practices
- **Testing Coverage for Migration Logic**
  - Files: Pipeline scripts (e.g., `pull_mysql_client_test_next_gen/pipeline.groovy`).
  - Issue: No evidence of automated tests for the migration-specific changes (e.g., cache URL updates, OCI authentication).
  - Suggested Fix: Add unit tests or validation steps to ensure the migration logic works as intended. For example:
    ```groovy
    stage('Validate Migration Changes') {
        steps {
            sh '''
            grep -E 'https://us-go.pkg.dev/pingcap-testing-account/go-proxy-remote' WORKSPACE || { echo "Migration validation failed"; exit 1; }
            '''
        }
    }
    ```
  - Impact: Ensures reliability of migration changes.
- **Documentation**
  - Files: Multiple pipeline files.
  - Issue: Many changes lack comments explaining why they are necessary (e.g., cache URL updates, resource adjustments).
  - Suggested Fix: Add comments clarifying the purpose of changes, particularly temporary fixes and resource definitions. Example:
    ```bash
    # Temporary fix to clean legacy cache URLs for GCP migration.
    sed -i -E '/legacy-cache-urls/d' WORKSPACE DEPS.bzl
    ```
- **Consistency in Naming Conventions**
  - Files: e.g., `pull_mysql_test_next_gen/pipeline.groovy`.
  - Issue: Environment variables like `OCI_ARTIFACT_HOST` are inconsistently defined across pipeline files.
  - Suggested Fix: Standardize environment variable naming and initialization across all jobs.
Summary of Changes
This PR is functional but requires improvements in error handling, maintainability, and documentation. Addressing critical issues with authentication and temporary fixes should be prioritized.
Code Review
This pull request migrates several TiDB CI pipelines to GCP-hosted infrastructure. Key changes include updating container images to GCP registries, configuring GOPROXY, and optimizing resource allocations. A temporary "Hotfix" stage was added to various pipelines to clean up legacy Bazel cache URLs and adjust build settings. Additionally, volume configurations were simplified, transitioning bazel-rc from secrets to ConfigMaps and adopting ephemeral storage for Bazel outputs in some jobs. Feedback suggests ensuring the existence of specific subdirectories after mount point changes and extending the use of high-performance ephemeral storage (hyperdisk-rwo) to more pipelines for consistency and performance.
```bash
sh '''#!/usr/bin/env bash
set -euxo pipefail

# Clean legacy cache/mirror URLs that are unstable on GCP workers.
for f in WORKSPACE DEPS.bzl; do
  [ -f "$f" ] || continue
  sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
done

# Avoid "check" targets re-writing legacy cache settings during migration replay.
sed -i 's/^check: check-bazel-prepare /check: /' Makefile || true
'''
```
The mount point for bazel-out-merged was changed from /home/jenkins/.tidb/tmp to /home/jenkins/.tidb in the corresponding pod.yaml. Since the volume starts empty, the tmp subdirectory will not exist, which may cause build failures if the process expects it. It is recommended to create this directory in the hotfix stage, as seen in the pull_integration_realcluster_test_next_gen pipeline. This follows defensive coding practices for CI/CD scripts.
```bash
sh '''#!/usr/bin/env bash
set -euxo pipefail

# Clean legacy cache/mirror URLs that are unstable on GCP workers.
for f in WORKSPACE DEPS.bzl; do
  [ -f "$f" ] || continue
  sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
done

# Avoid "check" targets re-writing legacy cache settings during migration replay.
sed -i 's/^check: check-bazel-prepare /check: /' Makefile || true

# Ensure the expected directory exists after the mount point change in pod.yaml.
mkdir -p /home/jenkins/.tidb/tmp
'''
```
References
- In CI/CD shell scripts, prioritize defensive coding with fallback protection over minor optimizations that might increase risk.
```yaml
- name: bazel-out-overlay
  emptyDir: {}
- name: bazel-out-merged
  emptyDir: {}
```
For consistency with the pull_build_next_gen job and to ensure high-performance I/O for Bazel operations on GCP, consider using an ephemeral volume with hyperdisk-rwo instead of a standard emptyDir. This also avoids potential DiskPressure issues on the node's root disk.
```yaml
ephemeral:
  volumeClaimTemplate:
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 150Gi
      storageClassName: hyperdisk-rwo
```

```yaml
- name: bazel-out-overlay
  emptyDir: {}
- name: bazel-out-merged
  emptyDir: {}
```
Given the heavy Bazel usage in unit tests, using a standard emptyDir for bazel-out-merged may lead to performance bottlenecks or node disk pressure. It is recommended to use an ephemeral volume with hyperdisk-rwo for better performance and isolation, consistent with the pull_build_next_gen job.
```yaml
- name: bazel-out-merged
  ephemeral:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
        storageClassName: hyperdisk-rwo
```
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR migrates multiple CI/CD jobs for the TiDB NextGen pipeline from an internal infrastructure to GCP. The migration involves updating resource configurations, adapting pod specifications for Kubernetes, and modifying the pipeline scripts to account for the new environment (e.g., replacing legacy cache URLs and introducing hotfixes for Bazel dependencies). The approach is thorough, with extensive updates across pipeline scripts and pod configurations. However, several issues related to error handling, maintainability, and best practices need to be addressed to ensure the reliability and scalability of the pipeline.
Critical Issues
- **Error Handling in Hotfix Scripts**
  - File: `pipeline.groovy` across multiple jobs
  - Lines: Multiple (see sections with `sh` commands modifying Bazel configurations)
  - Issue: The Bash scripts modifying Bazel configurations assume the presence of certain files (`WORKSPACE`, `DEPS.bzl`, `.bazelrc`) and do not handle missing files gracefully. This could lead to pipeline failures during execution.
  - Suggested Solution: Add explicit checks for file existence and fail gracefully with informative error messages if critical files are missing. Example:
    ```bash
    if [ ! -f "$f" ]; then
        echo "Error: $f not found. Ensure the file exists before proceeding."
        exit 1
    fi
    ```
- **Potential Security Issue with `sed` Usage**
  - File: Multiple pipeline scripts
  - Lines: Sections modifying Bazel configurations (e.g., `.bazelrc`, `Makefile`)
  - Issue: The `sed` commands modify critical configuration files without verifying the correctness of input parameters or changes, which could inadvertently introduce incorrect configurations.
  - Suggested Solution: Validate inputs before making changes and log the modifications for traceability. Additionally, consider using tools like `git diff` to verify changes after the script runs.
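A sketch of that `git diff` verification, run against a throwaway repository (the fixture contents are illustrative, not from the real WORKSPACE file):

```shell
#!/usr/bin/env bash
# Sketch: apply the sed cleanup, then use git diff to confirm that only the
# expected file changed. Fixture contents are invented for the example.
set -euo pipefail
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.email ci@example.com
git config user.name ci
printf 'cache: bazel-cache.pingcap.net:8080\nkeep: this line\n' > WORKSPACE
git add WORKSPACE
git commit -qm baseline

# The modification under test.
sed -i -E '/bazel-cache[.]pingcap[.]net:8080/d' WORKSPACE

# Verify: only WORKSPACE may appear in the diff.
changed=$(git diff --name-only)
if [ "$changed" != "WORKSPACE" ]; then
  echo "unexpected modifications: $changed" >&2
  exit 1
fi
echo "only WORKSPACE was modified"
```

In a pipeline, printing `git diff` itself (not just the file names) also gives a traceable log of exactly what the hotfix changed.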
Code Improvements
- **Hardcoded Resource Limits in Pod Configurations**
  - File: Multiple `pod.yaml` files
  - Lines: Resource limits and requests (e.g., memory, CPU, ephemeral storage configurations)
  - Issue: Resource configurations are hardcoded and may not adapt well to varying workloads or future scaling needs.
  - Suggested Solution: Use environment variables or configuration files to set resource limits dynamically based on workload requirements.
- **Unnecessary Code Duplication in Hotfix Logic**
  - File: Multiple pipeline scripts
  - Lines: Hotfix sections for Bazel dependencies across jobs
  - Issue: The same logic for cleaning legacy cache URLs and disabling remote cache usage is repeated across multiple jobs.
  - Suggested Solution: Extract common hotfix logic into a shared script or function and call it from the pipeline scripts. Example:
    ```bash
    # shared_hotfix.sh
    clean_bazel_cache() {
        # Hotfix logic
    }
    ```
- **Lack of Logging for Hotfixes**
  - File: `pipeline.groovy` across multiple jobs
  - Lines: Sections modifying Bazel configurations
  - Issue: The hotfix scripts do not log their actions, making debugging difficult in case of failures.
  - Suggested Solution: Add logging statements to capture the actions performed, e.g., `echo "Modified WORKSPACE file: removed cache URLs"`.
Best Practices
- **Environment Variable Management**
  - File: Multiple `pod.yaml` files
  - Lines: Sections introducing `GOPROXY` and other environment variables
  - Issue: Hardcoded environment variables make it difficult to manage different environments (e.g., staging vs. production).
  - Suggested Solution: Use Kubernetes ConfigMaps or Secrets to manage environment variables dynamically.
- **Testing Coverage for Hotfixes**
  - File: Multiple pipeline scripts
  - Lines: Sections introducing hotfix stages
  - Issue: There is no mention of automated testing for the hotfix logic.
  - Suggested Solution: Add unit tests or integration tests to validate the hotfix scripts in isolation before they are applied to the pipeline.
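One way to test the hotfix in isolation is to run the same sed expression against a throwaway fixture and assert on the result; the fixture lines below are made up for the test:

```shell
#!/usr/bin/env bash
# Isolated test for the cache-URL cleanup: legacy URLs must disappear,
# unrelated lines must survive. Fixture contents are invented for the test.
set -euo pipefail
fixture=$(mktemp)
cat > "$fixture" <<'EOF'
http_archive(url = "https://mirror.bazel.build/github.com/foo/bar.tar.gz")
http_archive(url = "https://github.com/foo/bar.tar.gz")
EOF

# Same expression as the pipeline hotfix.
sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$fixture"

grep -q 'github.com/foo/bar' "$fixture"        # unrelated line kept
! grep -q 'mirror.bazel.build' "$fixture"      # legacy mirror removed
echo "hotfix behaves as expected on the fixture"
```

Such a check runs in milliseconds and can gate changes to the hotfix itself without replaying a full CI job.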
- **Documentation for Migration Changes**
  - File: N/A (general feedback)
  - Issue: The PR description is sparse and does not explain the rationale behind major changes (e.g., why certain cache URLs were removed or why ephemeral storage was increased).
  - Suggested Solution: Update the PR description with detailed explanations for significant changes, including their impact and expected benefits.
- **Naming Conventions for Pod Labels**
  - File: Multiple `pod.yaml` files
  - Lines: Sections adding pod labels
  - Issue: Labels like `master: "1"` are unclear and may not provide meaningful context.
  - Suggested Solution: Use descriptive labels that clarify their purpose, e.g., `role: "master-pipeline"`.
Conclusion
While the migration to GCP is comprehensive, the PR can benefit from improved error handling, reduction of code duplication, and adherence to best practices for maintainability and scalability. Addressing these issues will enhance the robustness and reliability of the pipeline in the new environment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary:
This PR migrates the TiDB "next-gen" pipeline jobs to Google Cloud Platform (GCP). It introduces changes to the pipeline configurations, pod templates, and associated scripts to update cache URLs, adjust resource allocation, and configure authentication for Docker and Bazel. Overall, the changes are substantial but coherent, demonstrating a clear intention to adapt the pipeline to the new environment. However, some areas need improvement for better maintainability, robustness, and adherence to best practices.
Critical Issues:
- **Hardcoded Cache URL Cleanup in Multiple Scripts**
  - File: `pipeline.groovy` across several jobs (e.g., `pull_build_next_gen`, `pull_unit_test_next_gen`, etc.)
  - Issue: URLs for cleaning legacy cache settings are hardcoded, making them error-prone and difficult to manage if changes are needed across multiple files.
  - Suggested Solution: Extract these URLs into a configuration file or centralized environment variables to avoid duplication and simplify maintenance.
    ```groovy
    def legacyCacheUrls = [
        'bazel-cache.pingcap.net:8080',
        'ats.apps.svc',
        'cache.hawkingrei.com',
        'mirror.bazel.build'
    ]
    legacyCacheUrls.each { url ->
        sh "sed -i -E '/${url}/d' $f"
    }
    ```
- **Ephemeral Volumes Misconfigured**
  - File: `pod.yaml` across several jobs (e.g., `pull_mysql_test_next_gen`, `pull_unit_test_next_gen`)
  - Issue: Ephemeral volumes are used in place of persistent storage for Bazel data and the repository cache, which can lose cache data between job runs and increase build times.
  - Suggested Solution: Use persistent volumes for Bazel data so the cache persists across job runs.
    ```yaml
    volumes:
      - name: bazel-data
        persistentVolumeClaim:
          claimName: bazel-data-pvc
    ```
Code Improvements:
- **Unnecessary Bash Script Logic Duplication**
  - File: Various `pipeline.groovy` files (e.g., `pull_build_next_gen`, `pull_unit_test_next_gen`)
  - Issue: Temporary hotfix scripts for Bazel cache settings are repeated across multiple files.
  - Suggested Solution: Extract common logic into a shared library function or script, and call it from each pipeline.
    ```groovy
    def cleanBazelCache(repoDir) {
        sh """#!/usr/bin/env bash
        set -euxo pipefail
        for f in WORKSPACE DEPS.bzl; do
            [ -f "\$f" ] || continue
            sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "\$f"
        done
        """
    }
    cleanBazelCache(REFS.repo)
    ```
- **Resource Allocation Adjustments**
  - File: Pod templates across various jobs (e.g., `pull_unit_test_next_gen`, `pull_mysql_client_test_next_gen`)
  - Issue: Resource requests and limits are unevenly configured across jobs, potentially leading to inefficiencies or over-allocation in GCP environments.
  - Suggested Solution: Standardize resource allocation across jobs based on observed usage patterns or benchmarks.
    ```yaml
    resources:
      requests:
        memory: 8Gi
        cpu: "4"
      limits:
        memory: 16Gi
        cpu: "8"
    ```
Best Practices:
- **Missing Documentation for Temporary Hotfix Logic**
  - File: Scripts and `pipeline.groovy` files
  - Issue: Temporary migration scripts (e.g., cache URL cleanup) lack comments explaining their purpose and the timeline for removal.
  - Suggested Solution: Add comments clarifying the temporary nature of these changes and plan for their removal.
    ```bash
    # Temporary cleanup of legacy cache URLs for GCP migration.
    # TODO: Remove this script after full migration is complete.
    sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
    ```
- **Testing Coverage for Migration Changes**
  - Issue: The PR description lists job success/failure but lacks details about testing coverage for edge cases (e.g., handling missing cache files, ephemeral storage constraints).
  - Suggested Solution: Add dedicated tests or validation scripts to verify the migration is robust against edge cases, such as missing dependencies or storage issues.
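For the missing-file edge case specifically, a small check like the following (the empty-directory setup is hypothetical) confirms that the guard in the cleanup loop makes it a safe no-op:

```shell
#!/usr/bin/env bash
# Edge-case sketch: the cleanup loop should be a harmless no-op when
# WORKSPACE/DEPS.bzl are absent, thanks to the `[ -f "$f" ] || continue` guard.
set -euo pipefail
tmp=$(mktemp -d)
cd "$tmp"                              # empty checkout: neither file exists
for f in WORKSPACE DEPS.bzl; do
  [ -f "$f" ] || continue
  sed -i -E '/mirror[.]bazel[.]build/d' "$f"
done
echo "cleanup completed without error on an empty checkout"
```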
- **Environment Variable Naming Consistency**
  - File: Pod templates (`pod.yaml`)
  - Issue: Environment variable names like `GOPROXY` are consistent, but their values should be centralized across jobs to avoid misalignment.
  - Suggested Solution: Define shared environment variables in a global configuration and reference them in pod templates.
    ```yaml
    env:
      - name: GOPROXY
        valueFrom:
          configMapKeyRef:
            name: global-config
            key: go-proxy
    ```
Summary of Suggested Changes:
- Refactor hardcoded URLs and duplicated logic into centralized scripts or configuration.
- Adjust resource allocation for efficiency and maintain consistency across jobs.
- Use persistent storage for Bazel cache volumes to improve build reliability.
- Add documentation and tests for temporary changes and edge cases.
- Centralize environment variable management to avoid misalignment across job configurations.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR migrates the TiDB next-gen job pipelines to Google Cloud Platform (GCP), replacing legacy configurations with GCP-compatible settings. Changes include updating container images, modifying pipeline scripts to remove unstable cache URLs, and adapting volumes and resources for GCP infrastructure. The approach is largely straightforward, but introduces temporary hotfixes for migration replay, adjusts Kubernetes pod definitions, and updates CI pipeline configurations. Overall, the migration is functional but contains areas that could benefit from cleanup and optimization.
Critical Issues
- **Temporary Hotfixes Persist in Production Code**
  - Files: `pipeline.groovy` across multiple pipelines (e.g., `pull_build_next_gen`, `pull_integration_realcluster_test_next_gen`, etc.)
  - Issue: Temporary hotfixes, such as modifying the `Makefile` and `.bazelrc` files, are embedded directly in the pipelines. This creates long-term maintenance risks and makes later cleanup harder.
  - Suggestion: Extract hotfix logic into clearly versioned scripts or gate these changes with feature flags, and document the cleanup timelines.
    ```groovy
    if (MIGRATION_REPLAY) {
        sh './scripts/hotfix_bazel_cache.sh'
    }
    ```
- **Potential Resource Waste in Ephemeral Volumes**
  - Files: `pod.yaml` across multiple pipelines
  - Issue: Ephemeral volumes are configured with large `storageClassName: hyperdisk-rwo` sizes (e.g., 150Gi, 200Gi), but it is unclear whether this is necessary for all jobs.
  - Suggestion: Audit actual disk usage during pipeline runs and scale volume sizes down based on real needs. Over-provisioning may increase costs unnecessarily.
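A usage audit can be as small as sampling the mounted volume at the end of a run; the mount path and environment variable below are assumptions for illustration:

```shell
#!/usr/bin/env bash
# End-of-job audit sketch: report how much of the ephemeral volume was used,
# so the 150Gi/200Gi claims can be right-sized. Paths are assumptions.
vol="${BAZEL_OUT_DIR:-/tmp}"                 # hypothetical volume mount point
used_kb=$(du -sk "$vol" 2>/dev/null | awk '{print $1}')
used_mb=$(( ${used_kb:-0} / 1024 ))
echo "volume=$vol used=${used_mb}MiB"
```

Collecting this line from a few representative runs gives the data needed to shrink (or justify) the current storage requests.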
Code Improvements
- **Code Duplication Across Pipeline Scripts**
  - Files: All `pipeline.groovy` files (e.g., `pull_build_next_gen`, `pull_unit_test_next_gen`, etc.)
  - Issue: Similar logic for hotfixes and preparation steps is duplicated across multiple scripts.
  - Suggestion: Refactor common pipeline logic into reusable shared functions or libraries to improve maintainability.
    ```groovy
    def applyHotfixes(repoDir) {
        dir(repoDir) {
            sh '''
            # Hotfix logic here
            '''
        }
    }
    ```
- **Hardcoded Resource Requests and Limits**
  - Files: `pod.yaml` across pipelines
  - Issue: Resource requests and limits (e.g., `cpu`, `memory`) are hardcoded and may not adapt well to future workloads or environments.
  - Suggestion: Parameterize resource configurations using environment variables or pipeline inputs to simplify tuning.
    ```yaml
    resources:
      requests:
        memory: ${MEMORY_REQUEST:-24Gi}
        cpu: ${CPU_REQUEST:-6}
    ```
Best Practices
- **Missing Documentation for Hotfixes**
  - Files: Pipeline scripts and CI configurations (`pipeline.groovy`, `pod.yaml`)
  - Issue: Temporary hotfixes lack inline comments or documentation explaining their purpose and removal timelines.
  - Suggestion: Add comments with clear context and a timeline for cleanup.
    ```groovy
    // TODO: Remove this hotfix after migration is stable (expected Q3 2026).
    ```
- **Testing Coverage for Migration Replay**
  - Files: CI pipeline configurations (`latest-presubmits-next-gen.yaml`)
  - Issue: The `pull_mysql_test_next_gen` job has `optional: true`, which may make it harder to detect issues during migration replay.
  - Suggestion: Temporarily make critical replay jobs mandatory or add additional validation steps to ensure migration correctness.
- **Style Inconsistencies in Shell Scripts**
  - Files: Embedded shell scripts in `pipeline.groovy`
  - Issue: Some scripts use inconsistent formatting, such as mixing `#` comments with inline comments, or inconsistent error handling (`set -euxo pipefail` vs. `|| true`).
  - Suggestion: Standardize shell script formatting and error handling for readability and robustness.
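One possible shared prologue (a sketch, not an existing project convention) that keeps strict mode while making tolerated failures explicit instead of scattering bare `|| true`:

```shell
#!/usr/bin/env bash
# Sketch of a standard prologue: strict mode everywhere, with an explicit
# wrapper replacing bare `|| true` so tolerated failures are self-documenting.
set -euo pipefail

optional_step() {
  # Run a step that is allowed to fail, but log when it does.
  "$@" || echo "optional step failed (ignored): $*" >&2
}

optional_step false                    # tolerated failure, logged to stderr
echo "script continues after a tolerated failure"
```

Adopting one such prologue across the embedded scripts would resolve the `set -euxo pipefail` vs. `|| true` inconsistency the review points out.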
Conclusion
While the migration is functional, there are structural and resource optimization opportunities to improve maintainability, scalability, and alignment with best practices. Addressing the critical issues and refactoring common logic will reduce technical debt and ensure a smoother transition to GCP.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This pull request migrates several CI pipeline jobs for the TiDB project to Google Cloud Platform (GCP). The approach involves updating Kubernetes pod specifications, pipeline scripts, and dependencies to align with GCP infrastructure requirements. While the changes are comprehensive and mostly functional, there are notable issues with maintainability, error handling, and adherence to best practices. Additional refinement is needed to improve long-term sustainability and robustness.
Critical Issues
- **Hardcoded Hotfix Commands:**
  - Files: `pipeline.groovy` across multiple jobs (e.g., `pull_build_next_gen`, `pull_integration_realcluster_test_next_gen`, etc.)
  - Lines: Various `sed` commands modifying `WORKSPACE`, `DEPS.bzl`, `.bazelrc`, and `Makefile`.
  - Issue: These hotfixes are hardcoded and temporary. If the underlying dependencies or files change, the commands may break or fail silently.
  - Solution: Abstract these commands into reusable functions or scripts stored in version control, and validate changes with tests to ensure the integrity of modifications.
    ```groovy
    steps {
        sh '''
        ./scripts/fix_bazel_deps.sh
        '''
    }
    ```
    This centralizes the logic and makes it easier to update or debug.
- **Volume Management:**
  - Files: Various `pod.yaml` files (e.g., `pull_build_next_gen/pod.yaml`, `pull_integration_realcluster_test_next_gen/pod.yaml`)
  - Issue: Persistent volumes are replaced with ephemeral volumes. While this simplifies the setup, it may lead to data loss during container restarts and complicate debugging.
  - Solution: Evaluate whether ephemeral volumes are appropriate for all use cases. Consider persistent volumes with proper clean-up mechanisms for critical data.
Code Improvements
- **Environment Variable Configuration:**
  - Files: Multiple `pod.yaml` files
  - Issue: The `GOPROXY` environment variable is set to a specific proxy, assuming all environments have access to it.
  - Solution: Add fallback values and document this dependency. For example:
    ```yaml
    env:
      - name: GOPROXY
        value: "${GOPROXY:-https://proxy.golang.org,direct}"
    ```
- **Resource Requests and Limits:**
  - Files: Multiple `pod.yaml` files
  - Issue: Resource allocations (CPU, memory, storage) are significantly increased without justification, which may cause inefficient resource utilization.
  - Solution: Document why these values were chosen, and consider profiling the pipeline jobs to optimize resource requests.
- **Redundant Code:**
  - Files: Multiple `pipeline.groovy` files (e.g., `pull_integration_realcluster_test_next_gen`, `pull_unit_test_next_gen`)
  - Issue: Repeated logic for modifying Bazel dependencies and configurations.
  - Solution: Refactor common logic into shared scripts or Groovy libraries. For example:
    ```groovy
    def fixBazelDeps(repoDir) {
        sh """
        cd ${repoDir}
        ./scripts/fix_bazel_deps.sh
        """
    }
    ```
Best Practices
- **Testing Coverage:**
  - Issue: No evidence of tests specifically validating the migration to GCP.
  - Solution: Add integration tests to ensure pipelines work as expected in GCP, and simulate GCP environments during test runs to verify compatibility.
- **Documentation:**
  - Issue: The PR lacks detailed documentation on the reasons for the migration and the implications of changes (e.g., storage class updates, resource limits).
  - Solution: Update the PR description with detailed explanations. For example:
    - Why are ephemeral volumes used?
    - Why are resource requests increased?
    - Why are certain dependencies removed from Bazel configurations?
- **Style Consistency:**
  - Issue: Inconsistent commenting and inline explanations across scripts.
  - Solution: Ensure all scripts follow a consistent style guide. Use comments to explain commands that modify critical files, especially `sed` commands.
- **Naming Conventions:**
  - Files: Groovy and YAML files
  - Issue: Variable names like `REFS.repo` and `POD_TEMPLATE_FILE` are unclear without context.
  - Solution: Use more descriptive names, e.g., `REPO_DIRECTORY` or `KUBERNETES_POD_TEMPLATE`.
Suggested Follow-Up
- **Refactor Scripts:** Consolidate repeated logic into shared utilities or libraries to improve maintainability.
- **Validation Pipeline:** Create a dedicated pipeline to validate changes in the migration process, especially for Bazel dependencies.
- **Performance Benchmarking:** Profile pipelines in GCP to verify whether resource allocations are optimal.
- **Final Cleanup:** Plan for removing temporary hotfixes once the migration stabilizes, and ensure long-term compatibility.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This pull request migrates several TiDB NextGen CI jobs to Google Cloud Platform (GCP), introducing updates to pipeline configurations, container specifications, and temporary fixes for legacy dependencies in Bazel files. The approach primarily involves modifying pipeline Groovy scripts, updating Kubernetes pod configurations, and addressing compatibility issues with external cache/mirror URLs. While the migration addresses key compatibility and resource allocation concerns, some areas require clarification, refactoring, and adherence to best practices.
Critical Issues
- **Bazel Cache Configuration Hardcoding**
  - File: `pipelines/pingcap/tidb/latest/pull_build_next_gen/pipeline.groovy` and multiple pipeline scripts.
  - Issue: Temporary fixes such as the hardcoded removal of cache URLs (`sed` commands) and disabling the remote cache in `.bazelrc` are fragile and may lead to long-term maintainability issues.
  - Suggested Solution: Introduce centralized configuration management or an environment variable to handle cache/mirror URLs dynamically. This improves scalability and reduces hardcoding across scripts.
    ```groovy
    def bazelCacheURLs = [
        "bazel-cache.pingcap.net:8080",
        "ats.apps.svc",
        "cache.hawkingrei.com",
        "mirror.bazel.build"
    ]
    sh """
    for f in WORKSPACE DEPS.bzl; do
        [ -f "\$f" ] && sed -i -E '/${bazelCacheURLs.join('|')}/d' "\$f"
    done
    """
    ```
- **Ephemeral Volume Overprovisioning Risks**
  - File: Multiple pod YAML files (`pod.yaml`).
  - Issue: Volumes such as `bazel-out-merged` are allocated large ephemeral storage (200Gi), which might strain GCP resources and increase costs.
  - Suggested Solution: Analyze actual storage needs and apply quotas based on job requirements. Monitor usage to adjust limits dynamically.
Code Improvements
- **Repeated Temporary Fixes Across Pipelines**
  - File: Multiple pipeline scripts (`pull_build_next_gen/pipeline.groovy`, `pull_integration_realcluster_test_next_gen/pipeline.groovy`, etc.).
  - Issue: Hotfixes for Bazel cache URLs and `Makefile` modifications are duplicated across several scripts.
  - Suggested Solution: Encapsulate common fixes in a reusable function or shared library to reduce duplication. Example:
    ```groovy
    def applyTemporaryBazelFixes(repoPath) {
        sh """
        for f in WORKSPACE DEPS.bzl; do
            [ -f "\$repoPath/\$f" ] && sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "\$repoPath/\$f"
        done
        sed -i 's/^check: check-bazel-prepare /check: /' "\$repoPath/Makefile" || true
        """
    }
    ```
- **Resource Requests and Limits**
  - File: Pod YAML files (`pull_mysql_client_test_next_gen/pod.yaml`, `pull_unit_test_next_gen/pod.yaml`, etc.).
  - Issue: Resource requests and limits (e.g., CPU and memory) vary widely without clear justification.
  - Suggested Solution: Align resource allocation with job requirements using benchmarks and adjust based on historical data. Avoid over-allocation by defining reasonable baselines.
Best Practices
- **Testing and Validation of Bazel Fixes**
  - File: Multiple pipeline scripts.
  - Issue: Temporary Bazel fixes are applied without automated validation to ensure they don't inadvertently break functionality.
  - Suggested Solution: Add unit tests or validation scripts to verify the integrity of Bazel files after modification.
    ```bash
    # Example validation script
    if grep -qE 'bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build' WORKSPACE DEPS.bzl; then
        echo "Error: Legacy URLs still exist in Bazel files."
        exit 1
    fi
    ```
- **Documentation for Temporary Fixes**
  - File: Pipeline scripts (`pipeline.groovy`).
  - Issue: Temporary fixes lack inline comments explaining their rationale and the timeframe for removal.
  - Suggested Solution: Add comments specifying why these fixes are temporary and when they should be revisited.
    ```bash
    # Temporary Bazel hotfix: remove unstable cache URLs (to be removed after migration completes).
    sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' WORKSPACE
    ```
- **Consistent Naming and Labels**
  - File: `prow-jobs/pingcap/tidb/latest-presubmits-next-gen.yaml`
  - Issue: New labels (e.g., `master: "1"`) are introduced without clear explanation or consistency across jobs.
  - Suggested Solution: Standardize label usage and document its purpose in the YAML files or related documentation.
By addressing these issues, the migration can achieve better maintainability, scalability, and alignment with best practices.
```yaml
# value: /share/.go
# - name: GOCACHE
#   value: /share/.cache/go-build
ephemeral-storage: 150Gi
```
Will it make the pod unschedulable?
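Whether the pod can be scheduled depends on node capacity: the scheduler only binds it if some node's allocatable `ephemeral-storage` covers the full 150Gi request, so on node pools with smaller boot disks it would stay Pending. A sketch of a pre-flight check — the node names and sizes below are inline sample data standing in for real `kubectl get nodes` output (e.g. via `-o custom-columns`), not values from this cluster:

```shell
#!/usr/bin/env bash
set -euo pipefail

# In a real cluster this listing would come from something like:
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,EPH:.status.allocatable.ephemeral-storage'
# Inline sample data is used here so the check itself is runnable anywhere.
nodes='node-a 98560Mi
node-b 200Gi'

request_bytes=$((150 * 1024**3))  # the pod's 150Gi ephemeral-storage request

# Convert a Kubernetes quantity (Mi/Gi suffix or plain bytes) to bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024**3 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024**2 )) ;;
    *)   echo "$1" ;;
  esac
}

fits=0
while read -r name qty; do
  if [ "$(to_bytes "$qty")" -ge "$request_bytes" ]; then
    echo "node $name can hold the 150Gi request ($qty allocatable)"
    fits=1
  fi
done <<< "$nodes"

if [ "$fits" -ne 1 ]; then
  echo "no node can satisfy the request: the pod would stay Pending" >&2
fi
```

With the sample data, only `node-b` (200Gi allocatable) can accommodate the request.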
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR migrates multiple CI/CD jobs for the TiDB project to run on Google Cloud Platform (GCP). It includes updates to pipelines, pod configurations, and environment settings for compatibility with GCP infrastructure. The approach primarily involves replacing legacy configurations, updating resource allocations, and cleaning up cache-related paths. The changes are generally well-organized, but there are areas for improvement, particularly in error handling, performance considerations, and adherence to best practices.
Critical Issues
- **Hardcoded Cache Cleanup Without Validation (Multiple Groovy Files)**
  - Files: `pull_build_next_gen/pipeline.groovy`, `pull_integration_realcluster_test_next_gen/pipeline.groovy`, `pull_unit_test_ddlv1.groovy`, `pull_unit_test_next_gen/pipeline.groovy`
  - Issue: The `sed` commands for cleaning up legacy cache URLs and modifying Makefiles assume all files exist and are writable. If these assumptions are incorrect, the pipeline could silently fail or produce unexpected behavior.
  - Suggestion: Add existence checks before attempting file modifications and log errors explicitly. Example:

```groovy
sh '''#!/usr/bin/env bash
for file in WORKSPACE DEPS.bzl Makefile; do
  if [ -f "$file" ]; then
    sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$file"
  else
    echo "Warning: $file not found or inaccessible" >&2
  fi
done
'''
```
- **Resource Allocation Inconsistencies (Pod Configurations)**
  - Files: Multiple `pod.yaml` files, e.g., `pull_build_next_gen/pod.yaml`, `pull_integration_realcluster_test_next_gen/pod.yaml`
  - Issue: Resource requests and limits (e.g., `memory`, `cpu`) are inconsistent across jobs without justification. For instance, some pods request 24Gi memory while others request up to 48Gi without clear reasoning. Overprovisioning could lead to inefficiencies.
  - Suggestion: Standardize resource allocation and document any deviations. If different jobs have varying requirements, add comments explaining the rationale.
- **Ephemeral Volumes Without Backup Strategy**
  - Files: Multiple `pod.yaml` files, e.g., `pull_build_next_gen/pod.yaml`, `pull_unit_test_next_gen/pod.yaml`
  - Issue: Ephemeral volumes (`emptyDir`) are used for critical caching purposes without fallback mechanisms. If pod restarts occur, cached data will be lost, potentially slowing down builds/tests.
  - Suggestion: Use persistent volumes or implement fallback logic to rehydrate caches after pod restarts.
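A minimal sketch of such rehydration logic, assuming a durable backup location exists (shown here as a local directory so the demo is self-contained; on GCP the backup could instead be a GCS bucket synced with `gsutil rsync` — an assumption, not something this PR implements):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rehydrate an emptyDir-backed cache from a durable copy if it is empty.
# BACKUP is a plain directory for demonstration; swap `cp` for your real
# transport (e.g. a bucket sync) in an actual pipeline.
rehydrate_cache() {
  local cache_dir="$1" backup_dir="$2"
  mkdir -p "$cache_dir"
  if [ -z "$(ls -A "$cache_dir")" ] && [ -d "$backup_dir" ]; then
    cp -a "$backup_dir/." "$cache_dir/"
    echo "cache rehydrated from $backup_dir"
  else
    echo "cache already warm or no backup available"
  fi
}

# Demo: a fresh (empty) cache restored from a seeded backup.
work="$(mktemp -d)"
mkdir -p "$work/backup"
echo 'cached-artifact' > "$work/backup/build.out"
rehydrate_cache "$work/cache" "$work/backup"
```

The function is idempotent: on a warm cache it does nothing, so it can run unconditionally at pod start.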
Code Improvements
- **Temporary Hotfix Logic Should Be Modularized**
  - Files: Multiple Groovy files (e.g., `pull_integration_realcluster_test_next_gen/pipeline.groovy`)
  - Issue: The hotfix logic for cleaning cache URLs and modifying Makefiles is duplicated across several pipeline files. This leads to code duplication and maintenance challenges.
  - Suggestion: Refactor the hotfix logic into a reusable shared library or function. Example:

```groovy
void cleanLegacyCache(String repoDir) {
    dir(repoDir) {
        sh '''#!/usr/bin/env bash
        for file in WORKSPACE DEPS.bzl Makefile; do
          if [ -f "$file" ]; then
            sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$file"
          fi
        done
        '''
    }
}
```
- **Redundant Volume Definitions**
  - Files: Multiple `pod.yaml` files (e.g., `pull_mysql_test_next_gen/pod.yaml`)
  - Issue: Persistent volumes (`persistentVolumeClaim`) are replaced by ephemeral volumes without clear reasoning in some cases. This creates redundancy and inconsistency.
  - Suggestion: Consolidate volume definitions and document why ephemeral volumes are preferred. If persistent volumes are retired, update related comments.
Best Practices
- **Missing Error Handling for Bash Commands**
  - Files: Multiple Groovy files (e.g., `pull_unit_test_ddlv1.groovy`, `pull_integration_e2e_test_next_gen/pipeline.groovy`)
  - Issue: Bash scripts (`sh`) lack robust error handling for commands that could fail silently.
  - Suggestion: Use `set -o pipefail` and add explicit error messages where applicable.
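To make the failure mode concrete: without `set -o pipefail`, a pipeline's exit status is that of its last command, so a failing producer feeding into `sort`, `tee`, or similar is silently swallowed. A runnable demonstration:

```shell
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is taken from the LAST command,
# so the failure of `false` is masked by the succeeding `sort`.
set +o pipefail
without=0
false | sort > /dev/null || without=$?
# `without` is still 0: the failure was hidden.

# With pipefail, the pipeline reports the rightmost non-zero status instead,
# so `set -e` or an explicit `|| handle_error` can react to it.
set -o pipefail
with=0
false | sort > /dev/null || with=$?

echo "without pipefail: $without, with pipefail: $with"
```

This is why the common CI preamble is `set -euo pipefail` rather than `set -e` alone.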
- **Testing Coverage for Hotfix Logic**
  - Files: All affected pipelines
  - Issue: The temporary hotfix logic is applied without mechanisms to verify its correctness or impact on the jobs.
  - Suggestion: Add a validation step or assertions after applying the hotfix logic to ensure changes are effective.
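A self-checking variant of the hotfix could assert both that the legacy patterns are gone and that the edit actually changed something. This is a sketch on a throwaway file; in the pipeline the targets would be `WORKSPACE` and `DEPS.bzl`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway stand-in for WORKSPACE/DEPS.bzl.
f="$(mktemp)"
printf 'http://bazel-cache.pingcap.net:8080/x\nkeep-this-line\n' > "$f"

before="$(sha256sum "$f")"
sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$f"
after="$(sha256sum "$f")"

# Assertion 1: the hotfix removed every legacy URL.
if grep -qE 'bazel-cache[.]pingcap[.]net:8080' "$f"; then
  echo "hotfix ineffective: legacy URLs remain" >&2; exit 1
fi
# Assertion 2: the file really changed (guards against sed silently matching nothing).
if [ "$before" = "$after" ]; then
  echo "hotfix was a no-op: expected at least one deletion" >&2; exit 1
fi
echo "hotfix validated"
```

The checksum comparison catches the subtle case where the `sed` pattern has drifted and no longer matches anything, which a grep-only check would miss.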
- **Unclear Comments and Documentation**
  - Files: Multiple Groovy files and `pod.yaml` files
  - Issue: Comments explaining changes (e.g., resource allocation, cache modifications) are sparse or non-existent. This makes it harder for future developers to understand the rationale behind decisions.
  - Suggestion: Add comments wherever significant changes are made, especially for resource configurations and temporary hotfixes.
- **Labels in Prow Job Configuration**
  - Files: `latest-presubmits-next-gen.yaml`
  - Issue: Labels like `master: "1"` are added without explanation. Their purpose is unclear and should be documented.
  - Suggestion: Add comments describing the significance of these labels.
Conclusion
While the migration is comprehensive, improvements in error handling, resource standardization, code modularity, and documentation will enhance maintainability and reliability. Addressing the critical issues around cache handling and resource allocation should be prioritized.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: wuhuizuo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
57d41f1 to ae138a8
[LGTM Timeline notifier] Timeline:

New changes are detected. LGTM label has been removed.
@dillon-zheng: The following test failed, say
Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR migrates various CI jobs for the TiDB "next-gen" pipeline to Google Cloud Platform (GCP) by updating configuration files, pod specifications, and pipeline scripts. The approach includes replacing outdated cache URLs, modifying resource requests/limits, transitioning to GCP-specific container registries, and adding ephemeral volumes. While the migration is well-structured, there are several opportunities to improve maintainability, consistency, and error handling.
Critical Issues
- **Unconditional Cache Removal**
  - File: Multiple pipeline `.groovy` files (e.g., `pull_build_next_gen/pipeline.groovy`, `pull_unit_test_ddlv1.groovy`)
  - Line(s): Examples: 40, 51, 80 (across multiple files)
  - Issue: The script unconditionally removes legacy URLs (e.g., `bazel-cache.pingcap.net`) without checking if other valid URLs are present. This could cause unintended disruptions if GCP mirrors fail or are misconfigured.
  - Solution: Add validation before removal. Example:

```bash
if grep -q 'bazel-cache.pingcap.net' "$file"; then
  sed -i -E '/bazel-cache[.]pingcap[.]net/d' "$file"
fi
```

Also, consider logging the URLs removed for debugging purposes.
- **Hardcoded GOPROXY Value**
  - File: Multiple pod `yaml` files (e.g., `pull_build_next_gen/pod.yaml`, `pull_unit_test_next_gen/pod.yaml`)
  - Line(s): Lines including the `GOPROXY` env variable
  - Issue: The hardcoded `GOPROXY` URL does not fall back to default or other mirrors if unavailable. This could block builds in case of network issues.
  - Solution: Add a fallback mechanism, e.g., `value: "https://us-go.pkg.dev/...|https://proxy.golang.org,direct|https://goproxy.cn"`
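For context on the separators (a fact from the Go modules documentation, not from this PR): in a `GOPROXY` list, `,` falls through to the next entry only on HTTP 404/410, while `|` also falls through on network errors — so `|` is what provides resilience against an unreachable mirror. Note too that entries after `direct` are never consulted, so ordering matters when composing such a value. A small sketch that just prints the fallback chain of a candidate list (the `us-go.pkg.dev` URL is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A candidate GOPROXY value: '|' = fall through on any error,
# ',' = fall through only on 404/410, 'direct' = fetch from origin.
goproxy='https://us-go.pkg.dev/example/repo|https://proxy.golang.org,direct'

# Split on both separators, keeping order, and print the fallback chain.
IFS='|,' read -ra chain <<< "$goproxy"
for i in "${!chain[@]}"; do
  echo "$((i + 1)). ${chain[$i]}"
done
```

A split like this can feed a simple reachability probe in CI before a build is attempted.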
- **Missing Error Handling for Ephemeral Volume Creation**
  - File: Pod specification files (e.g., `pod.yaml` for `pull_unit_test_next_gen`)
  - Line(s): `volumeClaimTemplate` sections
  - Issue: If ephemeral volumes fail to initialize (e.g., incorrect `storageClassName`), there is no retry or fallback mechanism.
  - Solution: Add validation or retries for volume provisioning during initialization by monitoring pod readiness in the pipeline.
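A generic retry wrapper is enough to implement the suggested monitoring from the pipeline's shell steps. This sketch is hypothetical (the `retry` helper and `flaky_probe` stand-in are not from the PR); in a real pipeline the probed command would be something like `kubectl wait --for=condition=Ready pod/<name>`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Retry a command up to $1 times with a $2-second delay between attempts.
retry() {
  local attempts="$1" delay="$2"; shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      echo "succeeded on attempt $i"
      return 0
    fi
    sleep "$delay"
  done
  echo "failed after $attempts attempts" >&2
  return 1
}

# Demo: a probe that only succeeds on its 3rd call, simulating a volume
# that takes a while to provision.
state_file="$(mktemp)"
echo 0 > "$state_file"
flaky_probe() {
  local n
  n="$(cat "$state_file")"
  n=$((n + 1)); echo "$n" > "$state_file"
  [ "$n" -ge 3 ]
}

retry 5 0 flaky_probe
```

Using a fixed (or exponential) delay keeps transient provisioning hiccups from failing the whole job immediately.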
Code Improvements
- **Resource Allocation Optimization**
  - File: Pod YAML files (e.g., `pull_build_next_gen/pod.yaml`, `pull_unit_test_next_gen/pod.yaml`)
  - Line(s): Memory and CPU requests/limits
  - Issue: CPU and memory limits are inconsistently allocated across jobs (e.g., 16 CPUs for `golang` in one job vs 6 CPUs in another). This may lead to inefficiency or underutilized resources.
  - Solution: Define consistent resource profiles based on expected workloads across jobs. For example:

```yaml
resources:
  requests:
    memory: 16Gi
    cpu: "8"
  limits:
    memory: 32Gi
    cpu: "16"
```
- **Temporary Hotfix Logic**
  - File: Multiple pipeline `.groovy` files (e.g., `pull_integration_realcluster_test_next_gen/pipeline.groovy`)
  - Line(s): Stages named `Hotfix bazel deps/cache`
  - Issue: Temporary hotfix logic (e.g., disabling the remote cache) introduces technical debt and could disrupt long-term stability if not properly tracked.
  - Solution: Document the rationale for these hotfixes in comments and set up an automated cleanup task post-migration.
Best Practices
- **Documentation Needs**
  - File: All pipeline `.groovy` files
  - Issue: The `Hotfix bazel deps/cache` and similar stages lack comments explaining why the changes are necessary. This makes it harder for future maintainers to understand the context.
  - Solution: Add comments:

```groovy
// Temporary workaround for instability in legacy cache URLs during GCP migration.
// Remove this stage after migration completion.
```
- **Testing Coverage for Migration**
  - File: Multiple pipeline `.groovy` files
  - Issue: There is no evidence of specific tests for validating the GCP migration (e.g., verifying ephemeral volumes, cache handling, or resource limits).
  - Solution: Add integration tests to verify resource provisioning and cache integrity:

```groovy
stage('Validation') {
    steps {
        sh '''
        kubectl get pod -n $K8S_NAMESPACE | grep <pod-name>
        ls -l /home/jenkins/.tidb/tmp
        '''
    }
}
```
- **Consistent Naming Conventions**
  - File: `latest-presubmits-next-gen.yaml`
  - Line(s): Labels like `master: "1"`
  - Issue: Labels such as `master: "1"` are ambiguous and lack standardization.
  - Solution: Use descriptive labels, e.g., `migration-phase: "completed"`.
- **Redundant Code**
  - File: Multiple pipeline `.groovy` files
  - Line(s): Script blocks for cache removal duplicated across files
  - Issue: Duplicated scripts for cache removal increase maintenance burden.
  - Solution: Refactor into a reusable function or shared script file.
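One way to realize the shared-script idea on the shell side (a sketch — `clean_legacy_cache` is a hypothetical helper name; the demo directory is a throwaway stand-in for a repo checkout):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shared hotfix helper: strip legacy cache URLs from Bazel-related files in a
# repo directory. Intended to be sourced from every pipeline instead of
# copy-pasting the sed invocation.
clean_legacy_cache() {
  local repo_dir="$1" file
  for file in WORKSPACE DEPS.bzl Makefile; do
    if [ -f "$repo_dir/$file" ]; then
      sed -i -E '/bazel-cache[.]pingcap[.]net:8080|ats[.]apps[.]svc|cache[.]hawkingrei[.]com|mirror[.]bazel[.]build/d' "$repo_dir/$file"
    else
      echo "Warning: $repo_dir/$file not found, skipping" >&2
    fi
  done
}

# Demo on a throwaway "repo" containing one legacy URL and one line to keep.
repo="$(mktemp -d)"
printf 'url = "http://ats.apps.svc/cache"\nkeep me\n' > "$repo/WORKSPACE"
clean_legacy_cache "$repo"
cat "$repo/WORKSPACE"
```

Keeping the URL list in exactly one place means retiring the hotfix later is a single-file change.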
By addressing these issues, the PR can improve its robustness, maintainability, and adherence to best practices.
6 jobs replay pass: