Chunxiang Xu avadesian

avadesian synced commits to simple-ray at avadesian/skypilot from mirror

4 hours ago

avadesian synced commits to resource-mgmt-pool-with-job at avadesian/skypilot from mirror

  • e276148daf Merge remote-tracking branch 'origin/master' into resource-mgmt-pool-with-job
  • 713daac964 [Catalog] Revert catalog change for Trainium and Inferentia chips (#7939) Revert catalog change
  • 179cfcea00 [Test]Production large load test (#7912) * readme update * test script * support clean up * production load test * longer sky status seconds * wait for termination * robust tests * longer timeout and sql update * cancel all jobs * sky down * clean up
  • 85623d818e [Tests] Increase Nightly Build Timeouts (#7933) Increase timeout to reduce false negatives.
  • 276999004e Updated for better Trainium/Inferentia support (#7896) * updated fetch_aws script to properly add Trainium/Inferentia to acclerator name and populate GpuInfo field. Added refresh of neuron based AMI IDs based on SSM parameters * removed incorrect code comment for generating vms.csv file * removed duplicate import * removed aws from .gitignore; fixed linting errors of fetch_aws.py file; fixed check of acc_name to be prefix rather than exact match of Trainium/Inferentia * fixed pylint formatting for aws.py * Update .gitignore --------- Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>
  • Compare 259 commits »

4 hours ago

avadesian synced commits to pools-example-simple at avadesian/skypilot from mirror

4 hours ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • d1f0cdd25a Seeweb - Docker Images + Pinned Library (#7493) * seeweb-commit-review-11 * seeweb-commit-review-12 * seeweb-commit-review-13 * seeweb-commit-review-14 * Revert pytest.yml * Revert smoke-tests-trigger.yaml * Revert and fix test * Update seeweb accelerator test * Update test_sky_serve * Update test_sky_serve.py * seeweb-commit-review-15 * seeweb-commit-docs-01 * seeweb-commit-docs-02 * seeweb-commit-snapshotfix-01 * seeweb-commit-downfix-01 * seeweb-commit-docs-02 * seeweb-commit-libraryupd-01 * seeweb-commit-update_library-02 * seeweb-dockerimg-01 * fixeddependencies.py --------- Co-authored-by: Marco Cristofanilli <m4oc@users.noreply.github.com>
  • 0d92d4c909 [Pools] Fail Fast when Pools Misconfigured (#7930) * Error when pool misconfigured. * Update and format. * Don't error during num workers update. * Add check in sdk and unit test.
  • 713daac964 [Catalog] Revert catalog change for Trainium and Inferentia chips (#7939) Revert catalog change
  • Compare 3 commits »

4 hours ago

avadesian synced commits to lloyd/improve-concurrent-job-launch at avadesian/skypilot from mirror

4 hours ago

avadesian synced commits to lloyd/fix-pools-docs at avadesian/skypilot from mirror

4 hours ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • 179cfcea00 [Test]Production large load test (#7912) * readme update * test script * support clean up * production load test * longer sky status seconds * wait for termination * robust tests * longer timeout and sql update * cancel all jobs * sky down * clean up
  • 85623d818e [Tests] Increase Nightly Build Timeouts (#7933) Increase timeout to reduce false negatives.
  • 276999004e Updated for better Trainium/Inferentia support (#7896) * updated fetch_aws script to properly add Trainium/Inferentia to acclerator name and populate GpuInfo field. Added refresh of neuron based AMI IDs based on SSM parameters * removed incorrect code comment for generating vms.csv file * removed duplicate import * removed aws from .gitignore; fixed linting errors of fetch_aws.py file; fixed check of acc_name to be prefix rather than exact match of Trainium/Inferentia * fixed pylint formatting for aws.py * Update .gitignore --------- Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>
  • af58b86002 Example policy for static GPU quota (#7917) gpu quota policy
  • c376dcff4b [Dashboard] enable filtering clusters and jobs by user (#7929) enable filtering clusters and jobs by user
  • Compare 14 commits »

17 hours ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • da450a0035 [Azure] Pin azcopy version (#7909) * [Azure] Pin azcopy version * check arm64 * fix lint --------- Co-authored-by: ZePing Guo <zp0int@qq.com>

2 days ago

avadesian synced commits to improve-tunnel-open at avadesian/skypilot from mirror

2 days ago

avadesian synced commits to admin-docs-update-2 at avadesian/skypilot from mirror

2 days ago

avadesian synced commits to 2904 at avadesian/skypilot from mirror

2 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • 11c5da60e6 [Catalog] Case insensitive check for canonicalized acc and cloud catalog (#7910) * [Catalog] Case insensitive check for canonicalized acc and instance type * minor
  • 44bb11e9e8 [Deployment] Use recreate for grafana deployment to avoid PVC blocking upgrade (#7904) Use recreate for grafana deployment to avoid RW-once PVC causing upgrade failure
  • 98b9165735 [Dev] Fix RTD call additional newlines
  • 277d04dcb5 [Dev] Trigger RTD version sync before build
  • dec33a9702 [Dev] Fix RTD project name
  • Compare 8 commits »

2 days ago

avadesian synced commits to batch-inference-with-pool at avadesian/skypilot from mirror

  • 641d2dcce3 Merge branch 'master' of github.com:skypilot-org/skypilot into batch-inference-with-pool
  • 7cff73216d [RunPod] Update runpod default image (#7906)
  • 8d17955d0b [SSH Node Pools] Allow per context config and provision_timeout (#7660) * Allow per context config for SSH node pool * fix the context config retrieval * strip ssh- instead * Add tests * fix tests * fix test * use a subset * Add ports back * better comment to explain ssh schema * Remove kueue and support provision_timeout * add comment * Add unit tests * same order * format * fix unit tests * format * Update sky/utils/schemas.py * fix constant * Add more tests --------- Co-authored-by: Daniel Shin <kyuseung1016@gmail.com> Co-authored-by: Daniel Shin <88547237+kyuds@users.noreply.github.com>
  • d8034a53bb [Core] Add support for num_gpus in setup. (#7092) * Add support for num_gpus in setup. * Add smoke test for regular launch. * Fix cluster name. * Format. * Change variable. * Change pools test. * Add note in docs. * Fix fractional GPUs. * Revert to old logic for runtime. * Add multiple accelerator example. * Add new unit test for multiple resources. * Rebase. * Format. * Get num_gpus from launched resources. * Revert "Get num_gpus from launched resources." This reverts commit cb5785ffd7a7a74a1a858f11a63c63123141a097. * Fix docs. * Remove unnecessary calls. * Remove resources fit. * Fix unit test. * format * format. * Remove extra test. --------- Co-authored-by: lloydbrownjr <lloydbrown@berkeley.edu>
  • 1337c50037 example rate limiter policy (#7902) * example rate limiter policy * docs * reasonable defaults * remove debug comment
  • Compare 627 commits »

2 days ago

avadesian synced commits to add-manual-rtd-build-action at avadesian/skypilot from mirror

2 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • 7cff73216d [RunPod] Update runpod default image (#7906)
  • 8d17955d0b [SSH Node Pools] Allow per context config and provision_timeout (#7660) * Allow per context config for SSH node pool * fix the context config retrieval * strip ssh- instead * Add tests * fix tests * fix test * use a subset * Add ports back * better comment to explain ssh schema * Remove kueue and support provision_timeout * add comment * Add unit tests * same order * format * fix unit tests * format * Update sky/utils/schemas.py * fix constant * Add more tests --------- Co-authored-by: Daniel Shin <kyuseung1016@gmail.com> Co-authored-by: Daniel Shin <88547237+kyuds@users.noreply.github.com>
  • Compare 2 commits »

3 days ago

avadesian synced commits to allow-per-context-config-for-ssh at avadesian/skypilot from mirror

  • 9093c8c658 Merge branch 'master' into allow-per-context-config-for-ssh
  • d8034a53bb [Core] Add support for num_gpus in setup. (#7092) * Add support for num_gpus in setup. * Add smoke test for regular launch. * Fix cluster name. * Format. * Change variable. * Change pools test. * Add note in docs. * Fix fractional GPUs. * Revert to old logic for runtime. * Add multiple accelerator example. * Add new unit test for multiple resources. * Rebase. * Format. * Get num_gpus from launched resources. * Revert "Get num_gpus from launched resources." This reverts commit cb5785ffd7a7a74a1a858f11a63c63123141a097. * Fix docs. * Remove unnecessary calls. * Remove resources fit. * Fix unit test. * format * format. * Remove extra test. --------- Co-authored-by: lloydbrownjr <lloydbrown@berkeley.edu>
  • 1337c50037 example rate limiter policy (#7902) * example rate limiter policy * docs * reasonable defaults * remove debug comment
  • 3d58efeee2 [Core] make run ray status during refresh safe from referencing unassigned local variable (#7901) make run ray status to check ray cluster healthy safe from local variable referenced before assignment
  • 3e5c599fde [Logs] make provision logs not timeout (#7888) * make provision logs not timeout * fix indenting issue * simplify worker logic
  • Compare 145 commits »

3 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • d8034a53bb [Core] Add support for num_gpus in setup. (#7092) * Add support for num_gpus in setup. * Add smoke test for regular launch. * Fix cluster name. * Format. * Change variable. * Change pools test. * Add note in docs. * Fix fractional GPUs. * Revert to old logic for runtime. * Add multiple accelerator example. * Add new unit test for multiple resources. * Rebase. * Format. * Get num_gpus from launched resources. * Revert "Get num_gpus from launched resources." This reverts commit cb5785ffd7a7a74a1a858f11a63c63123141a097. * Fix docs. * Remove unnecessary calls. * Remove resources fit. * Fix unit test. * format * format. * Remove extra test. --------- Co-authored-by: lloydbrownjr <lloydbrown@berkeley.edu>
  • 1337c50037 example rate limiter policy (#7902) * example rate limiter policy * docs * reasonable defaults * remove debug comment
  • 3d58efeee2 [Core] make run ray status during refresh safe from referencing unassigned local variable (#7901) make run ray status to check ray cluster healthy safe from local variable referenced before assignment
  • 3e5c599fde [Logs] make provision logs not timeout (#7888) * make provision logs not timeout * fix indenting issue * simplify worker logic
  • 4fb702215a [Dashboard] Show Succeeded Pools in Dropdown (#7855) Show succeeded pools.
  • Compare 11 commits »

4 days ago

avadesian synced commits to lloyd/pools-run-message at avadesian/skypilot from mirror

4 days ago

avadesian synced commits to serve-files-stash-to-db at avadesian/skypilot from mirror

5 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • 767213dc7c [k8s] retry k8s api list namespaced pod on empty response (#7854) * add more debug logs around k8s list_namespaced_pod * label selector * [k8s] retry k8s api list namespaced pod on empty response * tests * reorder retry logic * only retry for status refresh daemon and status -r * remove unnecessary retry_if_missing * remove defaults * fix test * update retry if missing logic to only run on sky status -r * clarity * change retry if missing logic for refresh * update logic * update docstring * only kubernetes docstring * adjust logic and add comment * tighten * update test
  • d45fd6330a [Core] Ensure SSH tunnel process is terminated when cluster is torn down (#7887) * Add test_no_ssh_tunnel_process_leak_after_teardown * close skylet ssh tunnel in teardown_no_lock * use kill_children_processes instead * only update db if stop * Revert "only update db if stop" This reverts commit fe96ee18be8cee583f68356b3436a9b701e64ea2. * graceful termination * proc.pid
  • 38c27f5ea6 [Kubernetes] Check supported features for all allowed contexts (#7878) * check supported features for all contexts * add ut * check features support n parallel
  • e07d94d3b5 [Dashboard] Return resources_str_full for cluster records of the dashboard requests (#7880) return resources_str_full for cluster records of the dashboard requests
  • 75458afd67 Admin policy docs update (#7866) * init * init2 * minor correction --------- Co-authored-by: tv <tv@tvs-MacBook-Pro.attlocal.net>
  • Compare 9 commits »

5 days ago