{2025.06}[2024a] PyTorch 2.9.1 #1389

Draft
bedroge wants to merge 12 commits into EESSI:main from bedroge:pytorch291

@bedroge
Collaborator

@bedroge bedroge commented Feb 16, 2026

No description provided.

@bedroge bedroge added the 2025.06-software.eessi.io label (2025.06 version of software.eessi.io) Feb 16, 2026
@bedroge
Collaborator Author

bedroge commented Feb 16, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4
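
For readers unfamiliar with the command format, the `bot: build` line above can be read as a list of `key:value` arguments. The following is a minimal sketch of how such a line could be split into its parts; the function name `parse_bot_build_command` is hypothetical and this is not the bot's actual parser.

```python
def parse_bot_build_command(line: str) -> dict:
    """Split a 'bot: build ...' comment line into its key/value arguments.

    Illustrative only: the real bot may accept more forms than this.
    """
    prefix = "bot: build"
    if not line.startswith(prefix):
        raise ValueError(f"not a bot build command: {line!r}")
    args = {}
    for token in line[len(prefix):].split():
        # Each argument looks like key:value; split on the first colon only,
        # so a value such as 'arch=x86_64/amd/zen4' is kept whole.
        key, _, value = token.partition(":")
        args[key] = value
    return args


command = (
    "bot: build repo:eessi.io-2025.06-software "
    "instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4"
)
parsed = parse_bot_build_command(command)
print(parsed)
```

Here `repo` selects the target repository, `instance` selects which bot instance should pick up the job, and `for` names the CPU target to build for.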

@eessi-bot-aws-eu-south

eessi-bot-aws-eu-south Bot commented Feb 16, 2026

New job on instance eessi-bot-mc-aws-eu-south for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/15

date job status comment
Feb 16 16:09:44 UTC 2026 submitted job id 15 awaits release by job manager
Feb 16 16:10:36 UTC 2026 released job awaits launch by Slurm scheduler
Feb 16 16:11:40 UTC 2026 running job 15 is running
Feb 16 16:12:41 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-15.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-17712582990.tar.zst
size: 0 MiB (22 bytes)
entries: 0
modules under 2025.06/software/linux/x86_64/amd/zen4/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen4/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 16 16:12:41 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-15.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
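
The check list in the bot comment above can be read as a set of patterns searched for in the job output file. Below is a rough sketch of that idea, using the patterns exactly as they appear in the comment. The rule for which patterns must be absent or present is inferred from the ✅/❌ marks and is an assumption; the bot's real check script may work differently, and the sample log lines are made up.

```python
import re

# Patterns copied from the bot comment. The absent/present split is an
# assumption inferred from the check marks, not taken from the bot's code.
MUST_BE_ABSENT = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_BE_PRESENT = ["No missing installations", ".tar.* created!"]


def evaluate_build_log(text: str) -> dict:
    """Return {pattern: check_passed} for every pattern."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = re.search(pattern, text) is None
    for pattern in MUST_BE_PRESENT:
        results[pattern] = re.search(pattern, text) is not None
    return results


def overall_status(results: dict) -> str:
    """A build counts as SUCCESS only if every individual check passed."""
    return "SUCCESS" if all(results.values()) else "FAILURE"


# Hypothetical log content, chosen to reproduce the marks shown above.
sample_log = "\n".join([
    "ERROR: something went wrong",
    "/tmp/example.tar.zst created!",
])
checks = evaluate_build_log(sample_log)
for pattern, passed in checks.items():
    print("✅" if passed else "❌", pattern)
print(overall_status(checks))
```

With this sample input the `ERROR:` and `No missing installations` checks fail while the others pass, which mirrors the pattern of marks in the failed build above.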

@eessi-bot-aws

eessi-bot-aws Bot commented Feb 16, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/131514

date job status comment
Feb 16 16:09:44 UTC 2026 submitted job id 131514 awaits release by job manager
Feb 16 16:09:52 UTC 2026 released job awaits launch by Slurm scheduler
Feb 16 16:10:54 UTC 2026 running job 131514 is running
Feb 16 16:11:56 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-131514.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-17712582190.tar.zst
size: 0 MiB (22 bytes)
entries: 0
modules under 2025.06/software/linux/x86_64/amd/zen4/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen4/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 16 16:11:56 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen4+default
P: latency: 1.45 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen4+default
P: latency: 3.55 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen4+default
P: latency: 0.15 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen4+default
P: bandwidth: 14495.07 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-131514.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Feb 16, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4

@eessi-bot-aws

eessi-bot-aws Bot commented Feb 16, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/131515

date job status comment
Feb 16 16:36:49 UTC 2026 submitted job id 131515 awaits release by job manager
Feb 16 16:37:00 UTC 2026 released job awaits launch by Slurm scheduler
Feb 16 16:38:03 UTC 2026 running job 131515 is running
Feb 17 16:38:16 UTC 2026 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job131515.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Feb 17 16:38:16 UTC 2026 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job131515.test does not exist in job directory or reading it failed.

@bedroge
Collaborator Author

bedroge commented Feb 17, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4

@eessi-bot-aws

eessi-bot-aws Bot commented Feb 17, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/131832

date job status comment
Feb 17 18:45:52 UTC 2026 submitted job id 131832 awaits release by job manager
Feb 17 18:46:22 UTC 2026 released job awaits launch by Slurm scheduler
Feb 17 18:52:25 UTC 2026 running job 131832 is running
Feb 19 05:59:11 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-131832.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-17714803560.tar.zst
size: 5 MiB (5327720 bytes)
entries: 1120
modules under 2025.06/software/linux/x86_64/amd/zen4/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/x86_64/amd/zen4/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260217_185237UTC
tlparse/0.4.0-GCCcore-13.3.0/20260217_185339UTC
other under 2025.06/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 19 05:59:12 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen4+default
P: latency: 1.4 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen4+default
P: latency: 3.18 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen4+default
P: latency: 0.18 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen4+default
P: bandwidth: 14180.38 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-131832.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Feb 19, 2026

WARNING: 143 test failures, 0 test errors (out of 262630):
        distributed/test_c10d_functional_native (1 failed, 2 passed, 29 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        inductor/test_aot_inductor_arrayref (107 failed, 11 passed, 169 skipped, 0 errors)
        inductor/test_compile_subprocess (6 failed, 756 passed, 90 skipped, 0 errors)
        inductor/test_cpu_select_algorithm (1 failed, 89 passed, 1620 skipped, 0 errors)
        inductor/test_minifier (3 failed, 5 passed, 6 skipped, 0 errors)
        inductor/test_provenance_tracing (1 failed, 4 passed, 6 skipped, 0 errors)
        inductor/test_torchbind (5 failed, 10 passed, 1 skipped, 0 errors)
        inductor/test_torchinductor (6 failed, 820 passed, 87 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (6 failed, 611 passed, 231 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (6 failed, 756 passed, 149 skipped, 0 errors)

Errors are quite similar to the ones observed in #1314, many of these:

E       RuntimeError: Error in dlopen: /tmp/R8KYZy/cargdsatqw56h7ghmssrcrbgbyjsjff7hdhzslb7qz3dsz3pbati.wrapper/data/aotinductor/model/cargdsatqw56h7ghmssrcrbgbyjsjff7hdhzslb7qz3dsz3pbati.wrapper.so: cannot enable executable stack as shared object requires: Invalid argument
$ grep "cannot enable executable stack" /project/def-users/SHARED/build-logs/jobs/131832/easybuild-fsgm_e4f.log | wc -l
1716
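
To see where the failures concentrate, the per-suite summary at the top of this comment can be tallied. The snippet below is a small sketch that parses those summary lines (copied verbatim) and sums the failure counts; the regular expression is an assumption about the line format based only on the lines shown here.

```python
import re

# Summary lines copied from the comment above.
summary = """\
distributed/test_c10d_functional_native (1 failed, 2 passed, 29 skipped, 0 errors)
dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
inductor/test_aot_inductor_arrayref (107 failed, 11 passed, 169 skipped, 0 errors)
inductor/test_compile_subprocess (6 failed, 756 passed, 90 skipped, 0 errors)
inductor/test_cpu_select_algorithm (1 failed, 89 passed, 1620 skipped, 0 errors)
inductor/test_minifier (3 failed, 5 passed, 6 skipped, 0 errors)
inductor/test_provenance_tracing (1 failed, 4 passed, 6 skipped, 0 errors)
inductor/test_torchbind (5 failed, 10 passed, 1 skipped, 0 errors)
inductor/test_torchinductor (6 failed, 820 passed, 87 skipped, 0 errors)
inductor/test_torchinductor_codegen_dynamic_shapes (6 failed, 611 passed, 231 skipped, 0 errors)
inductor/test_torchinductor_dynamic_shapes (6 failed, 756 passed, 149 skipped, 0 errors)
"""

line_re = re.compile(
    r"^\s*(\S+) \((\d+) failed, (\d+) passed, (\d+) skipped, (\d+) errors\)$"
)

failures = {}
for line in summary.splitlines():
    match = line_re.match(line)
    if match:
        failures[match.group(1)] = int(match.group(2))

total = sum(failures.values())
worst = max(failures, key=failures.get)
inductor_total = sum(n for name, n in failures.items() if name.startswith("inductor/"))

print(f"total failures: {total}")
print(f"largest contributor: {worst} ({failures[worst]})")
print(f"failures in inductor suites: {inductor_total}")
```

The total agrees with the 143 failures reported in the warning line. A single suite, `inductor/test_aot_inductor_arrayref`, accounts for 107 of them, and all but 2 of the failures are in `inductor/` suites.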

@bedroge
Collaborator Author

bedroge commented Mar 11, 2026

An updated hooks file with a fix for PyTorch has been ingested (EESSI/software-layer-scripts#172), so let's try again.

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-aws-eu-south for:arch=x86_64/amd/zen5
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/intel/skylake_avx512
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1

@eessi-bot-aws-eu-south

eessi-bot-aws-eu-south Bot commented Mar 11, 2026

New job on instance eessi-bot-aws-eu-south for repository eessi.io-2025.06-software
Building on: amd-zen5
Building for: x86_64/amd/zen5
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/104

date job status comment
Mar 11 13:00:05 UTC 2026 submitted job id 104 awaits release by job manager
Mar 11 13:00:57 UTC 2026 released job awaits launch by Slurm scheduler
Mar 11 13:02:00 UTC 2026 running job 104 is running
Mar 12 11:02:15 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-104.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen5-17733132100.tar.zst
size: 165 MiB (173155391 bytes)
entries: 22819
modules under 2025.06/software/linux/x86_64/amd/zen5/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/x86_64/amd/zen5/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/x86_64/amd/zen5/reprod
PyTorch/2.9.1-foss-2024a/20260312_105947UTC
setuptools/80.9.0-GCCcore-13.3.0/20260311_130254UTC
tlparse/0.4.0-GCCcore-13.3.0/20260311_130336UTC
other under 2025.06/software/linux/x86_64/amd/zen5
no other files in tarball
Mar 12 11:02:15 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen5+default
P: latency: 1.24 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen5+default
P: latency: 2.84 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen5+default
P: latency: 0.15 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen5+default
P: bandwidth: 46332.3 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-104.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
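
The status tables give a quick way to work out how long a job actually ran: subtract the "running" timestamp from the "finished" one. The sketch below does this for two jobs from this thread, the zen5 build above (job 104) and the earlier zen4 build (job 131832). The helper name `elapsed` is hypothetical, and the format string assumes an English or C locale for the month abbreviations.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the bot's status tables, e.g. "Mar 11 13:02:00 UTC 2026".
# %b (month abbreviation) is locale-dependent; an English/C locale is assumed.
FMT = "%b %d %H:%M:%S UTC %Y"


def elapsed(start: str, end: str) -> timedelta:
    """Time between two status-table timestamps (both are in UTC)."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# "running" and "finished" rows copied from the status tables in this thread.
zen5_build = elapsed("Mar 11 13:02:00 UTC 2026", "Mar 12 11:02:15 UTC 2026")
zen4_build = elapsed("Feb 17 18:52:25 UTC 2026", "Feb 19 05:59:11 UTC 2026")

print("zen5 (job 104):", zen5_build)
print("zen4 (job 131832):", zen4_build)
```

The zen5 job ran for about 22 hours and the zen4 job for about 35 hours. These wall-clock times cover the whole job, including the PyTorch test suite.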

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 11, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: intel-skylake_avx512
Building for: x86_64/intel/skylake_avx512
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138591

date job status comment
Mar 11 13:00:07 UTC 2026 submitted job id 138591 awaits release by job manager
Mar 11 13:01:13 UTC 2026 released job awaits launch by Slurm scheduler
Mar 11 13:08:21 UTC 2026 running job 138591 is running
Mar 12 05:05:40 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-138591.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-skylake_avx512-17732918120.tar.zst
size: 164 MiB (172308254 bytes)
entries: 22819
modules under 2025.06/software/linux/x86_64/intel/skylake_avx512/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/x86_64/intel/skylake_avx512/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/x86_64/intel/skylake_avx512/reprod
PyTorch/2.9.1-foss-2024a/20260312_050304UTC
setuptools/80.9.0-GCCcore-13.3.0/20260311_130842UTC
tlparse/0.4.0-GCCcore-13.3.0/20260311_131000UTC
other under 2025.06/software/linux/x86_64/intel/skylake_avx512
no other files in tarball
Mar 12 05:05:40 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-skylake+default
P: latency: 1.41 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-skylake+default
P: latency: 1.67 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-skylake+default
P: latency: 0.28 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-skylake+default
P: bandwidth: 10855.78 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138591.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 11, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138592

date job status comment
Mar 11 13:00:13 UTC 2026 submitted job id 138592 awaits release by job manager
Mar 11 13:01:11 UTC 2026 released job awaits launch by Slurm scheduler
Mar 11 13:06:16 UTC 2026 running job 138592 is running
Mar 11 13:28:10 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138592.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17732356160.tar.zst
size: 4 MiB (5237091 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260311_130702UTC
tlparse/0.4.0-GCCcore-13.3.0/20260311_130801UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 11 13:28:10 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.59 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.47 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.26 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 21953.88 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138592.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 11, 2026

The neoverse v1 build ran out of memory:

virtual memory exhausted: Cannot allocate memory
ninja: build stopped: subcommand failed.

@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

I've modified bot/build.sh for now, so we can easily test changes in the hooks file. I guess we may need more...

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138897

date job status comment
Mar 12 15:51:59 UTC 2026 submitted job id 138897 awaits release by job manager
Mar 12 15:52:11 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 15:53:14 UTC 2026 running job 138897 is running
Mar 12 16:11:43 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138897.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17733318390.tar.zst
size: 4 MiB (5236098 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_155255UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_155350UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 12 16:11:43 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.75 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.35 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.26 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 28395.84 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138897.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge bedroge marked this pull request as draft March 12, 2026 15:52
@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented Mar 12, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.03/pr_1389/14557869

date job status comment
Mar 12 15:52:21 UTC 2026 submitted job id 14557869 awaits release by job manager
Mar 12 15:52:59 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 15:54:04 UTC 2026 running job 14557869 is running
Mar 12 23:01:17 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14557869.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17733559290.tar.gz
size: 5 MiB (6220484 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_155533UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_155902UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Mar 12 23:01:17 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /6d7a17a9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /e9b09ad8 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /a102bba0 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /d58e51e9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ OK ] ( 5/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 2.51 us (r:0, l:None, u:None)
[ OK ] ( 6/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /0c56f933 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 3.42 us (r:0, l:None, u:None)
[ OK ] ( 7/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 6.14 us (r:0, l:None, u:None)
[ OK ] ( 8/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /ca426177 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 11.14 us (r:0, l:None, u:None)
[ OK ] ( 9/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (10/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /af5b485c @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.31 us (r:0, l:None, u:None)
[ OK ] (11/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18411.29 MB/s (r:0, l:None, u:None)
[ OK ] (12/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /ebc0c2c2 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18768.48 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 8/12 test case(s) from 12 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-14557869.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/generic
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_n1

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: generic
Building for: aarch64/generic
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138898

date job status comment
Mar 12 19:20:11 UTC 2026 submitted job id 138898 awaits release by job manager
Mar 12 19:21:02 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 19:26:07 UTC 2026 running job 138898 is running
Mar 12 19:57:25 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138898.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-generic-17733453230.tar.zst
size: 4 MiB (5223731 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/generic/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/generic/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/generic/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_192649UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_192804UTC
other under 2025.06/software/linux/aarch64/generic
no other files in tarball
Mar 12 19:57:25 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-generic+default
P: latency: 1.97 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-generic+default
P: latency: 5.49 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-generic+default
P: latency: 0.29 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-generic+default
P: bandwidth: 15264.28 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138898.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_n1
Building for: aarch64/neoverse_n1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138899

date job status comment
Mar 12 19:20:17 UTC 2026 submitted job id 138899 awaits release by job manager
Mar 12 19:21:04 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 19:26:09 UTC 2026 running job 138899 is running
Mar 12 19:56:23 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138899.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_n1-17733452760.tar.zst
size: 5 MiB (5243490 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_n1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_n1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_n1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_192646UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_192759UTC
other under 2025.06/software/linux/aarch64/neoverse_n1
no other files in tarball
Mar 12 19:56:23 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 1.98 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 6.3 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 0.28 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_n1+default
P: bandwidth: 16378.5 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138899.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1
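
The build command above is a list of `key:value` tokens after a `bot: build` prefix. As a rough illustration of that shape only, here is a minimal sketch of a parser; `parse_bot_command` is a hypothetical helper written for this note, not the bot's actual implementation.

```python
def parse_bot_command(line):
    """Split a 'bot: build key:value ...' comment line into a dict.

    Hypothetical helper: it only mirrors the command shape seen in this
    thread and is not taken from the bot's source code.
    """
    prefix = "bot: build"
    if not line.startswith(prefix):
        raise ValueError(f"not a build command: {line!r}")
    args = {}
    for token in line[len(prefix):].split():
        # Split on the first ':' only, so 'for:arch=aarch64/neoverse_v1'
        # becomes key 'for' and value 'arch=aarch64/neoverse_v1'.
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"token without ':' separator: {token!r}")
        args[key] = value
    return args


command = (
    "bot: build repo:eessi.io-2025.06-software "
    "instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1"
)
parsed = parse_bot_command(command)
print(parsed["instance"], parsed["for"])
```

Splitting on the first colon keeps values such as `arch=aarch64/neoverse_v1` intact.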

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138900

date | job status | comment
Mar 12 22:13:03 UTC 2026 | submitted | job id 138900 awaits release by job manager
Mar 12 22:13:39 UTC 2026 | released | job awaits launch by Slurm scheduler
Mar 12 22:18:42 UTC 2026 | running | job 138900 is running
Mar 13 08:07:11 UTC 2026 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138900.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17733891460.tar.zst
size: 4 MiB (5236034 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_221833UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_221927UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 13 08:07:11 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.59 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.41 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 26832.18 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138900.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 13, 2026

No more memory issues for the neoverse v1 build, but too many failing tests:

WARNING: 73 test failures, 0 test errors (out of 261937):
Failed tests (suites/files):
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
        inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)
        inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)
        inductor/test_fused_attention (2 failed, 45 passed, 1 skipped, 0 errors)
        test_decomp (2 failed, 8280 passed, 738 skipped, 0 errors)
        test_linalg (3 failed, 1124 passed, 118 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)
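
To see which suites dominate a summary like the one above, the per-suite lines can be parsed and totalled. The following is a minimal sketch that assumes the line format shown in this comment; `parse_failed_suites` is a hypothetical helper and is not part of EasyBuild.

```python
import re

# Matches lines such as
# "inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)".
SUITE_RE = re.compile(
    r"^\s*(?P<suite>\S+) \((?P<failed>\d+) failed, (?P<passed>\d+) passed, "
    r"(?P<skipped>\d+) skipped, (?P<errors>\d+) errors\)\s*$"
)


def parse_failed_suites(summary):
    """Return {suite: failed_count} for every per-suite line in the summary."""
    failed = {}
    for line in summary.splitlines():
        match = SUITE_RE.match(line)
        if match:
            failed[match["suite"]] = int(match["failed"])
    return failed


SUMMARY = """\
WARNING: 73 test failures, 0 test errors (out of 261937):
Failed tests (suites/files):
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
        inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)
        inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)
        inductor/test_fused_attention (2 failed, 45 passed, 1 skipped, 0 errors)
        test_decomp (2 failed, 8280 passed, 738 skipped, 0 errors)
        test_linalg (3 failed, 1124 passed, 118 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)
"""

failed = parse_failed_suites(SUMMARY)
total = sum(failed.values())
worst = max(failed, key=failed.get)
print(f"{total} failures in {len(failed)} suites; most in {worst}")
```

The per-suite counts add up to the 73 failures reported in the warning line, and 58 of them come from `inductor/test_cpu_select_algorithm`.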

@ocaisa
Member

ocaisa commented Mar 14, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws Bot commented Mar 14, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/139504

date | job status | comment
Mar 14 07:25:04 UTC 2026 | submitted | job id 139504 awaits release by job manager
Mar 14 07:25:11 UTC 2026 | released | job awaits launch by Slurm scheduler
Mar 14 07:31:13 UTC 2026 | running | job 139504 is running
Mar 14 16:34:41 UTC 2026 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-139504.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17735059990.tar.zst
size: 134 MiB (140528916 bytes)
entries: 22902
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
PyTorch/2.9.1-foss-2024a/20260314_163250UTC
setuptools/80.9.0-GCCcore-13.3.0/20260314_073106UTC
tlparse/0.4.0-GCCcore-13.3.0/20260314_073202UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 14 16:34:41 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.63 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.39 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.26 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 29034.94 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-139504.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented Mar 14, 2026

@bedroge Looks like we have a winner

@bedroge
Collaborator Author

bedroge commented Mar 14, 2026

> @bedroge Looks like we have a winner

Awesome, thanks a lot @Flamefire!

@bedroge
Collaborator Author

bedroge commented Mar 14, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented Mar 14, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.03/pr_1389/14562516

date | job status | comment
Mar 14 18:42:37 UTC 2026 | submitted | job id 14562516 awaits release by job manager
Mar 14 18:43:35 UTC 2026 | released | job awaits launch by Slurm scheduler
Mar 14 18:44:39 UTC 2026 | running | job 14562516 is running
Mar 15 01:27:47 UTC 2026 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-14562516.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17735377580.tar.gz
size: 158 MiB (165988645 bytes)
entries: 22902
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
PyTorch/2.9.1-foss-2024a/20260315_012058UTC
setuptools/80.9.0-GCCcore-13.3.0/20260314_184556UTC
tlparse/0.4.0-GCCcore-13.3.0/20260314_184956UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Mar 15 01:27:47 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /6d7a17a9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /e9b09ad8 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /a102bba0 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /d58e51e9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ OK ] ( 5/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 2.54 us (r:0, l:None, u:None)
[ OK ] ( 6/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /0c56f933 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 3.43 us (r:0, l:None, u:None)
[ OK ] ( 7/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 6.09 us (r:0, l:None, u:None)
[ OK ] ( 8/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /ca426177 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 11.1 us (r:0, l:None, u:None)
[ OK ] ( 9/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (10/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /af5b485c @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.31 us (r:0, l:None, u:None)
[ OK ] (11/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18598.69 MB/s (r:0, l:None, u:None)
[ OK ] (12/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /ebc0c2c2 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18653.33 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 8/12 test case(s) from 12 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-14562516.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-deucalion

eessi-bot-deucalion Bot commented Mar 14, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.03/pr_1389/1034652

date | job status | comment
Mar 14 18:42:38 UTC 2026 | submitted | job id 1034652 awaits release by job manager
Mar 14 18:43:32 UTC 2026 | released | job awaits launch by Slurm scheduler
Mar 14 18:44:35 UTC 2026 | running | job 1034652 is running
Mar 14 20:23:50 UTC 2026 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-1034652.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17735195040.tar.zst
size: 5 MiB (5466852 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/a64fx/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260314_184846UTC
tlparse/0.4.0-GCCcore-13.3.0/20260314_185235UTC
other under 2025.06/software/linux/aarch64/a64fx
no other files in tarball
Mar 14 20:23:50 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.89 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 7784.23 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-1034652.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 15, 2026

The a64fx build also ran out of memory; trying again with an updated hooks file...

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
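
The thread does not show what changed in the hooks file. One common way to avoid running out of memory during a build is to cap the number of parallel jobs by the memory available per node. The sketch below illustrates that idea only: `limit_parallelism` is a hypothetical helper, the 48 requested jobs and the 4096 MiB per job are made-up values, and only the 30720 MiB figure comes from the ReFrame skip messages in this thread.

```python
def limit_parallelism(requested_jobs, available_mib, mib_per_job):
    """Cap the number of parallel build jobs by the available memory.

    Hypothetical helper for illustration; it is not the hook used in this PR.
    """
    if mib_per_job <= 0:
        raise ValueError("mib_per_job must be positive")
    # Number of jobs that fit in memory, rounded down.
    jobs_that_fit = available_mib // mib_per_job
    # Always allow at least one job, never more than requested.
    return max(1, min(requested_jobs, jobs_that_fit))


# 30720 MiB is the per-node memory reported for the a64fx partition;
# the other two numbers are assumptions chosen for the example.
jobs = limit_parallelism(requested_jobs=48, available_mib=30720, mib_per_job=4096)
print(jobs)
```

With these example numbers the build would be limited to 7 parallel jobs instead of 48.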

@eessi-bot-deucalion

eessi-bot-deucalion Bot commented Mar 15, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.03/pr_1389/1034945

date | job status | comment
Mar 15 07:46:10 UTC 2026 | submitted | job id 1034945 awaits release by job manager
Mar 15 07:46:57 UTC 2026 | released | job awaits launch by Slurm scheduler
Mar 15 07:48:00 UTC 2026 | running | job 1034945 is running
Mar 17 04:03:55 UTC 2026 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-1034945.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17737198570.tar.zst
size: 5 MiB (5462507 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/a64fx/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260315_075210UTC
tlparse/0.4.0-GCCcore-13.3.0/20260315_075550UTC
other under 2025.06/software/linux/aarch64/a64fx
no other files in tarball
Mar 17 04:03:55 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.88 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 8049.99 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-1034945.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented Mar 16, 2026

@bedroge I suspect part of the memory problem is related to easybuilders/easybuild-easyblocks#4096, and we almost certainly also want easybuilders/easybuild-easyconfigs#21309 for (some?) ARM CPUs.

@migueldiascosta
Contributor

migueldiascosta commented Mar 16, 2026

@bedroge
Collaborator Author

bedroge commented Mar 17, 2026

The build without ACL (#1389 (comment)) failed because of:

== 2026-03-17 03:55:37,772 build_log.py:454 WARNING 19 test failures, 0 test errors (out of 256500):
Failed tests (suites/files):
        distributed/tensor/test_convolution_ops (1 failed, 0 passed, 2 skipped, 0 errors)
        inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
        inductor/test_compile_subprocess (2 failed, 344 passed, 61 skipped, 0 errors)
        inductor/test_torchinductor (2 failed, 385 passed, 52 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (2 failed, 263 passed, 108 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (2 failed, 347 passed, 120 skipped, 0 errors)
        test_linalg (8 failed, 1116 passed, 121 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)

@Flamefire
Contributor

> The build without ACL (#1389 (comment)) failed because of:
>
> == 2026-03-17 03:55:37,772 build_log.py:454 WARNING 19 test failures, 0 test errors (out of 256500):
> Failed tests (suites/files):
>         distributed/tensor/test_convolution_ops (1 failed, 0 passed, 2 skipped, 0 errors)
>         inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
>         inductor/test_compile_subprocess (2 failed, 344 passed, 61 skipped, 0 errors)
>         inductor/test_torchinductor (2 failed, 385 passed, 52 skipped, 0 errors)
>         inductor/test_torchinductor_codegen_dynamic_shapes (2 failed, 263 passed, 108 skipped, 0 errors)
>         inductor/test_torchinductor_dynamic_shapes (2 failed, 347 passed, 120 skipped, 0 errors)
>         test_linalg (8 failed, 1116 passed, 121 skipped, 0 errors)
>         test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)

I had ignored the last one because of a weird timing issue: the test starts n processes serially, each doing a sleep, and asserts that the elapsed time is at least n*sleeptime. It fails with not (7.4>=4*5), which I can't explain. There is a skip for Python >= 3.13.8.
I should have fixed test_binary_folding (negligible accuracy difference).
The failures in test_linalg have increased for some reason; the others are new.
The 2 failing tests in inductor/test_torchinductor* are likely all the same test, so the same issue.

I can take a look at the log again or just increase the allowed failures to 20.
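
The kind of timing check described for test_multiprocessing_spawn can be sketched as below. This is not the PyTorch test itself: it uses `subprocess` rather than whatever the real test uses, `run_serial_sleepers` is a hypothetical name, and the sleep time is shortened so the sketch runs quickly.

```python
import subprocess
import sys
import time


def run_serial_sleepers(n, sleep_s):
    """Run n child processes one after another; each sleeps for sleep_s seconds.

    Returns the total wall-clock time measured by the parent process.
    """
    start = time.monotonic()
    for _ in range(n):
        # subprocess.run() blocks until the child exits, so the children
        # really do run serially.
        subprocess.run(
            [sys.executable, "-c", f"import time; time.sleep({sleep_s})"],
            check=True,
        )
    return time.monotonic() - start


n, sleep_s = 4, 0.05
elapsed = run_serial_sleepers(n, sleep_s)
# Same shape as the assertion quoted in the comment: elapsed >= n * sleeptime.
assert elapsed >= n * sleep_s, f"not ({elapsed:.1f}>={n}*{sleep_s})"
```

When the children really run one after another and each sleeps for the full duration, the assertion holds, which is why an elapsed time of 7.4 s against an expected 4*5 s is surprising.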

@boegel
Contributor

boegel commented Mar 17, 2026

> > The build without ACL (#1389 (comment)) failed because of:
> >
> > == 2026-03-17 03:55:37,772 build_log.py:454 WARNING 19 test failures, 0 test errors (out of 256500):
> > Failed tests (suites/files):
> >         distributed/tensor/test_convolution_ops (1 failed, 0 passed, 2 skipped, 0 errors)
> >         inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
> >         inductor/test_compile_subprocess (2 failed, 344 passed, 61 skipped, 0 errors)
> >         inductor/test_torchinductor (2 failed, 385 passed, 52 skipped, 0 errors)
> >         inductor/test_torchinductor_codegen_dynamic_shapes (2 failed, 263 passed, 108 skipped, 0 errors)
> >         inductor/test_torchinductor_dynamic_shapes (2 failed, 347 passed, 120 skipped, 0 errors)
> >         test_linalg (8 failed, 1116 passed, 121 skipped, 0 errors)
> >         test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)
>
> I had ignored the last one because of a weird timing issue: the test starts n processes serially, each doing a sleep, and asserts that the elapsed time is at least n*sleeptime. It fails with not (7.4>=4*5), which I can't explain. There is a skip for Python >= 3.13.8.
> I should have fixed test_binary_folding (negligible accuracy difference).
> The failures in test_linalg have increased for some reason; the others are new.
> The 2 failing tests in inductor/test_torchinductor* are likely all the same test, so the same issue.
>
> I can take a look at the log again or just increase the allowed failures to 20.

Generally speaking I would say that this is a very impressive result for the test suite on A64FX...

We should probably take a closer look at the test_linalg failures, but the rest doesn't seem to be a blocker, I would say...

@migueldiascosta
Contributor

@ocaisa

Confirmed that easybuilders/easybuild-easyblocks#4096 is enough: on a64fx with ACL as a dependency, training on CIFAR-100 (the benchmark where we originally found that ACL made a big difference) is ~4.75x faster than without the ACL dependency.

@Flamefire
Contributor

> on a64fx with ACL as a dependency, training on CIFAR-100 (the benchmark where we originally found that ACL made a big difference) is ~4.75x faster than without the ACL dependency

Then we should add it as an architecture-specific dependency.
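
As a rough illustration of what an architecture-specific dependency means here, the sketch below adds ArmComputeLibrary only when building for aarch64. It is plain Python and deliberately does not use EasyBuild's easyconfig syntax; `dependencies_for` and the empty base list are hypothetical, and only the ArmComputeLibrary version (25.02) is taken from the build artefacts in this thread.

```python
import platform


def dependencies_for(cpu_arch, base_deps):
    """Return the dependency list for cpu_arch, adding ACL only on aarch64.

    Hypothetical helper for illustration; not EasyBuild easyconfig syntax.
    """
    deps = list(base_deps)  # copy, so the caller's list is not modified
    if cpu_arch == "aarch64":
        deps.append(("ArmComputeLibrary", "25.02"))
    return deps


# The base list is left empty here; a real easyconfig lists many more.
arm_deps = dependencies_for("aarch64", [])
x86_deps = dependencies_for("x86_64", [])
print(platform.machine(), dependencies_for(platform.machine(), []))
```

Whether ACL should be enabled for every aarch64 target or only for some of them is still an open question in this thread.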

@bedroge
Collaborator Author

bedroge commented Apr 10, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-deucalion

eessi-bot-deucalion Bot commented Apr 10, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.04/pr_1389/1138566

date | job status | comment
Apr 10 18:27:22 UTC 2026 | submitted | job id 1138566 awaits release by job manager
Apr 10 18:27:38 UTC 2026 | released | job awaits launch by Slurm scheduler
Apr 10 18:28:41 UTC 2026 | running | job 1138566 is running
Apr 12 15:36:30 UTC 2026 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-1138566.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17760077340.tar.zst
size: 21 MiB (22864572 bytes)
entries: 3719
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
ArmComputeLibrary/25.02-GCCcore-13.3.0.lua
SCons/4.9.0-GCCcore-13.3.0.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/a64fx/software
ArmComputeLibrary/25.02-GCCcore-13.3.0
SCons/4.9.0-GCCcore-13.3.0
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
ArmComputeLibrary/25.02-GCCcore-13.3.0/20260410_185720UTC
SCons/4.9.0-GCCcore-13.3.0/20260410_183326UTC
setuptools/80.9.0-GCCcore-13.3.0/20260410_185844UTC
tlparse/0.4.0-GCCcore-13.3.0/20260410_190316UTC
other under 2025.06/software/linux/aarch64/a64fx
no other files in tarball
Apr 12 15:36:30 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/5) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/5) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/5) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.88 us (r:0, l:None, u:None)
[ OK ] (4/5) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 8069.82 MB/s (r:0, l:None, u:None)
[ OK ] (5/5) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/22Jul2025-foss-2024a-kokkos %scale=1_node /ade8cad7 @BotBuildTests:a64fx+default
P: perf: 12.856 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 3/5 test case(s) from 5 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-1138566.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-jsc

eessi-bot-jsc Bot commented Apr 10, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.04/pr_1389/14638241

date | job status | comment
Apr 10 18:27:22 UTC 2026 | submitted | job id 14638241 awaits release by job manager
Apr 10 18:28:18 UTC 2026 | released | job awaits launch by Slurm scheduler
Apr 10 18:29:22 UTC 2026 | running | job 14638241 is running
Apr 11 01:14:44 UTC 2026 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-14638241.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17758696580.tar.gz
size: 182 MiB (190970986 bytes)
entries: 25501
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
ArmComputeLibrary/25.02-GCCcore-13.3.0.lua
PyTorch/2.9.1-foss-2024a.lua
SCons/4.9.0-GCCcore-13.3.0.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
ArmComputeLibrary/25.02-GCCcore-13.3.0
PyTorch/2.9.1-foss-2024a
SCons/4.9.0-GCCcore-13.3.0
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
ArmComputeLibrary/25.02-GCCcore-13.3.0/20260410_183500UTC
PyTorch/2.9.1-foss-2024a/20260411_010545UTC
SCons/4.9.0-GCCcore-13.3.0/20260410_183048UTC
setuptools/80.9.0-GCCcore-13.3.0/20260410_183514UTC
tlparse/0.4.0-GCCcore-13.3.0/20260410_183653UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Apr 11 01:14:44 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/13) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /6d7a17a9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/13) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /e9b09ad8 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/13) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /a102bba0 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/13) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /d58e51e9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ OK ] ( 5/13) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/22Jul2025-foss-2024a-kokkos %scale=1_node /ade8cad7 @BotBuildTests:aarch64-nvidia-grace+default
P: perf: 1548.616 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 6/13) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 2.56 us (r:0, l:None, u:None)
[ OK ] ( 7/13) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /0c56f933 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 3.37 us (r:0, l:None, u:None)
[ OK ] ( 8/13) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 6.22 us (r:0, l:None, u:None)
[ OK ] ( 9/13) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /ca426177 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 11.21 us (r:0, l:None, u:None)
[ OK ] (10/13) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (11/13) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /af5b485c @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.33 us (r:0, l:None, u:None)
[ OK ] (12/13) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18781.09 MB/s (r:0, l:None, u:None)
[ OK ] (13/13) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /ebc0c2c2 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18710.67 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 9/13 test case(s) from 13 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-14638241.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
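The counts in the ReFrame summary line can be pulled out programmatically, for example to compare runs across architectures. The following is a minimal sketch, assuming the summary line always has the shape shown in the bot comment above; the function name `parse_reframe_summary` and the regular expression are illustrative and are not part of ReFrame or the EESSI bot.

```python
import re

# Assumed shape of the final ReFrame summary line, copied from the bot comment.
SUMMARY_RE = re.compile(
    r"\[\s*(?P<status>PASSED|FAILED)\s*\]\s+"
    r"Ran (?P<ran>\d+)/(?P<total>\d+) test case\(s\) "
    r"from (?P<checks>\d+) check\(s\) "
    r"\((?P<failures>\d+) failure\(s\), "
    r"(?P<skipped>\d+) skipped, (?P<aborted>\d+) aborted\)"
)


def parse_reframe_summary(line):
    """Return the counts from a summary line as a dict, or None if it does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    # Every group except the status is a number.
    result = {
        key: int(value)
        for key, value in match.groupdict().items()
        if key != "status"
    }
    result["status"] = match.group("status")
    return result


passed = parse_reframe_summary(
    "[ PASSED ] Ran 9/13 test case(s) from 13 check(s) "
    "(0 failure(s), 4 skipped, 0 aborted)"
)
failed = parse_reframe_summary(
    "[ FAILED ] Ran 17/29 test case(s) from 29 check(s) "
    "(4 failure(s), 12 skipped, 0 aborted)"
)
print(passed)
print(failed)
```

In both summary lines quoted in this thread, the number of test cases that ran plus the number skipped equals the total, which is a quick sanity check on a parsed line.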

@bedroge
Collaborator Author

bedroge commented Jun 3, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace
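Each `bot: build` line above is a space-separated list of `key:value` arguments. A minimal sketch of splitting such a line into its parts follows; `parse_bot_build_command` is a hypothetical helper written for illustration and is not the EESSI bot's own parser, which may accept other forms.

```python
def parse_bot_build_command(line):
    """Split a 'bot: build key:value ...' line into a dict of its arguments."""
    prefix = "bot: build"
    if not line.startswith(prefix):
        raise ValueError(f"not a bot build command: {line!r}")
    args = {}
    for token in line[len(prefix):].split():
        # Split on the first colon only, so values may themselves contain ':' or '='.
        key, _, value = token.partition(":")
        args[key] = value
    return args


command = parse_bot_build_command(
    "bot: build repo:eessi.io-2025.06-software "
    "instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace"
)
print(command["repo"])      # eessi.io-2025.06-software
print(command["instance"])  # eessi-bot-jsc
print(command["for"])       # arch=aarch64/nvidia/grace
```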

@eessi-bot-deucalion

eessi-bot-deucalion Bot commented Jun 3, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.06/pr_1389/1389610

| date | job status | comment |
| --- | --- | --- |
| Jun 03 13:09:47 UTC 2026 | submitted | job id 1389610 awaits release by job manager |
| Jun 03 13:10:25 UTC 2026 | released | job awaits launch by Slurm scheduler |
| Jun 03 13:11:29 UTC 2026 | running | job 1389610 is running |

@eessi-bot-jsc

eessi-bot-jsc Bot commented Jun 3, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_1389/14855677

| date | job status | comment |
| --- | --- | --- |
| Jun 03 13:09:47 UTC 2026 | submitted | job id 14855677 awaits release by job manager |
| Jun 03 13:10:34 UTC 2026 | released | job awaits launch by Slurm scheduler |
| Jun 03 13:11:40 UTC 2026 | running | job 14855677 is running |
| Jun 04 02:07:19 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14855677.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17805376410.tar.gz
size: 29 MiB (31123580 bytes)
entries: 3719
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
ArmComputeLibrary/25.02-GCCcore-13.3.0.lua
SCons/4.9.0-GCCcore-13.3.0.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
ArmComputeLibrary/25.02-GCCcore-13.3.0
SCons/4.9.0-GCCcore-13.3.0
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
ArmComputeLibrary/25.02-GCCcore-13.3.0/20260603_132736UTC
SCons/4.9.0-GCCcore-13.3.0/20260603_131833UTC
setuptools/80.9.0-GCCcore-13.3.0/20260603_132752UTC
tlparse/0.4.0-GCCcore-13.3.0/20260603_133149UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Jun 04 02:07:19 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14855677.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case
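The ✅/❌ build checklist in the comment above reads as a set of pattern checks against the job output: some messages must be absent, others must be present. The following is a rough sketch of that idea, assuming the patterns are applied as regular expressions to the whole log. The pattern list is transcribed from the checklist, while `evaluate_checks` and the sample log text are invented for illustration and are not the bot's implementation.

```python
import re

# (pattern, should_be_present) pairs transcribed from the build checklist.
BUILD_CHECKS = [
    (r"FATAL:", False),
    (r"ERROR:", False),
    (r"FAILED:", False),
    (r"required modules missing:", False),
    (r"No missing installations", True),
    (r"\.tar\..* created!", True),
]


def evaluate_checks(log_text, checks=BUILD_CHECKS):
    """Return (pattern, found, ok) for each check; ok means found matches the expectation."""
    results = []
    for pattern, should_be_present in checks:
        found = re.search(pattern, log_text) is not None
        results.append((pattern, found, found == should_be_present))
    return results


# Made-up log excerpt, only to exercise the function.
sample_log = "ERROR: build step failed\n/tmp/example.tar.gz created!\n"
results = evaluate_checks(sample_log)
for pattern, found, ok in results:
    mark = "ok  " if ok else "FAIL"
    print(f"{mark} {'found' if found else 'no'} message matching {pattern}")
overall_success = all(ok for _, _, ok in results)
print("SUCCESS" if overall_success else "FAILURE")
```

Treating the result as a failure when any single check fails is consistent with the comment above, where the tarball was created but the job is still reported as a failure.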

@bedroge
Collaborator Author

bedroge commented Jun 5, 2026

Using the updated easyconfig from easybuilders/easybuild-easyconfigs#25726, the NVIDIA Grace build now fails in the test step due to too many failing tests:

== FAILED: Installation ended unsuccessfully: An error was raised during test 
step: Failing because not all failed tests could be determined.Tests failed to 
start, crashed or the test accounting in the PyTorch EasyBlock needs updating!
Missing (3): nn/test_convolution, test_dataloader, test_quantization
You can check the test failures (in the log) manually and if they are harmless, 
use --ignore-test-failure to make the test step pass.
59 test failures, 0 test errors (out of 261432):
Failed tests (suites/files):
        inductor/test_cpu_select_algorithm (54 failed, 36 passed, 1620 skipped, 
0 errors)
        profiler/test_memory_profiler (1 failed, 32 passed, 0 skipped, 0 errors)
        test_linalg (2 failed, 1122 passed, 121 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)
        test_scatter_gather_ops (1 failed, 83 passed, 0 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        nn/test_convolution (Undetected or did not run properly)
        test_dataloader (Undetected or did not run properly)
        test_quantization (Undetected or did not run properly) (took 12 hours 10
mins 39 secs)
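As a cross-check on the log above, the per-suite failure counts can be summed and compared with the reported total of 59. This is a minimal sketch that assumes the suite lines keep the `name (N failed, M passed, K skipped, E errors)` shape shown in the excerpt; the regular expressions and the `tally_failures` helper are illustrative and are not part of the PyTorch easyblock.

```python
import re

# Excerpt copied from the log above, including the hard-wrapped first suite line.
LOG_EXCERPT = """\
59 test failures, 0 test errors (out of 261432):
Failed tests (suites/files):
        inductor/test_cpu_select_algorithm (54 failed, 36 passed, 1620 skipped,
0 errors)
        profiler/test_memory_profiler (1 failed, 32 passed, 0 skipped, 0 errors)
        test_linalg (2 failed, 1122 passed, 121 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)
        test_scatter_gather_ops (1 failed, 83 passed, 0 skipped, 0 errors)
"""

SUITE_RE = re.compile(
    r"(\S+) \((\d+) failed, (\d+) passed, (\d+) skipped, (\d+) errors\)"
)
TOTAL_RE = re.compile(r"(\d+) test failures, (\d+) test errors")


def tally_failures(log_text):
    """Return ({suite: failed_count}, reported_total) parsed from the log text."""
    # Collapse all whitespace so that hard-wrapped lines are rejoined.
    flat = " ".join(log_text.split())
    per_suite = {m.group(1): int(m.group(2)) for m in SUITE_RE.finditer(flat)}
    total = TOTAL_RE.search(flat)
    reported = int(total.group(1)) if total else None
    return per_suite, reported


per_suite, reported = tally_failures(LOG_EXCERPT)
print(per_suite)
print(sum(per_suite.values()), reported)
```

The five listed suites account for all 59 reported failures, so the three suites under "Could not count failed tests" (nn/test_convolution, test_dataloader, test_quantization) contribute nothing to that total.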

Labels

2025.06-software.eessi.io 2025.06 version of software.eessi.io

5 participants