
Pytest segmentation fault when PyORBIT is compiled with MPI  #126

Description

@woodtp

Running pytest against a build configured with -DUSE_MPI=true results in a segmentation fault.
Example stack trace:

Current thread's C stack trace (most recent call first):
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _Py_DumpStack+0x48 [0x1009c3a54]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at faulthandler_dump_c_stack+0x58 [0x1009d8914]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at faulthandler_fatal_error+0x17c [0x1009d8854]
  Binary file "/usr/lib/system/libsystem_platform.dylib", at _sigtramp+0x38 [0x18555b744]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/lib/libopen-pal.80.dylib", at opal_argv_join+0x40 [0x1028d6830]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/lib/libmpi.40.dylib", at ompi_rte_init+0xb10 [0x103aeefd0]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/lib/libmpi.40.dylib", at ompi_mpi_instance_init+0x1a8 [0x103af2094]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/lib/libmpi.40.dylib", at ompi_mpi_init+0xd0 [0x103aebb58]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/lib/libmpi.40.dylib", at MPI_Init+0x84 [0x103b59f44]
  Binary file "/Users//opt/pyorbit3/build/cp314/src/libcore.dylib", at _Z14ORBIT_MPI_Initv+0xfc [0x102ea8070]
  Binary file "/Users//opt/pyorbit3/build/cp314/src/libcore.dylib", at initorbit_mpi+0x18 [0x102ea41dc]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _PyImport_RunModInitFunc+0x68 [0x1009740e4]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at import_run_extension+0x6c [0x100971180]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _imp_create_dynamic+0x338 [0x100972fb0]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _TAIL_CALL_CALL_FUNCTION_EX+0x100 [0x1008ffe34]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _PyFunction_Vectorcall+0x2f0 [0x100780520]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _PyObject_VectorcallTstate.832+0x50 [0x10077dd54]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at object_vacall+0xe0 [0x100782c78]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at PyObject_CallMethodObjArgs+0x84 [0x100782ad0]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at import_find_and_load+0x22c [0x1009709d8]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at PyImport_ImportModuleLevelObject+0xc64 [0x1009704b4]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at builtin___import__+0x104 [0x1008f5db8]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _TAIL_CALL_CALL_FUNCTION_EX+0x100 [0x1008ffe34]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _PyFunction_Vectorcall+0x2f0 [0x100780520]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _PyObject_VectorcallTstate.832+0x50 [0x10077dd54]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at object_vacall+0xe0 [0x100782c78]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at PyObject_CallMethodObjArgs+0x84 [0x100782ad0]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at PyImport_ImportModuleLevelObject+0x4b4 [0x10096fd04]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _TAIL_CALL_IMPORT_NAME+0x3c [0x100909cb0]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at _PyEval_Vector+0x2c8 [0x1008fd584]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at PyEval_EvalCode+0xb0 [0x1008fd1fc]
  Binary file "/opt/homebrew/Caskroom/miniforge/base/envs/pyorbit/bin/python3.14", at builtin_exec+0x16c [0x1008f7a2c]
  <truncated rest of calls>

I recompiled with -fsanitize=address,undefined and isolated the problem to MPI initialization (ORBIT_MPI_Init):

PyObject* sys_module = PyImport_ImportModule("sys");
PyObject* argv_list = PyObject_GetAttrString(sys_module, "argv");
// Check if argv_list is a list
if (PyList_Check(argv_list)) {
    // Access individual command-line arguments
    len = PyList_Size(argv_list);
    ch = (char**) malloc(sizeof(char*) * len);
    for (Py_ssize_t i = 0; i < len; ++i) {
        PyObject* item = PyList_GetItem(argv_list, i);
        if (item && PyUnicode_Check(item)) {
            ch[i] = const_cast<char*>(PyUnicode_AsUTF8(item));
        }
    }
}
// Release references
Py_XDECREF(argv_list);
Py_XDECREF(sys_module);
res = MPI_Init(&len, &ch);

MPI_Init receives the arguments taken from sys.argv, which means you can trigger the crash by appending anything after the script name:

cd example/SNS_Linac/pyorbit3_linac_model
python pyorbit3_sns_linac_mebt_hebt2.py thiswillsegfault

Oddly, if you run the script a few times in quick succession, MPI sometimes initializes successfully, presumably because the memory just past the end of the array happens to hold a zero on those runs.
The segfault occurs because the char** array is not NULL-terminated, so one fix is to allocate one extra element and store NULL in it.

len = PyList_Size(argv_list);
// One extra slot for the terminating NULL; len itself must still equal argc.
ch = (char**) malloc(sizeof(char*) * (len + 1));

...

ch[len] = NULL;
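
To illustrate the convention outside of Python and MPI, here is a minimal sketch of building an argv-style array with argc entries followed by a terminating NULL. The names make_argv and count_until_null are hypothetical and not part of PyORBIT; count_until_null only imitates a consumer that ignores argc and walks the array until it finds NULL, which is what the opal_argv_join frame in the trace above suggests Open MPI does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build a C-style argv from strings owned by the caller.
// The result holds args.size() pointers plus one trailing nullptr.
// The pointers stay valid only while `args` is alive and unmodified.
std::vector<char*> make_argv(std::vector<std::string>& args) {
    std::vector<char*> argv;
    argv.reserve(args.size() + 1);
    for (std::string& arg : args) {
        argv.push_back(&arg[0]);  // pointer to the string's own buffer
    }
    argv.push_back(nullptr);      // the terminator missing from the original code
    assert(argv.size() == args.size() + 1);
    return argv;
}

// Hypothetical consumer that ignores argc and walks until it sees NULL.
// Without the terminator this loop would read past the end of the array.
std::size_t count_until_null(char* const* argv) {
    std::size_t n = 0;
    while (argv[n] != nullptr) {
        ++n;
    }
    return n;
}
```

With this layout, argc is argv.size() - 1 and argv.data() can be handed to an API that expects a char** in the style of main's argv.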

Regardless, according to the OpenMPI docs:

Open MPI’s MPI_Init and MPI_Init_thread both accept the C argc and argv arguments to main, but neither modifies, interprets, nor distributes them:

MPICH doesn't seem to require it either:

As of MPI-2, MPI_Init will accept NULL as input parameters. Doing so will impact the values stored in MPI_INFO_ENV.

It seems that forwarding argc and argv through MPI_Init is somewhat vestigial, and that initialization is handled by some other mechanism, such as environment variables likely set by mpirun.
Therefore, the better solution is probably to refactor ORBIT_MPI_Init to call MPI_Init with NULL arguments and not read from sys.argv at all.
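
A rough sketch of what that refactor could look like is below. It is not the actual PyORBIT code: the function name orbit_mpi_init_sketch, its int return type, and the opt-in macro SKETCH_USE_REAL_MPI are assumptions made for illustration. When the macro is not defined, tiny stand-ins replace mpi.h so the snippet compiles without an MPI installation. The MPI_Initialized check is an extra precaution that the issue itself does not propose.

```cpp
#ifdef SKETCH_USE_REAL_MPI  // hypothetical opt-in macro, not a PyORBIT flag
#include <mpi.h>
#else
// Minimal stand-ins so the sketch compiles without MPI installed.
static const int MPI_SUCCESS = 0;
static bool sketch_mpi_initialized = false;
static int MPI_Initialized(int* flag) {
    *flag = sketch_mpi_initialized ? 1 : 0;
    return MPI_SUCCESS;
}
static int MPI_Init(int* /*argc*/, char*** /*argv*/) {
    sketch_mpi_initialized = true;
    return MPI_SUCCESS;
}
#endif

// Hypothetical replacement for the body of ORBIT_MPI_Init:
// no sys.argv lookup, no malloc, just null arguments (allowed since MPI-2).
int orbit_mpi_init_sketch() {
    int already_initialized = 0;
    MPI_Initialized(&already_initialized);
    if (already_initialized) {
        return MPI_SUCCESS;  // avoid calling MPI_Init a second time
    }
    return MPI_Init(nullptr, nullptr);
}
```

Because nothing is read from sys.argv, extra command-line arguments to the script can no longer affect MPI initialization.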

Labels: bug
