# Troubleshooting

## Docker Training Freezes
If training stops and `nvidia-smi` inside the container returns `Failed to initialize NVML: Unknown Error` while `nvidia-smi` still works on the host, the container has lost GPU access. This is a known NVIDIA Container Toolkit issue, commonly seen when Docker uses the systemd cgroup driver.
### Check

```shell
nvidia-smi                                          # host GPU
docker info --format '{{.CgroupDriver}} {{.CgroupVersion}}'
docker exec -it <container_name> nvidia-smi         # container GPU
```
You are likely hitting this issue if:

- host `nvidia-smi` works
- container `nvidia-smi` fails
- Docker reports `systemd 2` (systemd cgroup driver with cgroup v2)
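The three checks above can be combined into a quick go/no-go script. This is a sketch; `CONTAINER` is a placeholder for your container's actual name.

```shell
# Run the diagnostic checks and print a verdict for each.
CONTAINER=<container_name>   # substitute your container's name

# Host GPU: if this fails, the problem is the host driver, not this issue.
nvidia-smi > /dev/null 2>&1 \
  && echo "host GPU: ok" \
  || echo "host GPU: FAILED (host driver problem instead)"

# Cgroup driver/version: 'systemd 2' matches the known failure mode.
docker info --format '{{.CgroupDriver}} {{.CgroupVersion}}'

# Container GPU: failing here (with host ok) points to this issue.
docker exec "$CONTAINER" nvidia-smi > /dev/null 2>&1 \
  && echo "container GPU: ok" \
  || echo "container GPU: FAILED (container lost GPU access)"
```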
If host `nvidia-smi` also fails, you have a host GPU or driver problem instead, and the steps below will not help.
### Recover
Recreate the container:
If you are using a managed training session, restart the session after the container comes back.
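A minimal sketch of recreating the container; the image name and run flags are illustrative and should match however the container was originally started.

```shell
# Stop and remove the container that lost GPU access, then start a fresh one.
docker stop <container_name>
docker rm <container_name>
docker run -d --gpus all --name <container_name> <image>

# Verify GPU access is restored inside the new container.
docker exec <container_name> nvidia-smi
```

If the container is managed by docker compose, `docker compose down && docker compose up -d` achieves the same thing.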
### Mitigate

NVIDIA recommends switching Docker from the systemd cgroup driver to cgroupfs. Update `/etc/docker/daemon.json` on the host:
```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
```
Then restart Docker and recreate the container:
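A sketch of the restart step, assuming a systemd host; placeholders as above.

```shell
# Restart the Docker daemon so the new cgroup driver takes effect.
sudo systemctl restart docker

# Confirm the driver changed (should now report 'cgroupfs').
docker info --format '{{.CgroupDriver}}'

# Recreate the container and verify GPU access.
docker run -d --gpus all --name <container_name> <image>
docker exec <container_name> nvidia-smi
```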
### Notes
- This is a Docker / NVIDIA runtime issue, not a problem with your model or training code.
- Changing the cgroup driver affects every container on the host; validate the change on one machine before adopting it broadly.
- Reference: NVIDIA Container Toolkit troubleshooting