NVIDIA Apex provides custom fused operators for PyTorch that can speed up training for a variety of models. The one I am most interested in is the FusedLAMB optimizer, since LAMB has been shown to stabilize pre-training of large models at large batch sizes. I have had a hard time installing this package because it has to be built from source and has no dedicated Windows support. I recently tried again and managed to build it with the CUDA extensions. The process is outlined below.
Installing NVIDIA Apex
I found the following steps are enough to install NVIDIA Apex on Windows 11, assuming you already have the Visual Studio C++ build tools installed on your system. For some reason the current commit on the main branch breaks the install on Windows, but checking out an earlier commit still works. The earlier commit does need two files modified after the install, because it still imports from the torch._six module, which current PyTorch releases no longer provide.
python -m pip install --upgrade pip
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y --copy
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit cuda-nvcc -y --copy
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 2ec84eb
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
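Before editing any files, it can be useful to confirm that the build actually produced the compiled extensions. The following is a minimal sketch and not part of Apex: find_installed is a hypothetical helper, and the module names apex_C (from --cpp_ext) and amp_C (from --cuda_ext, which FusedLAMB depends on) are assumed to be what this build produces. It only looks the modules up and does not import them, since importing apex may still fail until the torch._six references are fixed.

```python
import importlib.util

# Assumed module names: "apex" is the Python package, "apex_C" is built by
# --cpp_ext, and "amp_C" is built by --cuda_ext (FusedLAMB relies on it).
MODULES = ("torch", "apex", "apex_C", "amp_C")


def find_installed(names=MODULES):
    """Return {module_name: bool} without importing any of the modules."""
    found = {}
    for name in names:
        try:
            # find_spec on a top-level name only locates it; nothing is executed.
            found[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in find_installed().items():
        print(f"{name:8s} {'found' if ok else 'MISSING'}")
```

If amp_C shows up as missing, the CUDA extensions were most likely skipped during the build, and the verbose pip output is the place to look.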
After the install completes, open apex/amp/_amp_state.py and replace
if TORCH_MAJOR == 0:
    import collections.abc as container_abcs
else:
    from torch._six import container_abcs
with
import collections.abc as container_abcs
Finally, in apex/amp/_initialize.py, replace
from torch._six import string_classes
with
string_classes = str
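The two edits above can also be applied with a short script. This is only a sketch, assuming the files contain exactly the lines quoted above. The names patch_amp_state, patch_initialize, and patch_file are hypothetical helpers and not part of Apex.

```python
import re
from pathlib import Path

# Matches the TORCH_MAJOR if/else block quoted above, whatever its indentation width.
SIX_BLOCK = re.compile(
    r"^if TORCH_MAJOR == 0:[ \t]*\n"
    r"[ \t]+import collections\.abc as container_abcs[ \t]*\n"
    r"else:[ \t]*\n"
    r"[ \t]+from torch\._six import container_abcs[ \t]*$",
    re.MULTILINE,
)
# Matches the single torch._six import line in _initialize.py.
SIX_IMPORT = re.compile(
    r"^from torch\._six import string_classes[ \t]*$", re.MULTILINE
)


def patch_amp_state(source: str) -> str:
    """Collapse the if/else block into the plain collections.abc import."""
    return SIX_BLOCK.sub("import collections.abc as container_abcs", source)


def patch_initialize(source: str) -> str:
    """Replace the torch._six import with a direct assignment."""
    return SIX_IMPORT.sub("string_classes = str", source)


def patch_file(path, patch) -> bool:
    """Apply `patch` to the file at `path`; return True if the file changed."""
    target = Path(path)
    original = target.read_text(encoding="utf-8")
    updated = patch(original)
    if updated != original:
        target.write_text(updated, encoding="utf-8")
    return updated != original
```

For example, patch_file(path_to_amp_state, patch_amp_state) returns True if the file was changed, where path_to_amp_state is a placeholder for wherever _amp_state.py ended up on your system. Running it a second time is harmless, since the patterns no longer match.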
Hopefully this helps out someone else who has been struggling to get this working.