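I think it has something to do with the CUDA version: the installed PyTorch build only supports CUDA capabilities sm_37, sm_50, sm_60 and sm_70, but the A100 (MIG 3g.40gb slice) is sm_80, so the first convolution fails with "no kernel image is available for execution on the device".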
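PyTorch prints the warning in the log after comparing the device's compute capability with the architectures the wheel was compiled for. Below is a rough plain-Python sketch of that comparison, assuming (as the PyTorch 1.x check appears to do) that only `sm_XY` entries count and that a build is treated as compatible when it has a kernel with the same major version as the device. `is_capability_supported` is a hypothetical helper, not a PyTorch function.

```python
import re
from typing import Iterable, Tuple


def is_capability_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if arch_list has an sm_XY entry with the device's major version.

    Hypothetical approximation of PyTorch's compatibility warning: compiled
    kernels (cubins) are only binary-compatible within one major capability.
    """
    major, _minor = capability
    supported_sm = []
    for arch in arch_list:
        # Keep real-architecture entries like "sm_70"; PTX entries such as
        # "compute_70" are ignored here, as the warning logic appears to do.
        match = re.match(r"sm_(\d+)", arch)
        if match:
            supported_sm.append(int(match.group(1)))
    return any(sm // 10 == major for sm in supported_sm)


# Arch list and capability as reported in the warning: A100 (sm_80) vs. this build.
build_arches = ["sm_37", "sm_50", "sm_60", "sm_70"]
print(is_capability_supported((8, 0), build_arches))              # False
print(is_capability_supported((8, 0), build_arches + ["sm_80"]))  # True
```

With the arch list from the warning, no major-8 kernel exists for sm_80, which is consistent with the "no kernel image is available" error raised by the first `Conv2d` forward.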
ishfaqha@login-1:~/code/LangevinMCRL/output/atari1_fgts$ cat 1.txt 
Current working directory: /home/mila/i/ishfaqha/code/LangevinMCRL
Starting run at: Wed Jul 26 15:40:35 EDT 2023
Job Array ID / Job ID: 3431041 / 3431041
This is job 1 out of 1 jobs
SLURM_TMPDIR: /Tmp/slurm.3431041.0
SLURM_JOB_NODELIST: cn-g017
[=== Module miniconda/3 loaded ===]
/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/cuda/__init__.py:146: UserWarning: 
NVIDIA A100-SXM4-80GB MIG 3g.40gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-80GB MIG 3g.40gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/cuda/__init__.py:146: UserWarning: 
NVIDIA A100-SXM4-80GB MIG 3g.40gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-80GB MIG 3g.40gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/cuda/__init__.py:146: UserWarning: 
NVIDIA A100-SXM4-80GB MIG 3g.40gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-80GB MIG 3g.40gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "main.py", line 58, in <module>
    main(sys.argv)
  File "main.py", line 55, in main
    exp.run()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/experiment.py", line 43, in run
    self.agent.run_steps()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/agents/FGTSDQN.py", line 81, in run_steps
    self.collectors['Train'].collect(n_step=self.batch_size * self.cfg['env']['train_num'])
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/tianshou/data/collector.py", line 297, in collect
    result = self.policy(self.data, last_state)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/policy.py", line 97, in forward
    logits, hidden = model(obs_next, state=state, info=batch.info, head=head)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 82, in forward
    logits, state = self.nets[head](obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 104, in forward
    q_values = self._network(obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "main.py", line 58, in <module>
    main(sys.argv)
  File "main.py", line 55, in main
    exp.run()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/experiment.py", line 43, in run
    self.agent.run_steps()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/agents/FGTSDQN.py", line 81, in run_steps
    self.collectors['Train'].collect(n_step=self.batch_size * self.cfg['env']['train_num'])
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/tianshou/data/collector.py", line 297, in collect
    result = self.policy(self.data, last_state)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/policy.py", line 97, in forward
    logits, hidden = model(obs_next, state=state, info=batch.info, head=head)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 82, in forward
    logits, state = self.nets[head](obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 104, in forward
    q_values = self._network(obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "main.py", line 58, in <module>
    main(sys.argv)
  File "main.py", line 55, in main
    exp.run()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/experiment.py", line 43, in run
    self.agent.run_steps()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/agents/FGTSDQN.py", line 81, in run_steps
    self.collectors['Train'].collect(n_step=self.batch_size * self.cfg['env']['train_num'])
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/tianshou/data/collector.py", line 297, in collect
    result = self.policy(self.data, last_state)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/policy.py", line 97, in forward
    logits, hidden = model(obs_next, state=state, info=batch.info, head=head)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 82, in forward
    logits, state = self.nets[head](obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 104, in forward
    q_values = self._network(obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/cuda/__init__.py:146: UserWarning: 
NVIDIA A100-SXM4-80GB MIG 3g.40gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-80GB MIG 3g.40gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "main.py", line 58, in <module>
    main(sys.argv)
  File "main.py", line 55, in main
    exp.run()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/experiment.py", line 43, in run
    self.agent.run_steps()
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/agents/FGTSDQN.py", line 81, in run_steps
    self.collectors['Train'].collect(n_step=self.batch_size * self.cfg['env']['train_num'])
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/tianshou/data/collector.py", line 297, in collect
    result = self.policy(self.data, last_state)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/policy.py", line 97, in forward
    logits, hidden = model(obs_next, state=state, info=batch.info, head=head)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 82, in forward
    logits, state = self.nets[head](obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/code/LangevinMCRL/components/ensemble_langevin_dqn/network.py", line 104, in forward
    q_values = self._network(obs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mila/i/ishfaqha/envs/tianshou/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Job finished with exit code 4 at: Wed Jul 26 15:40:48 EDT 2023
Copy log files from temporary directory
Source directory: /Tmp/slurm.3431041.0/atari1_fgts/.
Destination directory: ./logs/atari1_fgts/

======== GPU REPORT ========

==============NVSMI LOG==============

Timestamp                                 : Wed Jul 26 15:40:49 2023
Driver Version                            : 515.65.01
CUDA Version                              : 11.7

Attached GPUs                             : 1
GPU 00000000:81:00.0
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Accounted Processes                   : None

Wed Jul 26 15:40:49 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                   On |
| N/A   28C    P0    80W / 500W |     24MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(tianshou) ishfaqha@login-1:~/code/LangevinMCRL/output/atari1_fgts$ 
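Note that the "CUDA Version: 11.7" in the nvidia-smi report is the newest CUDA version the 515.65.01 driver supports, not the CUDA version PyTorch was built with. To see what the `tianshou` env actually contains, a small diagnostic like the one below could be run inside a job on the same node type. It only uses standard `torch` attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.get_device_capability()`, `torch.cuda.get_arch_list()`), and it runs one small `Conv2d` forward because that is the call that fails in the traceback. The name `collect_cuda_info` and the 4x84x84 input shape (a typical stacked-frame Atari observation) are assumptions for this sketch. Since sm_80 support arrived with CUDA 11.0, a PyTorch wheel built against CUDA 11.x is presumably what is needed; the exact wheel and install command should be taken from https://pytorch.org/get-started/locally/.

```python
def collect_cuda_info() -> dict:
    """Gather PyTorch/CUDA build details (hypothetical helper; safe if torch is missing)."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if not info["cuda_available"]:
        return info

    info["device"] = torch.cuda.get_device_name(0)
    info["capability"] = torch.cuda.get_device_capability(0)
    info["arch_list"] = torch.cuda.get_arch_list()  # e.g. ['sm_37', ..., 'sm_70']
    # Smoke test: the traceback fails inside Conv2d, so run one small convolution.
    try:
        conv = torch.nn.Conv2d(4, 8, kernel_size=3).cuda()
        conv(torch.zeros(1, 4, 84, 84, device="cuda"))
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        info["conv_ok"] = True
    except RuntimeError as err:
        info["conv_ok"] = False
        info["conv_error"] = str(err)
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_info().items():
        print(f"{key}: {value}")
```

If `arch_list` still lacks an `sm_8x` entry after reinstalling, the old wheel is probably still the one being imported from the env.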