
Issue: Multi-node and Multi-GPU Inference Problems with DeepSpeed MII #545

@lcnmzz00

Problem Description
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, total of 4 GPUs, each with ~24GB VRAM) and read prompts from JSON files in an input folder to generate responses, which are then saved in an output folder.

However, depending on the model used, I encounter various issues:

1. With the Qwen-32B model:

  • Initial responses are correct.
  • After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.

2. With Llama 3.1 8B:

  • In single-node mode, everything works perfectly.
  • In multi-node mode, the code does not hang as with Qwen, but the responses are garbled or incorrect. For example:

Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).

3. With Mistral 7B Instruct v0.3:

  • The code hangs after only a few iterations.
  • Responses are partially scrambled, similar to the Llama case.

Troubleshooting Attempts:

I have tried several things to address these issues. The two attempts below were the most confusing, because neither changed the behavior at all:

  • Adding/removing torch.distributed.barrier(): I tried synchronizing the processes with torch.distributed.barrier() both before and after the inference step. This resolved neither the hang nor the garbled responses.
  • Modifying the all_rank_output parameter: I tried both enabling and disabling all_rank_output when initializing the pipeline. This did not resolve the issues either.

System Configuration:

- hostfile:
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2

- Execution Commands:
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py

Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py

- Code Used

import json
import os
from pathlib import Path
from time import sleep
import time
import torch
import mii
import gc

# Paths for input and output files
IN_REQUEST_PATH = Path("/path/to/input/")
OUT_REQUEST_PATH = Path("/path/to/output/")

# Local and global rank
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
global_rank = int(os.getenv("RANK", "-1"))

# Initialize the model pipeline
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)

iteration = 0

while True:
   print(iteration)
   iteration += 1

   print(f"GPU memory allocated: {torch.cuda.memory_allocated()}")
   print(f"GPU memory reserved: {torch.cuda.memory_reserved()}")

   # Process input files
   request_paths = list(IN_REQUEST_PATH.iterdir())
   print(f"LOCAL RANK {local_rank}, GLOBAL RANK {global_rank}")
   
   if len(request_paths) > 0:
       requests = [json.loads(path.read_text(encoding="utf-8")) for path in request_paths]
       prompts = [r["prompt"] for r in requests]

       # Perform inference
       start_time = time.time()
       responses = pipe(prompts, max_new_tokens=128)  
       end_time = time.time()
       print(f"Inference time: {end_time - start_time:.2f} seconds")

       # Write results
       if global_rank == 0:
           print("Printing output")
           Path("./responses.json").write_text("\n\n\n".join([r.generated_text for r in responses]))
           
           for request, response in zip(requests, responses):
               request["response"] = response.generated_text
               Path(OUT_REQUEST_PATH / f"{request['id']}.json").write_text(
                   json.dumps(request, ensure_ascii=False), encoding="utf-8"
               )

   # Clear GPU cache
   torch.cuda.empty_cache()
   gc.collect()
   sleep(10)
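
One thing the script above does not control is the order and content of the prompt list on each rank: every process lists the input folder on its own, and Path.iterdir() yields entries in arbitrary order. Whether this contributes to the hang or the garbled output is not confirmed. The following is only a minimal sketch for ruling it out, assuming all nodes see the same input folder (for example via a shared filesystem). The helper name load_requests and the restriction to *.json files are illustrative assumptions, not part of the original script.

```python
import json
import tempfile
from pathlib import Path


def load_requests(in_dir):
    """Load every *.json request in in_dir, ordered by file name.

    Hypothetical helper: sorting makes the resulting list identical on
    every process that sees the same set of files.
    """
    # Path.iterdir() gives no ordering guarantee, so sort explicitly.
    paths = sorted(p for p in in_dir.iterdir() if p.suffix == ".json")
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


if __name__ == "__main__":
    # Small self-contained demo using a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for name in ("b", "a"):
            payload = {"id": name, "prompt": f"prompt {name}"}
            (folder / f"{name}.json").write_text(
                json.dumps(payload), encoding="utf-8"
            )
        prompts = [r["prompt"] for r in load_requests(folder)]
        print(prompts)
```

Sorting does not help if a file appears between the moments two ranks list the folder, since the ranks would then still disagree on the prompt list. Under that assumption, having rank 0 read the folder and share the list with the other ranks (for instance with torch.distributed.broadcast_object_list) would be a stricter check.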
