Issue: Multi-node and Multi-GPU Inference Problems with DeepSpeed MII #545

**Problem Description**
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, total of 4 GPUs, each with ~24GB VRAM) and read prompts from JSON files in an input folder to generate responses, which are then saved in an output folder.

However, depending on the model used, I encounter various issues:

**1. With the Qwen-32B model:**

- Initial responses are correct.
- After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.

**2. With Llama 3.1 8B:**

- In single-node mode, everything works perfectly.
- In multi-node mode, the code does not hang as with Qwen, but the responses are garbled or incorrect. For example:

Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).

**3. With Mistral 7B Instruct v0.3:**

- The code hangs after only a few iterations.
- Responses are partially scrambled, similar to the Llama case.

**Troubleshooting Attempts:**

I have tried several things to address these issues. The two attempts below are the most confusing, and they raised more questions than they answered:

- Adding/removing torch.distributed.barrier(): I tried synchronizing the processes with torch.distributed.barrier() both before and after the inference step. This resolved neither the hang nor the garbled responses.
- Modifying the all_rank_output parameter: I tried both enabling and disabling all_rank_output when initializing the pipeline. This did not resolve the issues either.
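
One further check that may be worth adding, offered as an assumption and not as a confirmed cause: every rank lists the input folder on its own, and `Path.iterdir()` yields entries in arbitrary order, so ranks on different nodes could in principle build different `prompts` lists. The sketch below prints a per-rank fingerprint of the batch so the logs of the four ranks can be compared. `batch_fingerprint` is a hypothetical helper and uses only the standard library.

```python
import hashlib
import json
import os


def batch_fingerprint(prompts):
    """Return a short SHA-256 digest of the prompt list, order included."""
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# Same environment variable the script reads; -1 if it is not set.
global_rank = int(os.getenv("RANK", "-1"))

prompts = ["What is the sun?"]  # placeholder batch for illustration
print(f"rank {global_rank}: {len(prompts)} prompts, digest {batch_fingerprint(prompts)}")
```

If the digests printed by different ranks differ within the same iteration, those ranks are not being given the same inputs. Whether that explains the hang or the garbled text is a separate question.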

**System Configuration:**

**- hostfile:**
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2
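
For reference, this hostfile describes two nodes with two slots each, so four ranks are expected in total. A minimal sketch of that arithmetic, assuming the `hostname slots=N` line format shown above; `parse_hostfile` is a hypothetical helper written for illustration and is not part of DeepSpeed.

```python
def parse_hostfile(text):
    """Map each hostname to its slot count, given 'hostname slots=N' lines."""
    slots = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, slot_field = line.partition(" ")
        key, _, value = slot_field.strip().partition("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile line: {line!r}")
        slots[host] = int(value)
    return slots


# Placeholder addresses copied from the hostfile above.
hostfile_text = """
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2
"""

hosts = parse_hostfile(hostfile_text)
world_size = sum(hosts.values())
print(f"{len(hosts)} nodes, {world_size} ranks expected")  # prints "2 nodes, 4 ranks expected"
```

Comparing this number with the ranks that actually print their `LOCAL RANK` / `GLOBAL RANK` line is a quick way to confirm that all four processes joined the job.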

**- Execution Commands:** 
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py

Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py

**- Code Used**

```python
import json
import os
from pathlib import Path
from time import sleep
import time
import torch
import mii
import gc

# Paths for input and output files
IN_REQUEST_PATH = Path("/path/to/input/")
OUT_REQUEST_PATH = Path("/path/to/output/")

# Local and global rank
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
global_rank = int(os.getenv("RANK", "-1"))

# Initialize the model pipeline
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)

iteration = 0

while True:
    print(iteration)
    iteration += 1

    print(f"GPU memory allocated: {torch.cuda.memory_allocated()}")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved()}")

    # Process input files
    request_paths = list(IN_REQUEST_PATH.iterdir())
    print(f"LOCAL RANK {local_rank}, GLOBAL RANK {global_rank}")
    
    if len(request_paths) > 0:
        requests = [json.loads(path.read_text(encoding="utf-8")) for path in request_paths]
        prompts = [r["prompt"] for r in requests]

        # Perform inference
        start_time = time.time()
        responses = pipe(prompts, max_new_tokens=128)  
        end_time = time.time()
        print(f"Inference time: {end_time - start_time:.2f} seconds")

        # Write results
        if global_rank == 0:
            print("Printing output")
            Path("./responses.json").write_text("\n\n\n".join([r.generated_text for r in responses]))
            
            for request, response in zip(requests, responses):
                request["response"] = response.generated_text
                Path(OUT_REQUEST_PATH / f"{request['id']}.json").write_text(
                    json.dumps(request, ensure_ascii=False), encoding="utf-8"
                )

    # Clear GPU cache
    torch.cuda.empty_cache()
    gc.collect()
    sleep(10)
 ```
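
The loop above lists the input folder with `iterdir()`, which does not guarantee any ordering, and it parses every entry it finds. A minimal sketch of a more defensive loader follows, assuming request files end in `.json` and carry the `id` and `prompt` keys used above. `load_requests` is a hypothetical helper, and the temporary folder exists only for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def load_requests(in_dir):
    """Load request files in sorted order, skipping ones that do not parse."""
    requests = []
    for path in sorted(Path(in_dir).glob("*.json")):
        try:
            request = json.loads(path.read_text(encoding="utf-8"))
        except (ValueError, OSError):
            # ValueError covers invalid JSON and bad UTF-8,
            # e.g. a file that is still being written.
            continue
        if isinstance(request, dict) and "id" in request and "prompt" in request:
            requests.append(request)
    return requests


# Small self-contained demonstration using a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.json").write_text('{"id": "b", "prompt": "What is the moon?"}', encoding="utf-8")
    Path(tmp, "a.json").write_text('{"id": "a", "prompt": "What is the sun?"}', encoding="utf-8")
    Path(tmp, "partial.json").write_text('{"id": "c", "prom', encoding="utf-8")
    loaded = load_requests(tmp)

print([r["id"] for r in loaded])  # prints ['a', 'b']
```

This only removes ordering and partial-file differences. It does not by itself guarantee that ranks on different nodes see the same set of files at the same moment.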
