Settings

Configure the local inference backend

Engine configuration

MLX backend

Apple's MLX framework runs models natively on the Apple Silicon GPU using unified memory. The mlx-lm server exposes an OpenAI-compatible API, so with the settings below the chat endpoint is http://127.0.0.1:3000/v1/chat/completions.

1

Install mlx-lm

Install the MLX Python package and its LM server:

pip install mlx-lm

Installing mlx-lm also pulls in mlx as a dependency. For best performance on M4 Max, run macOS Sonoma or later with the latest Xcode Command Line Tools.
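A quick way to confirm the install is to check that the packages are importable and that Python is running natively on Apple Silicon. The sketch below is illustrative only; the helper name check_mlx_environment is hypothetical and not part of mlx-lm.

```python
import importlib.util
import platform


def check_mlx_environment() -> dict:
    """Report whether this machine looks ready to run mlx-lm (hypothetical helper)."""
    return {
        # Native Apple Silicon Python reports Darwin / arm64.
        "apple_silicon": platform.system() == "Darwin" and platform.machine() == "arm64",
        # find_spec returns None when a top-level package cannot be found.
        "mlx_installed": importlib.util.find_spec("mlx") is not None,
        "mlx_lm_installed": importlib.util.find_spec("mlx_lm") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_mlx_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If apple_silicon is reported as missing on an M-series Mac, the interpreter may be an x86_64 build running under Rosetta.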

2

Download a model

The mlx-community organization on Hugging Face provides pre-converted models. For 128GB workstations, the 70B and 120B models are excellent choices:

# 70B model at 8-bit (~75 GB) - best for most tasks
huggingface-cli download mlx-community/Llama-3.1-70B-Instruct-8bit

# 120B model at 4-bit (~66 GB) - maximum capability
huggingface-cli download mlx-community/Command-R-Plus-120B-4bit

# Or skip this step - mlx-lm downloads and caches the model on first use

3

Start the MLX server

The command below targets an M4 Max with 128GB unified memory. The server exposes an OpenAI-compatible API at http://127.0.0.1:3000:

python -m mlx_lm.server \
  --model mlx-community/Llama-3.1-70B-Instruct-8bit \
  --host 127.0.0.1 \
  --port 3000
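Large models can take a while to load, so a script that launches the server may want to wait until the port accepts connections before sending requests. The following is a minimal sketch using only the standard library; wait_for_server is a hypothetical helper, and a successful TCP connection only shows that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_server(host: str = "127.0.0.1", port: int = 3000, timeout_s: float = 120.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(0.5)
    return False
```

Calling wait_for_server() with no arguments checks the host and port used in the command above.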

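Tip: For very large models (120B), cap MLX's cache to prevent memory pressure, for example with --cache-limit-gb 100 if your mlx-lm version accepts that flag (check python -m mlx_lm.server --help). On a 128GB M4 Max, a 120B model at 4-bit (~66 GB) leaves roughly 60 GB for system processes and context.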
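The headroom figure comes from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below shows that calculation under simplifying assumptions: it counts weights only and ignores quantization scales, the KV cache, and activations, so real peak usage will be somewhat higher. Both function names are hypothetical.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """Memory left over after the model's footprint is subtracted."""
    return total_gb - used_gb


if __name__ == "__main__":
    nominal = estimate_weight_gb(120, 4)  # 120B parameters at 4 bits each
    print(f"nominal weights: {nominal:.0f} GB")
    print(f"headroom at ~66 GB actual: {headroom_gb(128, 66):.0f} GB")
```

The nominal 60 GB for a 120B model at 4-bit is in line with the ~66 GB figure once per-group quantization metadata is included.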

4

Test the API

Once the server is running, send a chat completion request to confirm it responds. OpenAI client libraries also work if you set base_url to http://127.0.0.1:3000/v1:

curl http://127.0.0.1:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
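The same request can be sent from Python without any third-party client. This is a minimal sketch using the standard library; build_chat_request and chat are hypothetical helper names, and the response layout (choices[0].message.content) is assumed to follow the OpenAI chat completions format.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:3000/v1"  # matches the --host/--port used above


def build_chat_request(prompt: str, model: str = "local",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 300.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style layout: the reply sits under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, print(chat("Hello!")) should print the model's reply. The generous default timeout allows for slow first responses from large models.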

5

Monitor inference stats

The chat interface monitors MLX performance automatically: tokens/second throughput, peak memory usage, and weight loading time appear in the Dashboard and in the inline chat stats. Throughput varies with context length and mlx-lm version, so treat the figures below as rough guides. With the 128GB M4 Max, expect:

  • 70B @ 8-bit: 25-40 tokens/s, ~75 GB peak
  • 120B @ 4-bit: 15-25 tokens/s, ~66 GB peak
  • 8B @ 8-bit: 80-120 tokens/s, ~9 GB peak
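To cross-check the displayed throughput, tokens/second can be computed by hand from a timed request. The sketch below assumes the response carries an OpenAI-style usage object with a completion_tokens field; both function names are hypothetical. Because the elapsed time includes prompt processing, the result will read slightly lower than pure generation speed.

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def throughput_from_response(body: dict, elapsed_s: float) -> float:
    """Read the token count from an OpenAI-style response body (assumed layout)."""
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed_s)


if __name__ == "__main__":
    # Example values only: 200 generated tokens in 8 seconds.
    print(f"{tokens_per_second(200, 8.0):.1f} tokens/s")
```

Measure elapsed_s with time.perf_counter() around the request for a stable wall-clock reading.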