Settings

Configure the local inference backend

Engine configuration

Host, Port, Model override (optional)

MLX backend

Apple's MLX framework runs models natively on the Apple Silicon GPU using unified memory. The mlx-lm server speaks the OpenAI-compatible chat completions format; with the host and port used in this guide, the API endpoint is http://127.0.0.1:3000/v1/chat/completions.
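As a minimal sketch of how the Host and Port settings map to that endpoint, the helper below assembles the URL from its parts. The function names are illustrative and not part of mlx-lm or this application; the defaults mirror the host and port used in this guide.

```python
from urllib.parse import urlunsplit


def base_url(host: str = "127.0.0.1", port: int = 3000) -> str:
    """Return the OpenAI-style base URL (hypothetical helper for illustration)."""
    return urlunsplit(("http", f"{host}:{port}", "/v1", "", ""))


def chat_completions_url(host: str = "127.0.0.1", port: int = 3000) -> str:
    """Return the chat completions endpoint derived from the base URL."""
    return base_url(host, port) + "/chat/completions"
```

Changing the Port setting only changes the authority part of the URL; the /v1/chat/completions path stays the same.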

Install mlx-lm

Install the mlx-lm Python package, which includes the LM server:

pip install mlx-lm

For best performance on an M4 Max, keep mlx up to date (pip installs it as a dependency of mlx-lm) and run macOS Sonoma or later with the latest Xcode Command Line Tools.

Download a model

The mlx-community organization on Hugging Face provides pre-converted models. On a 128 GB workstation, 70B and 120B models are practical choices:

# 70B model at 8-bit (~75 GB; 70B parameters at one byte each is already ~70 GB) - best for most tasks
huggingface-cli download mlx-community/Llama-3.1-70B-Instruct-8bit

# 120B model at 4-bit (~66 GB) - maximum capability
huggingface-cli download mlx-community/Command-R-Plus-120B-4bit

# Or skip the manual download: mlx-lm fetches and caches the model the first time it is loaded
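Before downloading, it can help to estimate whether a quantized model fits in memory. The sketch below is a back-of-the-envelope calculation, not an mlx-lm API: it assumes group quantization where every group of 64 weights also stores a 16-bit scale and a 16-bit bias, and it ignores the KV cache and activations. The function name and the group size default are assumptions for illustration.

```python
def quantized_weight_gb(params_billion: float, bits: int, group_size: int = 64) -> float:
    """Estimate the weight footprint in GB (10^9 bytes) of a group-quantized model.

    Assumption: each group of `group_size` weights carries a 16-bit scale and a
    16-bit bias, adding 32 / group_size bits of overhead per weight.
    """
    bits_per_weight = bits + 32 / group_size
    # params_billion * 1e9 weights * bits_per_weight / 8 bytes, expressed in GB
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, bits: int, total_gb: float = 128.0,
                   reserve_gb: float = 16.0) -> bool:
    """Check the estimate against total memory minus a reserve for the system."""
    return quantized_weight_gb(params_billion, bits) <= total_gb - reserve_gb
```

The reserve_gb default is an arbitrary placeholder; pick a value that reflects what else runs on the machine and how much context you plan to use.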

Start the MLX server

The command below targets an M4 Max with 128 GB of unified memory. The server exposes an OpenAI-compatible API at http://127.0.0.1:3000:

python -m mlx_lm.server \
  --model mlx-community/Llama-3.1-70B-Instruct-8bit \
  --host 127.0.0.1 \
  --port 3000
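Large models can take a while to load, so a client may want to wait until the server accepts connections. The following is a minimal sketch using only the standard library; wait_for_port is a hypothetical helper, and an open port only shows that the server is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 3000,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Call wait_for_port() after launching the server and send the first request only once it returns True.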

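Tip: For very large models (120B), pass --cache-limit-gb 100 to reduce memory pressure, provided your installed mlx-lm version supports the flag (check python -m mlx_lm.server --help). At 4-bit, a 120B model takes about 66 GB, which leaves roughly 60 GB of the M4 Max's 128 GB for system processes and context.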

Test the API

Once the server is running, the endpoint accepts OpenAI-style chat completion requests. Any OpenAI client library should work once its base_url is set to http://127.0.0.1:3000/v1:

curl http://127.0.0.1:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
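The same request can be sent from Python without extra dependencies. This is a sketch that mirrors the curl call using the standard library; build_chat_request and send_chat are illustrative names, and the response parsing assumes the usual OpenAI-style shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:3000/v1",
                       model: str = "local") -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server running, send_chat("Hello!") returns the model's reply as a string.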

Monitor inference stats

The chat interface tracks MLX performance automatically. Tokens-per-second throughput, peak memory usage, and weight loading times appear in the Dashboard and in the inline chat stats. On the 128 GB M4 Max, expect roughly:

70B @ 8-bit: 25-40 tokens/s, ~75 GB peak
120B @ 4-bit: 15-25 tokens/s, ~66 GB peak
8B @ 8-bit: 80-120 tokens/s, ~9 GB peak
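To cross-check the throughput figures outside the chat interface, tokens per second can be derived from a response's usage block and the wall-clock time of the call. This is a sketch under the assumption that the server returns an OpenAI-style usage.completion_tokens field; the helper names are hypothetical. Because the elapsed time includes prompt processing, the result slightly understates pure generation speed.

```python
import time


def timed_call(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def tokens_per_second(response: dict, elapsed_seconds: float) -> float:
    """Compute generated tokens per second from an OpenAI-style response dict."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Assumes the response carries a usage block with completion_tokens.
    return response["usage"]["completion_tokens"] / elapsed_seconds
```

Wrap whatever function sends the request in timed_call, then pass the parsed JSON response and the elapsed time to tokens_per_second.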