Settings
Configure the local inference backend
Engine configuration
MLX backend
Apple's MLX framework runs models natively on the Apple Silicon GPU via unified memory. The mlx-lm server speaks the OpenAI chat completions format; with the host and port used on this page, the API endpoint is http://127.0.0.1:3000/v1/chat/completions.
Install mlx-lm
Install the MLX Python package and its LM server:
pip install mlx-lm
For best performance on M4 Max, keep mlx (installed as a dependency of mlx-lm) up to date, and make sure you are on macOS Sonoma or later with the latest Xcode Command Line Tools.
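The prerequisites above can be checked from Python before going further. The following is a minimal sketch, not part of mlx-lm: the function name check_mlx_environment is hypothetical, and it assumes that "Sonoma or later" means a macOS major version of 14 or higher.

```python
import importlib.util
import platform


def check_mlx_environment():
    """Report whether this machine looks ready to run mlx-lm (hypothetical helper)."""
    mac_version = platform.mac_ver()[0]  # empty string when not on macOS
    major = int(mac_version.split(".")[0]) if mac_version else 0
    return {
        # A native Python on Apple Silicon reports "arm64";
        # Intel Macs and Python running under Rosetta report "x86_64".
        "apple_silicon": platform.system() == "Darwin" and platform.machine() == "arm64",
        # Assumption: macOS Sonoma corresponds to major version 14.
        "sonoma_or_later": major >= 14,
        # find_spec returns None when a top-level package is not installed.
        "mlx_installed": importlib.util.find_spec("mlx") is not None,
        "mlx_lm_installed": importlib.util.find_spec("mlx_lm") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_mlx_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If apple_silicon is reported as missing on an M-series Mac, the Python interpreter is probably an x86_64 build running under Rosetta, which cannot use MLX.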
Download a model
The mlx-community organization on Hugging Face provides pre-converted models. On a 128GB workstation, the 70B and 120B models are good choices:
# 70B model at 8-bit (~42 GB) - best for most tasks
huggingface-cli download mlx-community/Llama-3.1-70B-Instruct-8bit
# 120B model at 4-bit (~66 GB) - maximum capability
huggingface-cli download mlx-community/Command-R-Plus-120B-4bit
# Or use without downloading - mlx-lm will cache automatically
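Since these downloads run to tens of gigabytes, it may be worth confirming free disk space first. The sketch below uses only the standard library. The helper name has_room_for and the 5 GB margin are illustrative, the sizes are the approximate figures quoted above, and the cache path assumes the default Hugging Face location (it differs if HF_HOME is set).

```python
import shutil
from pathlib import Path

# Assumption: default Hugging Face cache root; overridden when HF_HOME is set.
DEFAULT_CACHE = Path.home() / ".cache" / "huggingface"


def has_room_for(model_gb, cache_dir=DEFAULT_CACHE, margin_gb=5.0):
    """Return True if the volume holding cache_dir has model_gb plus a margin free."""
    # Walk up to the nearest directory that exists, because the cache
    # directory is only created on first download.
    probe = Path(cache_dir)
    while not probe.exists():
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    return free_gb >= model_gb + margin_gb


if __name__ == "__main__":
    # Sizes are the approximate download sizes listed in this section.
    for label, size_gb in [("70B @ 8-bit", 42), ("120B @ 4-bit", 66)]:
        print(label, "fits" if has_room_for(size_gb) else "needs more disk space")
```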
Start the MLX server
The following command targets an M4 Max with 128GB of unified memory. The server exposes an OpenAI-compatible API at http://127.0.0.1:3000:
python -m mlx_lm.server \
  --model mlx-community/Llama-3.1-70B-Instruct-8bit \
  --host 127.0.0.1 \
  --port 3000
Tip: For very large models (120B), pass --cache-limit-gb 100 to prevent memory pressure. The M4 Max's 128GB handles 120B at 4-bit comfortably with ~50GB to spare for system processes and context.
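Loading a large model can take a while, so a script that launches the server may want to wait until the port accepts connections before sending requests. This is a minimal sketch using only the standard library. The name wait_for_server and the timeout values are illustrative, and an open port only shows that the process is listening, not that the weights have finished loading.

```python
import socket
import time


def wait_for_server(host="127.0.0.1", port=3000, timeout_s=600.0, interval_s=2.0):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    print("server is up" if wait_for_server() else "server did not start in time")
```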
Test the API
Once the server is running, the endpoint accepts OpenAI-style requests, so any OpenAI client library works once base_url is set to http://127.0.0.1:3000/v1. To check it with curl:
curl http://127.0.0.1:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Monitor inference stats
The chat interface tracks MLX performance automatically: tokens/second throughput, peak memory usage, and weight loading time appear in the Dashboard and in the inline chat stats. On the 128GB M4 Max, expect roughly:
- 70B @ 8-bit: 25-40 tokens/s, ~42 GB peak
- 120B @ 4-bit: 15-25 tokens/s, ~66 GB peak
- 8B @ 8-bit: 80-120 tokens/s, ~5 GB peak
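To cross-check the throughput figures outside the chat interface, a single request can be timed directly against the server. This is a rough sketch using only the standard library. It assumes the response includes an OpenAI-style usage object with a completion_tokens field, and it reuses the "local" model name from the curl example. The helper names are illustrative. The result is end-to-end throughput, so it includes prompt processing and will read lower than pure generation speed.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:3000/v1"  # same host and port as the server command


def build_request(prompt, model="local", max_tokens=128):
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(completion_tokens, elapsed_s):
    """End-to-end throughput; returns 0.0 when no time has elapsed."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0


def measure(prompt="Hello!"):
    """Send one request and return generated tokens per wall-clock second."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # Assumption: the response carries an OpenAI-style "usage" object.
    generated = body.get("usage", {}).get("completion_tokens", 0)
    return tokens_per_second(generated, elapsed)


if __name__ == "__main__":
    print(f"{measure():.1f} tokens/s (end to end)")
```

Short prompts with a larger max_tokens give a number closer to steady-state generation speed, since prompt processing then accounts for less of the elapsed time.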