Arnab Saha - Engineering Leader

GPU-on-Demand: Wake-on-LAN for ML Workloads

February 13, 2026

8 min read

ml · gpu · homelab · automation · ollama · comfyui

The Problem: GPUs Are Power Hungry

I run a personal knowledge system called Recall that indexes 2000+ meeting notes using vector embeddings. The catch? Generating embeddings on my NAS (CPU-only) takes 20+ hours for a full reindex.

My GPU PC has an RTX 5090 that can do the same job in under 5 minutes. But leaving a 1000W+ system running 24/7 just for occasional ML jobs? That's:

  • ~$300/year in electricity
  • Constant fan noise
  • Unnecessary heat in my office

I needed GPU power on-demand, not 24/7.

The Solution: Wake-on-LAN + Auto-Shutdown

The idea is simple:

  1. Wake the GPU PC only when there's work to do
  2. Run the ML workload (embeddings, inference, image generation)
  3. Shutdown automatically when done

Here's how I built it.

Architecture Overview

┌─────────────────┐     WoL Packet      ┌─────────────────┐
│      NAS        │ ──────────────────► │    GPU PC       │
│  (K3s cluster)  │                     │  (RTX 5090)     │
│                 │ ◄────────────────── │                 │
│   Recall API    │    Ollama API       │   Ollama        │
│                 │    (embeddings)     │   ComfyUI       │
│                 │ ──────────────────► │                 │
│   Cron/Agent    │   Shutdown API      │  Shutdown Svc   │
└─────────────────┘                     └─────────────────┘

Step 1: WoL Server on the NAS

Wake-on-LAN requires sending a "magic packet" to the GPU PC's MAC address. I run a simple Python server on the NAS:

# wol-server.py
from flask import Flask, request, jsonify
from wakeonlan import send_magic_packet

app = Flask(__name__)

@app.route('/wake', methods=['POST'])
def wake():
    mac = request.json.get('mac')
    if not mac:
        return jsonify({"error": "mac required"}), 400
    send_magic_packet(mac)
    return jsonify({"status": "sent", "mac": mac})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9753)

Important: WoL only works on the same Layer 2 network. If your NAS and GPU PC are on different VLANs, you'll need a relay.
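If you do end up crossing VLANs, one workaround is a directed broadcast to the GPU PC's subnet, provided your router allows it (many block directed broadcasts by default). A minimal sketch that builds the magic packet by hand; the broadcast address and function names here are my own:

```python
import socket

def make_magic_packet(mac: str) -> bytes:
    # A magic packet is 6 bytes of 0xFF followed by the target MAC repeated 16 times
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + mac_bytes * 16

def send_directed_wol(mac: str, broadcast_addr: str, port: int = 9) -> None:
    # Send to the remote subnet's broadcast address (e.g. 192.168.10.255)
    # instead of the local L2 broadcast, so the packet can be routed
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(make_magic_packet(mac), (broadcast_addr, port))
```

The `wakeonlan` library can do the same thing via the `ip_address` and `port` arguments of `send_magic_packet`.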

Step 2: Shutdown Server on the GPU PC

The GPU PC runs a tiny HTTP server that accepts authenticated shutdown requests:

# shutdown-server.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)
SHUTDOWN_TOKEN = "your-secret-token"

@app.route('/shutdown', methods=['POST'])
def shutdown():
    token = request.headers.get('Authorization', '').replace('Bearer ', '')
    if token != SHUTDOWN_TOKEN:
        return jsonify({"error": "unauthorized"}), 401
    
    subprocess.Popen(['shutdown', '-h', 'now'])  # Linux
    return jsonify({"status": "shutting down"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8765)

This runs as a systemd service that starts on boot.
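For reference, a minimal unit file might look like the following; the paths are assumptions, and the service needs enough privilege to invoke shutdown:

```ini
# /etc/systemd/system/shutdown-server.service (hypothetical path)
[Unit]
Description=Authenticated shutdown endpoint
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/shutdown-server/shutdown-server.py
Restart=on-failure
User=root

[Install]
WantedBy=multi-user.target
```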

Step 3: The Orchestration Script

My daily sync script ties it all together:

import time

import requests

def run_gpu_reindex():
    # 1. Check if there are new files to index
    has_new, count, total = check_for_new_files()
    if not has_new:
        print("No new files - skipping GPU wake")
        return
    
    # 2. Wake the GPU PC
    requests.post("http://nas:9753/wake", json={"mac": "AA:BB:CC:DD:EE:FF"})
    
    # 3. Wait for Ollama to be ready (36 tries x 5s = up to ~3 minutes)
    for _ in range(36):
        try:
            r = requests.get("http://gpu-pc:11434/api/tags", timeout=5)
            if r.status_code == 200:
                break
        except requests.RequestException:
            pass
        time.sleep(5)
    else:
        print("GPU did not come up in time - aborting")
        return
    
    # 4. Run the indexing job (uses GPU Ollama)
    requests.post("http://nas:30889/index/start", json={"full": True})
    
    # 5. Wait for completion
    while True:
        progress = requests.get("http://nas:30889/index/progress").json()
        if progress["status"] != "running":
            break
        time.sleep(30)
    
    # 6. Shutdown GPU PC
    requests.post(
        "http://gpu-pc:8765/shutdown",
        headers={"Authorization": "Bearer your-secret-token"}
    )
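The readiness poll in step 3 and the ComfyUI startup wait later on share the same shape, so they can be factored into one helper; a sketch, with the naming being my own:

```python
import time

import requests

def wait_for_service(url: str, timeout: float = 180, interval: float = 5) -> bool:
    """Poll url until it returns HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # service not up yet; keep polling
        time.sleep(interval)
    return False
```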

Step 4: Smart Pre-checks

The key optimization: don't wake the GPU if there's nothing to do.

I track the last successful index timestamp and compare file modification times:

def check_for_new_files():
    last_index = load_last_index_time()
    all_files = list(vault.glob("*.md"))
    new_files = [f for f in all_files
                 if f.stat().st_mtime > last_index]
    return len(new_files) > 0, len(new_files), len(all_files)

This means my GPU PC only wakes when there are actual changes to process.
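The two helpers around that check are trivial; a sketch, with the state file location being an assumption:

```python
import json
import time
from pathlib import Path

STATE_FILE = Path.home() / ".recall" / "last_index.json"  # hypothetical location

def load_last_index_time() -> float:
    # 0.0 means "never indexed", so every file counts as new
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_index"]
    return 0.0

def save_last_index_time() -> None:
    # Record the current time after a successful reindex
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps({"last_index": time.time()}))
```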

Beyond Embeddings: ComfyUI for Image Generation

Once I had on-demand GPU access working for embeddings, I realized the same pattern works for image and video generation.

I run ComfyUI on the GPU PC alongside Ollama. When my AI assistant needs to generate an image, it:

  1. Wakes the GPU PC
  2. Waits for ComfyUI to be ready (port 8188)
  3. Submits the workflow via API
  4. Retrieves the generated image
  5. Shuts down (or leaves running if more requests are expected)

ComfyUI Client

import random

import requests

class ComfyUIClient:
    def __init__(self, host="gpu-pc", port=8188):
        self.url = f"http://{host}:{port}"
    
    def generate_image(self, prompt, workflow="sdxl-txt2img"):
        # Load and customize workflow
        workflow_data = load_workflow(workflow)
        workflow_data["6"]["inputs"]["text"] = prompt
        workflow_data["3"]["inputs"]["seed"] = random.randint(0, 2**32)
        
        # Submit to ComfyUI
        resp = requests.post(f"{self.url}/prompt", 
                           json={"prompt": workflow_data})
        prompt_id = resp.json()["prompt_id"]
        
        # Poll for completion
        while True:
            history = requests.get(f"{self.url}/history/{prompt_id}").json()
            if prompt_id in history:
                break
            time.sleep(0.5)
        
        # Fetch and return image
        image_info = history[prompt_id]["outputs"]["9"]["images"][0]
        return self.download_image(image_info)

Video Generation with Wan 2.1

The RTX 5090 has enough VRAM (32GB) to run Wan 2.1 for text-to-video:

def generate_video(self, prompt):
    workflow = load_workflow("wan-t2v-mp4")
    workflow["4"]["inputs"]["text"] = prompt
    
    # Video generation takes ~2.5 minutes
    resp = self.submit_and_wait(workflow, timeout=300)
    return self.download_video(resp)

| Task | Time | Notes |
|------|------|-------|
| SDXL image | ~4s | 1024x1024 |
| Video (33 frames) | ~2.5 min | 832x480, 20 steps |

The power-on latency (~60s boot + ~30s ComfyUI load) is acceptable for creative tasks where I'm not in a hurry.

Results

| Metric | Before | After |
|--------|--------|-------|
| Full reindex time | 20+ hours | 5 minutes |
| Image generation | Cloud API costs | Free (local) |
| GPU PC uptime | 24/7 | ~10-30 min/day |
| Monthly power cost | ~$25 | ~$2-3 |
| Heat output | Constant | Minimal |

Gotchas and Lessons

1. WoL Requires BIOS Setup. Enable "Wake on LAN" in your BIOS/UEFI, and also in your OS network settings (on Linux, `ethtool -s eth0 wol g`).

2. WoL Doesn't Cross Subnets. Magic packets are Layer 2 broadcasts; if your devices are on different VLANs, you need a relay or directed broadcast.

3. Have a Fallback. If the GPU doesn't wake, fall back gracefully:

if not wait_for_gpu():
    print("GPU unavailable, using cloud API fallback")
    return cloud_client.generate(prompt)

4. Shutdown Delay. Add a delay before shutdown if multiple jobs might come in quick succession. I use a 5-minute idle timer before auto-shutdown.
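A sketch of that idle timer; `shutdown_fn` would POST to the endpoint from Step 2, and the class name is my own:

```python
import threading

class IdleShutdown:
    """Call job_finished() after every job; shutdown_fn fires only
    once the machine has been idle for idle_seconds."""

    def __init__(self, shutdown_fn, idle_seconds=300):
        self.shutdown_fn = shutdown_fn
        self.idle_seconds = idle_seconds
        self._timer = None

    def job_finished(self):
        # Restart the countdown: any new job cancels the pending shutdown
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.idle_seconds, self.shutdown_fn)
        self._timer.daemon = True
        self._timer.start()
```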

When to Use This Pattern

This approach works well when:

  • You have occasional, bursty ML/AI workloads
  • Your GPU PC is power-hungry (gaming rigs, workstations)
  • You care about power costs or noise/heat
  • Jobs can tolerate 1-2 minute startup latency

It's probably overkill if:

  • You run ML jobs continuously
  • Your GPU is low-power (integrated, cloud instance)
  • Startup latency is unacceptable

What's Next

I'm considering:

  • Request queuing to batch multiple jobs before wake
  • Predictive wake based on usage patterns
  • Remote access via Tailscale for off-network use

This pattern powers my Recall knowledge system and local image generation via ComfyUI.