February 13, 2026 · 8 min read
## The Problem: GPUs Are Power Hungry
I run a personal knowledge system called Recall that indexes 2000+ meeting notes using vector embeddings. The catch? Generating embeddings on my NAS (CPU-only) takes 20+ hours for a full reindex.
My GPU PC has an RTX 5090 that can do the same job in under 5 minutes. But leaving a 1000W+ system running 24/7 just for occasional ML jobs? That's:
- ~$300/year in electricity
- Constant fan noise
- Unnecessary heat in my office
I needed GPU power on-demand, not 24/7.
## The Solution: Wake-on-LAN + Auto-Shutdown
The idea is simple:
- Wake the GPU PC only when there's work to do
- Run the ML workload (embeddings, inference, image generation)
- Shutdown automatically when done
Here's how I built it.
## Architecture Overview
```
┌─────────────────┐      WoL Packet      ┌─────────────────┐
│      NAS        │ ───────────────────► │     GPU PC      │
│  (K3s cluster)  │                      │   (RTX 5090)    │
│                 │ ◄─────────────────── │                 │
│   Recall API    │      Ollama API      │     Ollama      │
│                 │     (embeddings)     │     ComfyUI     │
│                 │ ───────────────────► │                 │
│   Cron/Agent    │     Shutdown API     │  Shutdown Svc   │
└─────────────────┘                      └─────────────────┘
```
## Step 1: WoL Server on the NAS
Wake-on-LAN requires sending a "magic packet" to the GPU PC's MAC address. I run a simple Python server on the NAS:
```python
# wol-server.py
from flask import Flask, request, jsonify
from wakeonlan import send_magic_packet

app = Flask(__name__)

@app.route('/wake', methods=['POST'])
def wake():
    mac = request.json.get('mac')
    send_magic_packet(mac)
    return jsonify({"status": "sent", "mac": mac})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9753)
```
**Important:** WoL only works on the same Layer 2 network. If your NAS and GPU PC are on different VLANs, you'll need a relay.
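There's no protocol magic here: a magic packet is just 6 bytes of `0xFF` followed by the target MAC repeated 16 times, sent as a UDP broadcast (conventionally to port 9). A dependency-free sketch of what `send_magic_packet` does under the hood:

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """6 bytes of 0xFF followed by the MAC address repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError(f"invalid MAC address: {mac}")
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet_raw(mac: str, broadcast="255.255.255.255", port=9):
    """Broadcast a magic packet on the local Layer 2 segment."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

The resulting packet is always 102 bytes, which is also a handy sanity check when debugging with `tcpdump`.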
## Step 2: Shutdown Server on the GPU PC
The GPU PC runs a tiny HTTP server that accepts authenticated shutdown requests:
```python
# shutdown-server.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)
SHUTDOWN_TOKEN = "your-secret-token"

@app.route('/shutdown', methods=['POST'])
def shutdown():
    token = request.headers.get('Authorization', '').replace('Bearer ', '')
    if token != SHUTDOWN_TOKEN:
        return jsonify({"error": "unauthorized"}), 401
    subprocess.Popen(['shutdown', '-h', 'now'])  # Linux
    return jsonify({"status": "shutting down"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8765)
```
This runs as a systemd service that starts on boot.
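A sketch of that unit file, assuming the script lives at `/opt/shutdown-server/shutdown-server.py` (the path and unit name are placeholders for your setup):

```ini
# /etc/systemd/system/shutdown-server.service
[Unit]
Description=Authenticated shutdown endpoint
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/shutdown-server/shutdown-server.py
Restart=on-failure
# Needs privileges to invoke shutdown; alternatively run as a user
# granted "shutdown" via sudoers and call it through sudo.
User=root

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now shutdown-server`.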
## Step 3: The Orchestration Script
My daily sync script ties it all together:
```python
import time

import requests

def run_gpu_reindex():
    # 1. Check if there are new files to index
    has_new, count, total = check_for_new_files()
    if not has_new:
        print("No new files - skipping GPU wake")
        return

    # 2. Wake the GPU PC
    requests.post("http://nas:9753/wake", json={"mac": "AA:BB:CC:DD:EE:FF"})

    # 3. Wait for Ollama to be ready (up to 3 minutes)
    for _ in range(36):
        try:
            r = requests.get("http://gpu-pc:11434/api/tags", timeout=5)
            if r.status_code == 200:
                break
        except requests.RequestException:
            pass
        time.sleep(5)

    # 4. Run the indexing job (uses GPU Ollama)
    requests.post("http://nas:30889/index/start", json={"full": True})

    # 5. Wait for completion
    while True:
        progress = requests.get("http://nas:30889/index/progress").json()
        if progress["status"] != "running":
            break
        time.sleep(30)

    # 6. Shutdown GPU PC
    requests.post(
        "http://gpu-pc:8765/shutdown",
        headers={"Authorization": "Bearer your-secret-token"}
    )
```
## Step 4: Smart Pre-checks
The key optimization: don't wake the GPU if there's nothing to do.
I track the last successful index timestamp and compare file modification times:
```python
def check_for_new_files():
    last_index = load_last_index_time()
    all_files = list(vault.glob("*.md"))
    new_files = [f for f in all_files
                 if f.stat().st_mtime > last_index]
    return len(new_files) > 0, len(new_files), len(all_files)
```
This means my GPU PC only wakes when there are actual changes to process.
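`load_last_index_time` isn't shown above; a minimal sketch of the persistence side, assuming the timestamp lives in a small JSON state file (the path is a placeholder):

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("/var/lib/recall/last_index.json")  # placeholder location


def load_last_index_time(path=STATE_FILE):
    """Epoch time of the last successful index, or 0.0 if none recorded."""
    try:
        return json.loads(path.read_text())["last_index"]
    except (FileNotFoundError, KeyError, ValueError):
        return 0.0


def save_last_index_time(path=STATE_FILE, ts=None):
    """Record a successful run; call only after the index job completes."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"last_index": ts if ts is not None else time.time()}))
```

Saving only after a successful run matters: if the job crashes mid-index, the stale timestamp makes the next run pick those files up again.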
## Beyond Embeddings: ComfyUI for Image Generation
Once I had on-demand GPU access working for embeddings, I realized the same pattern works for image and video generation.
I run ComfyUI on the GPU PC alongside Ollama. When my AI assistant needs to generate an image, it:
- Wakes the GPU PC
- Waits for ComfyUI to be ready (port 8188)
- Submits the workflow via API
- Retrieves the generated image
- Shuts down (or leaves running if more requests are expected)
### ComfyUI Client
```python
import random
import time

import requests

class ComfyUIClient:
    def __init__(self, host="gpu-pc", port=8188):
        self.url = f"http://{host}:{port}"

    def generate_image(self, prompt, workflow="sdxl-txt2img"):
        # Load and customize workflow
        workflow_data = load_workflow(workflow)
        workflow_data["6"]["inputs"]["text"] = prompt
        workflow_data["3"]["inputs"]["seed"] = random.randint(0, 2**32)

        # Submit to ComfyUI
        resp = requests.post(f"{self.url}/prompt",
                             json={"prompt": workflow_data})
        prompt_id = resp.json()["prompt_id"]

        # Poll for completion
        while True:
            history = requests.get(f"{self.url}/history/{prompt_id}").json()
            if prompt_id in history:
                break
            time.sleep(0.5)

        # Fetch and return the image
        image_info = history[prompt_id]["outputs"]["9"]["images"][0]
        return self.download_image(image_info)
```
### Video Generation with Wan 2.1
The RTX 5090 has enough VRAM (32GB) to run Wan 2.1 for text-to-video:
```python
def generate_video(self, prompt):
    workflow = load_workflow("wan-t2v-mp4")
    workflow["4"]["inputs"]["text"] = prompt
    # Video generation takes ~2.5 minutes
    resp = self.submit_and_wait(workflow, timeout=300)
    return self.download_video(resp)
```
| Task | Time | Notes |
|---|---|---|
| SDXL image | ~4s | 1024x1024 |
| Video (33 frames) | ~2.5 min | 832x480, 20 steps |
The power-on latency (~60s boot + ~30s ComfyUI load) is acceptable for creative tasks where I'm not in a hurry.
## Results
| Metric | Before | After |
|---|---|---|
| Full reindex time | 20+ hours | 5 minutes |
| Image generation | Cloud API costs | Free (local) |
| GPU PC uptime | 24/7 | ~10-30 min/day |
| Monthly power cost | ~$25 | ~$2-3 |
| Heat output | Constant | Minimal |
## Gotchas and Lessons
1. **WoL requires BIOS setup.** Enable "Wake on LAN" in your BIOS/UEFI, and also enable it in your OS network settings (on Linux, `ethtool -s eth0 wol g`).
2. **WoL doesn't cross subnets.** Magic packets are Layer 2 broadcasts. If your devices are on different VLANs, you need a relay or directed broadcast.
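For the cross-subnet case, a directed broadcast means aiming the magic packet at the remote subnet's broadcast address instead of `255.255.255.255` (your router must be configured to forward directed broadcasts, which is often disabled by default). Computing that address with the standard library, using a made-up `192.168.10.0/24` GPU subnet as an example:

```python
import ipaddress


def directed_broadcast(subnet: str) -> str:
    """Broadcast address of a subnet, for subnet-directed WoL."""
    return str(ipaddress.ip_network(subnet, strict=False).broadcast_address)
```

With the `wakeonlan` package this plugs in as `send_magic_packet(mac, ip_address=directed_broadcast("192.168.10.0/24"))`.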
3. **Have a fallback.** If the GPU doesn't wake, fall back gracefully:

```python
if not wait_for_gpu():
    print("GPU unavailable, using cloud API fallback")
    return cloud_client.generate(prompt)
```
4. **Delay the shutdown.** Add a delay before shutdown if multiple jobs might come in quick succession. I use a 5-minute idle timer before auto-shutdown.
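One way to implement that idle window is a resettable `threading.Timer`: each finished job cancels any pending shutdown and re-arms the countdown, so the PC powers off only after a full quiet period. A sketch, where `shutdown_fn` would be something like an authenticated POST to the shutdown server (name is a placeholder):

```python
import threading


class IdleShutdown:
    """Re-armable countdown: shutdown fires only after `delay` idle seconds."""

    def __init__(self, delay, shutdown_fn):
        self.delay = delay
        self.shutdown_fn = shutdown_fn
        self._timer = None
        self._lock = threading.Lock()

    def touch(self):
        """Call at the end of every job; resets the countdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.shutdown_fn)
            self._timer.daemon = True
            self._timer.start()
```

Usage: create `IdleShutdown(300, request_gpu_shutdown)` once, then call `touch()` after each embedding or image job completes.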
## When to Use This Pattern
This approach works well when:
- You have occasional, bursty ML/AI workloads
- Your GPU PC is power-hungry (gaming rigs, workstations)
- You care about power costs or noise/heat
- Jobs can tolerate 1-2 minute startup latency
It's probably overkill if:
- You run ML jobs continuously
- Your GPU is low-power (integrated, cloud instance)
- Startup latency is unacceptable
## What's Next
I'm considering:
- Request queuing to batch multiple jobs before wake
- Predictive wake based on usage patterns
- Remote access via Tailscale for off-network use
This pattern powers my Recall knowledge system and local image generation via ComfyUI.
