Comparison · with receipts

We ran GLM and Claude Opus 4.8 through 9 coding tasks. The frontier model was the cheapest.

Everyone quotes price-per-token. So we ran 9 verifiable coding tasks through Opus 4.8, GLM-4.6, and GLM-5.2 — on cerver — and tested every answer by actually running the code. The result is the opposite of what the sticker price predicts.

"This model is 10× cheaper per token" is the most repeated — and most misleading — line in AI right now. Per-token price is an input. What you actually pay is tokens × price × number-of-tries-to-get-it-right. The only way to know that number is to run real work and measure it. So we did.

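To make that arithmetic concrete, here is a minimal sketch of the formula. The function name, the per-million-token pricing convention, and every number in it are illustrative assumptions, not measurements from this comparison.

```python
def effective_cost(tokens_per_try: int, usd_per_million_tokens: float, tries: float = 1.0) -> float:
    """tokens x price x tries, with price quoted per million tokens (assumed convention)."""
    return tokens_per_try * usd_per_million_tokens / 1_000_000 * tries


# Hypothetical models: B is 10x cheaper per token but emits 20x the tokens.
model_a = effective_cost(tokens_per_try=150, usd_per_million_tokens=20.0)
model_b = effective_cost(tokens_per_try=3000, usd_per_million_tokens=2.0)

print(f"A: ${model_a:.4f}  B: ${model_b:.4f}")
```

With these made-up inputs, the model that is 10× cheaper per token ends up costing twice as much per task.
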
How we ran it

Nine coding tasks, each with a definite correct answer we could check automatically — fix a buggy binary search, merge intervals, write an IPv4 regex, fix a Go data race, add N business days, a non-mutating deep merge. We sent the same prompt to all three models through cerver, saved each answer plus its cost and latency, then extracted the code and ran it against a fixed test suite (Python, SQLite, Go). A task passes only if every case passes. The judge is code execution — not opinion — so it doesn't matter that one contestant happens to be the model writing this.
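
The harness itself is not shown in this post. The following is a minimal sketch of the extract-and-run step for the Python tasks, assuming each answer is markdown containing one fenced block. `extract_code`, `run_cases`, and `SAMPLE_ANSWER` are hypothetical names, and a real harness would run untrusted code in a sandboxed subprocess rather than with `exec`.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def extract_code(answer: str) -> str:
    """Return the body of the first fenced code block, or '' if there is none."""
    pattern = re.compile(FENCE + r"[a-zA-Z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(answer)
    return match.group(1) if match else ""


def run_cases(code: str, func_name: str, cases) -> bool:
    """Pass only if every (args, expected) case matches; any error is a failure."""
    if not code.strip():
        return False  # an empty answer counts as a fail
    namespace = {}
    try:
        exec(code, namespace)  # assumption: trusted input; sandbox this in practice
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


SAMPLE_ANSWER = (
    "Here is the fix:\n" + FENCE + "python\n"
    "def add(a, b):\n    return a + b\n" + FENCE + "\n"
)
print(run_cases(extract_code(SAMPLE_ANSWER), "add", [((1, 2), 3)]))
```

The all-or-nothing return value mirrors the rule above: one failing case, an exception, or an empty answer fails the whole task.
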

| Model | Provider · model ID | Correct | Cost | Time | $/correct |
|---|---|---|---|---|---|
| Opus 4.8 | Anthropic · claude-opus-4-8 | 9/9 | $0.0211 | 29s | $0.0023 |
| GLM-4.6 | Z.ai · glm-4.6 | 9/9 | $0.0521 | 445s | $0.0058 |
| GLM-5.2 | Z.ai · glm-5.2 | 7/9 | $0.0494 | 377s | $0.0071 |

The frontier model won on every axis

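Opus 4.8 tied GLM-4.6 for most correct (9/9), was the cheapest in total (about 40% of GLM-4.6's cost), and was roughly 13–15× faster overall (29s against 377s and 445s). It did not win every individual task on price: GLM-5.2 was cheaper on the binary search, interval merge, Go race, and median tasks, and GLM-4.6 on the Go race. How does the "expensive" model end up cheapest overall? The Z.ai models do heavy hidden reasoning, so across the nine tasks they emitted about 20× more tokens (roughly 22,000–23,000 against about 1,150 for Opus, per the receipts). Cheap tokens × a huge pile of them ≥ the frontier price. And GLM-5.2 twice looped all the way to its token cap and returned nothing at all, while still billing about $0.018 each time.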

| Task | Opus 4.8 | GLM-4.6 | GLM-5.2 |
|---|---|---|---|
| Fix a buggy binary search | PASS · $0.0024 · 3s | PASS · $0.0024 · 16s | PASS · $0.0008 · 8s |
| Merge overlapping intervals | PASS · $0.0026 · 5s | PASS · $0.0034 · 26s | PASS · $0.0009 · 11s |
| IPv4-address regex | PASS · $0.0022 · 3s | PASS · $0.0058 · 47s | PASS · $0.0040 · 40s |
| 2nd-highest salary (SQL) | PASS · $0.0012 · 2s | PASS · $0.0033 · 32s | PASS · $0.0025 · 22s |
| Fix a Go data race | PASS · $0.0030 · 3s | PASS · $0.0016 · 14s | PASS · $0.0012 · 12s |
| Median of two sorted lists | PASS · $0.0040 · 4s | PASS · $0.0157 · 141s | PASS · $0.0029 · 25s |
| Add N business days | PASS · $0.0028 · 4s | PASS · $0.0043 · 34s | FAIL · $0.0177 · 116s |
| Quoted-string regex (escapes) | PASS · $0.0010 · 2s | PASS · $0.0024 · 20s | PASS · $0.0018 · 19s |
| Non-mutating deep dict merge | PASS · $0.0020 · 3s | PASS · $0.0132 · 116s | FAIL · $0.0176 · 124s |
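
The headline totals can be re-derived from the per-task figures. This sketch copies the cost, time, and pass/fail values from the table above; `RESULTS` and `summarize` are names chosen here for illustration.

```python
# (cost in USD, seconds, passed) for each of the nine tasks, in table order.
RESULTS = {
    "Opus 4.8": [(0.0024, 3, True), (0.0026, 5, True), (0.0022, 3, True),
                 (0.0012, 2, True), (0.0030, 3, True), (0.0040, 4, True),
                 (0.0028, 4, True), (0.0010, 2, True), (0.0020, 3, True)],
    "GLM-4.6": [(0.0024, 16, True), (0.0034, 26, True), (0.0058, 47, True),
                (0.0033, 32, True), (0.0016, 14, True), (0.0157, 141, True),
                (0.0043, 34, True), (0.0024, 20, True), (0.0132, 116, True)],
    "GLM-5.2": [(0.0008, 8, True), (0.0009, 11, True), (0.0040, 40, True),
                (0.0025, 22, True), (0.0012, 12, True), (0.0029, 25, True),
                (0.0177, 116, False), (0.0018, 19, True), (0.0176, 124, False)],
}


def summarize(rows):
    """Total cost, total time, pass count, and cost per passing task."""
    total_cost = sum(cost for cost, _, _ in rows)
    total_time = sum(seconds for _, seconds, _ in rows)
    passed = sum(1 for _, _, ok in rows if ok)
    return {
        "passed": passed,
        "cost": total_cost,
        "seconds": total_time,
        "cost_per_correct": total_cost / passed,
    }


SUMMARY = {model: summarize(rows) for model, rows in RESULTS.items()}
for model, s in SUMMARY.items():
    print(f"{model}: {s['passed']}/9  ${s['cost']:.4f}  {s['seconds']}s  "
          f"${s['cost_per_correct']:.4f}/correct")
```

The sums agree with the summary figures to within rounding (Opus adds up to $0.0212 rather than $0.0211, and GLM-4.6 to 446s rather than 445s), presumably because the per-task values shown are themselves rounded. Note that failed runs still count toward total cost, which is why GLM-5.2's cost per correct answer is the highest.
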

This isn't "GLM is bad." GLM-4.6 matched Opus on correctness, and the models are genuinely capable. The point is narrower and more useful: per-token price tells you almost nothing about per-task cost. You have to measure it on your own tasks — which is the whole reason to run it through one neutral layer.

The receipts

Every prompt and every raw answer is below. Costs are cerver's per-token estimates; the gap is driven by token volume, not rate.

Fix a buggy binary search (`binsearch_bugfix`)
Prompt
This Python binary search is buggy and returns wrong indices for some inputs. Return the corrected function `def bsearch(arr, target)` that returns the index of target or -1.

def bsearch(arr, target):
    lo, hi = 0, len(arr)
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target: return mid
        elif arr[mid] < target: lo = mid
        else: hi = mid
    return -1
Opus 4.8 · PASS
$0.0024 · 3s · 121 tok
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
GLM-4.6 · PASS
$0.0024 · 16s · 1068 tok
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
GLM-5.2 · PASS
$0.0008 · 8s · 335 tok
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
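The actual test cases are not published here, so the following is only a sketch of the kind of check a suite for this task might run against the corrected function. The cases are assumptions chosen to hit the two original bugs (the off-by-one upper bound and the non-advancing `lo`/`hi`).

```python
def bsearch(arr, target):
    """Corrected binary search, as returned by all three models."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


# Hypothetical cases: (array, target, expected index).
CASES = [
    ([1, 3, 5, 7, 9], 7, 3),
    ([1, 3, 5, 7, 9], 1, 0),   # first element
    ([1, 3, 5, 7, 9], 9, 4),   # last element
    ([1, 3, 5, 7, 9], 4, -1),  # absent value between elements
    ([1, 3, 5, 7, 9], 10, -1), # larger than everything (indexes past the end in the buggy version)
    ([], 1, -1),               # empty input
    ([5], 5, 0),
]

ALL_PASS = all(bsearch(arr, target) == expected for arr, target, expected in CASES)
print(ALL_PASS)
```
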
Merge overlapping intervals (`merge_intervals`)
Prompt
Implement `def merge_intervals(intervals)` that merges overlapping/touching integer intervals (list of [start,end]) and returns a sorted list of merged [start,end]. Handle empty input. Touching intervals like [1,2] and [2,3] merge to [1,3].
Opus 4.8 · PASS
$0.0026 · 5s · 150 tok
```python
def merge_intervals(intervals):
    if not intervals:
        return []
    intervals = sorted(intervals, key=lambda x: (x[0], x[1]))
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```
GLM-4.6 · PASS
$0.0034 · 26s · 1504 tok
```python
def merge_intervals(intervals):
    if not intervals:
        return []

    # Sort intervals based on the start value
    intervals.sort(key=lambda x: x[0])

    merged = [intervals[0]]

    for current_start, current_end in intervals[1:]:
        last_start, last_end = merged[-1]

        # If the current interval overlaps or touches the previous one
        if current_start <= last_end:
            # Merge them by updating the end of the last interval
            merged[-1][1] = max(last_end, current_end)
        else:
            # Add the current interval as a new entry
            merged.append([current_start, current_end])

    return merged
```
GLM-5.2 · PASS
$0.0009 · 11s · 389 tok
```python
def merge_intervals(intervals):
    if not intervals:
        return []
    sorted_intervals = sorted(intervals, key=lambda x: x[0])
    merged = [list(sorted_intervals[0])]
    for start, end in sorted_intervals[1:]:
        last_end = merged[-1][1]
        if start <= last_end:
            merged[-1][1] = max(last_end, end)
        else:
            merged.append([start, end])
    return merged
```
IPv4-address regex (`ipv4_regex`)
Prompt
Provide a single Python statement `PATTERN = r'...'` whose regex fully matches a valid IPv4 address: four dot-separated octets, each 0-255, no leading zeros (except a lone 0). Use ^...$ anchors. Output only the assignment line in a code block.
Opus 4.8 · PASS
$0.0022 · 3s · 120 tok
```python
PATTERN = r'^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$'
```
GLM-4.6 · PASS
$0.0058 · 47s · 2609 tok
```python
PATTERN = r'^(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])){3}$'
```
GLM-5.2 · PASS
$0.0040 · 40s · 1779 tok
```python
PATTERN = r'^(?:(?:0|[1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:0|[1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'
```
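A sketch of how a pattern like these could be checked, using the Opus 4.8 answer as the sample. The valid and invalid strings are assumptions, not the suite's actual cases. One detail worth knowing when writing such a check: with `re.match`, `$` also matches just before a trailing newline, so `re.fullmatch` is the stricter test of "fully matches".

```python
import re

PATTERN = r'^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$'

# Hypothetical test data.
VALID = ["0.0.0.0", "255.255.255.255", "192.168.1.1", "10.0.0.1"]
INVALID = ["256.1.1.1", "01.2.3.4", "1.2.3", "1.2.3.4.5", "1.2.3.4\n", ""]

valid_ok = all(re.fullmatch(PATTERN, s) is not None for s in VALID)
invalid_ok = all(re.fullmatch(PATTERN, s) is None for s in INVALID)
print(valid_ok, invalid_ok)

# re.match is looser: '$' accepts the position before a final newline.
loose_match = re.match(PATTERN, "1.2.3.4\n") is not None
print(loose_match)
```
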
2nd-highest salary (SQL) (`nth_salary_sql`)
Prompt
Table employees(id INTEGER, name TEXT, salary INTEGER). Write ONE standard SQL query (SQLite-compatible) returning a single column `second` = the second-highest DISTINCT salary, or NULL if it doesn't exist. Output only the query.
Opus 4.8 · PASS
$0.0012 · 2s · 54 tok
```sql
SELECT MAX(salary) AS second
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
```
GLM-4.6 · PASS
$0.0033 · 32s · 1497 tok
```sql
SELECT MAX(salary) AS second
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
```
GLM-5.2 · PASS
$0.0025 · 22s · 1094 tok
```sql
SELECT MAX(salary) AS second FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
```
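All three models returned the same query, and it can be exercised with Python's built-in `sqlite3` module against an in-memory database. This is a sketch: the helper `second_highest` and the sample salary lists are assumptions, not the original test fixture.

```python
import sqlite3

QUERY = (
    "SELECT MAX(salary) AS second FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees);"
)


def second_highest(salaries):
    """Load the salaries into a fresh in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees(id INTEGER, name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees(id, name, salary) VALUES (?, ?, ?)",
        [(i, f"emp{i}", s) for i, s in enumerate(salaries)],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row[0]  # None when SQLite returns NULL


print(second_highest([100, 200, 300]))  # distinct salaries
print(second_highest([300, 300, 200]))  # duplicate top salary
print(second_highest([100]))            # no second salary exists
```

The aggregate form returns exactly one row even when no second salary exists, which is what lets it satisfy the "or NULL" requirement without a special case.
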
Fix a Go data race (`go_race_fix`)
Prompt
This Go code has a data race on the counter. Return a corrected, compilable version of the whole program that produces 1000 deterministically, using sync primitives.

package main
import ("fmt";"sync")
func main(){
 c:=0
 var wg sync.WaitGroup
 for i:=0;i<1000;i++{ wg.Add(1); go func(){ c++; wg.Done() }() }
 wg.Wait()
 fmt.Println(c)
}
Opus 4.8 · PASS
$0.0030 · 3s · 160 tok
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	c := 0
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			c++
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c)
}
```
GLM-4.6 · PASS
$0.0016 · 14s · 713 tok
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var c int
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			c++
			mu.Unlock()
		}()
	}

	wg.Wait()
	fmt.Println(c)
}
```
GLM-5.2 · PASS
$0.0012 · 12s · 505 tok
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	c := 0
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			c++
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c)
}
```
Median of two sorted lists (`median_two_sorted`)
Prompt
Implement `def median(a, b)` returning the median (float) of two already-sorted integer lists a and b combined. Handle even/odd total length and the case where one list is empty. Do not mutate inputs.
Opus 4.8 · PASS
$0.0040 · 4s · 244 tok
```python
def median(a, b):
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    while i < len(a):
        merged.append(a[i])
        i += 1
    while j < len(b):
        merged.append(b[j])
        j += 1

    n = len(merged)
    if n == 0:
        raise ValueError("median of empty sequence is undefined")

    mid = n // 2
    if n % 2 == 1:
        return float(merged[mid])
    return (merged[mid - 1] + merged[mid]) / 2.0
```
GLM-4.6 · PASS
$0.0157 · 141s · 7102 tok
```python
def median(a, b):
    # Ensure a is the smaller list to optimize the binary search space
    if len(a) > len(b):
        a, b = b, a
        
    m, n = len(a), len(b)
    low, high = 0, m
    
    while low <= high:
        partition_a = (low + high) // 2
        partition_b = (m + n + 1) // 2 - partition_a
        
        # Handle edge cases where partition is at the start or end of a list
        max_left_a = float('-inf') if partition_a == 0 else a[partition_a - 1]
        min_right_a = float('inf') if partition_a == m else a[partition_a]
        
        max_left_b = float('-inf') if partition_b == 0 else b[partition_b - 1]
        min_right_b = float('inf') if partition_b == n else b[partition_b]
        
        # Check if we have found the correct partition
        if max_left_a <= min_right_b and max_left_b <= min_right_a:
            if (m + n) % 2 == 0:
                # Even total length: average of max left and min right
                return (max(max_left_a, max_left_b) + min(min_right_a, min_right_b)) / 2.0
            else:
                # Odd total length: max of left partition
                return float(max(max_left_a, max_left_b))
        elif max_left_a > min_right_b:
            # We are too far right in list a, move left
            high = partition_a - 1
        else:
            # We are too far left in list a, move right
            low = partition_a + 1
            
    raise ValueError("Input lists are not sorted.")
```
GLM-5.2 · PASS
$0.0029 · 25s · 1301 tok
```python
def median(a, b):
    n, m = len(a), len(b)
    total = n + m
    if total == 0:
        return 0.0
    mid = total // 2
    i = j = 0
    prev = curr = 0
    for _ in range(mid + 1):
        prev = curr
        if i < n and (j >= m or a[i] <= b[j]):
            curr = a[i]
            i += 1
        else:
            curr = b[j]
            j += 1
    if total % 2 == 1:
        return float(curr)
    else:
        return (prev + curr) / 2.0
```
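The three answers use three different algorithms, so one way to check any of them is to compare against a trusted reference rather than hand-computed values. The sketch below uses `statistics.median` from the standard library as that reference and also confirms the inputs are left untouched. The candidate is a two-pointer walk in the style of the GLM-5.2 answer, and the cases are assumptions that deliberately avoid the both-empty input, which the prompt leaves unspecified.

```python
import statistics


def reference_median(a, b):
    """Trusted but slow reference: sort the concatenation and use the stdlib."""
    return float(statistics.median(sorted(a + b)))


def median(a, b):
    """Candidate under test: walk both lists up to the middle element."""
    n, m = len(a), len(b)
    total = n + m
    mid = total // 2
    i = j = 0
    prev = curr = 0
    for _ in range(mid + 1):
        prev = curr
        if i < n and (j >= m or a[i] <= b[j]):
            curr = a[i]
            i += 1
        else:
            curr = b[j]
            j += 1
    return float(curr) if total % 2 == 1 else (prev + curr) / 2.0


# Hypothetical cases; at least one list is non-empty in each.
CASES = [([1, 3], [2]), ([1, 2], [3, 4]), ([], [1]), ([5], []),
         ([1, 1], [1, 1]), ([1, 2, 3], [10, 20, 30, 40])]


def agrees_with_reference(impl):
    for a, b in CASES:
        a_before, b_before = list(a), list(b)
        if impl(a, b) != reference_median(a, b):
            return False
        if a != a_before or b != b_before:  # inputs must not be mutated
            return False
    return True


print(agrees_with_reference(median))
```
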
Add N business days (`business_days`)
Prompt
Implement `def add_business_days(date_str, n)` where date_str is 'YYYY-MM-DD' and n>=0. Return the date 'YYYY-MM-DD' that is n business days after date_str, skipping Saturdays and Sundays. n=0 returns the same date if it is a weekday, otherwise the next weekday. Use the datetime module.
Opus 4.8 · PASS
$0.0028 · 4s · 152 tok
```python
from datetime import datetime, timedelta


def add_business_days(date_str, n):
    date = datetime.strptime(date_str, '%Y-%m-%d')
    while date.weekday() >= 5:
        date += timedelta(days=1)
    count = 0
    while count < n:
        date += timedelta(days=1)
        if date.weekday() < 5:
            count += 1
    return date.strftime('%Y-%m-%d')
```
GLM-4.6 · PASS
$0.0043 · 34s · 1929 tok
```python
from datetime import datetime, timedelta

def add_business_days(date_str, n):
    # Parse the input string into a date object
    current_date = datetime.strptime(date_str, '%Y-%m-%d').date()
    
    if n == 0:
        # If it's a weekday (0-4), return the same date
        # If Saturday (5), add 2 days to get to Monday
        # If Sunday (6), add 1 day to get to Monday
        weekday = current_date.weekday()
        if weekday < 5:
            return date_str
        elif weekday == 5:
            return (current_date + timedelta(days=2)).strftime('%Y-%m-%d')
        else: # weekday == 6
            return (current_date + timedelta(days=1)).strftime('%Y-%m-%d')
    
    # For n > 0, advance day by day and count only weekdays
    added_days = 0
    while added_days < n:
        current_date += timedelta(days=1)
        if current_date.weekday() < 5: # Monday is 0, Friday is 4
            added_days += 1
            
    return current_date.strftime('%Y-%m-%d')
```
GLM-5.2 · FAIL
$0.0177 · 116s · 8000 tok
(no output returned)
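One thing a reader can verify from the two passing answers: they follow different strategies, and they agree on every case the prompt spells out but not on a weekend start with n ≥ 1, which the prompt does not define. The sketch below restates both strategies under hypothetical names (`roll_then_count`, `count_from_start`) with sample dates chosen here; whether the original suite covers the weekend-start case is not stated.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d"


def roll_then_count(date_str, n):
    """Roll a weekend start forward to Monday, then count n weekdays (Opus-style)."""
    date = datetime.strptime(date_str, FMT)
    while date.weekday() >= 5:
        date += timedelta(days=1)
    count = 0
    while count < n:
        date += timedelta(days=1)
        if date.weekday() < 5:
            count += 1
    return date.strftime(FMT)


def count_from_start(date_str, n):
    """Special-case n == 0, otherwise count weekdays after the start (GLM-4.6-style)."""
    date = datetime.strptime(date_str, FMT)
    if n == 0:
        while date.weekday() >= 5:
            date += timedelta(days=1)
        return date.strftime(FMT)
    count = 0
    while count < n:
        date += timedelta(days=1)
        if date.weekday() < 5:
            count += 1
    return date.strftime(FMT)


# 2024-01-01 is a Monday, so 2024-01-05 is a Friday and 2024-01-06 a Saturday.
SPECIFIED = [("2024-01-05", 1), ("2024-01-03", 0), ("2024-01-06", 0), ("2024-01-01", 5)]
for date_str, n in SPECIFIED:
    print(date_str, n, roll_then_count(date_str, n), count_from_start(date_str, n))

# Unspecified by the prompt: Saturday start with n = 1.
print(roll_then_count("2024-01-06", 1), count_from_start("2024-01-06", 1))
```

On the Saturday start, the first strategy lands on Tuesday and the second on Monday, so a test suite has to pick one reading or leave the case out.
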
Quoted-string regex (escapes) (`escaped_quote_regex`)
Prompt
Provide `PATTERN = r'...'`: a Python regex that fully matches (^...$) a double-quoted string literal that may contain backslash-escaped characters, e.g. matches "a\"b" and "" and "x\\" but NOT "a"b" (an unescaped inner quote). Output only the assignment line in a code block.
Opus 4.8 · PASS
$0.0010 · 2s · 34 tok
```python
PATTERN = r'^"(?:\\.|[^"\\])*"$'
```
GLM-4.6 · PASS
$0.0024 · 20s · 1044 tok
```python
PATTERN = r'^"(?:[^"\\]|\\.)*"$'
```
GLM-5.2 · PASS
$0.0018 · 19s · 807 tok
```python
PATTERN = r'^"([^"\\]|\\.)*"$'
```
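The prompt's own four examples translate directly into a check. This sketch runs them against the Opus 4.8 pattern; the fifth case (a dangling backslash before the closing quote) is an extra assumption added here, not something the prompt lists.

```python
import re

PATTERN = r'^"(?:\\.|[^"\\])*"$'

# Raw strings keep the backslashes exactly as they appear in the prompt.
SHOULD_MATCH = [r'"a\"b"', r'""', r'"x\\"']
SHOULD_NOT_MATCH = [
    r'"a"b"',  # unescaped inner quote, from the prompt
    r'"x\"',   # assumed extra case: the backslash escapes the closing quote
]

match_ok = all(re.fullmatch(PATTERN, s) is not None for s in SHOULD_MATCH)
reject_ok = all(re.fullmatch(PATTERN, s) is None for s in SHOULD_NOT_MATCH)
print(match_ok, reject_ok)
```
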
Non-mutating deep dict merge (`deep_merge`)
Prompt
Implement `def deep_merge(a, b)` that recursively merges two nested dicts and returns a NEW dict; values in b win, nested dicts merge recursively, and neither input is mutated. Non-dict values (including lists) from b replace those in a.
Opus 4.8 · PASS
$0.0020 · 3s · 111 tok
```python
def deep_merge(a, b):
    result = dict(a)
    for key, b_val in b.items():
        if key in result and isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = deep_merge(result[key], b_val)
        else:
            result[key] = b_val
    return result
```
GLM-4.6 · PASS
$0.0132 · 116s · 5996 tok
```python
import copy

def deep_merge(a, b):
    # Create a deep copy of 'a' to avoid mutating the input
    result = copy.deepcopy(a)
    
    for key, b_val in b.items():
        # If the key exists in result and both values are dicts, merge recursively
        if key in result and isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = deep_merge(result[key], b_val)
        else:
            # Otherwise, value from 'b' replaces the value in 'a' (or creates new key)
            # We deepcopy b_val to ensure 'b' is not mutated via the result
            result[key] = copy.deepcopy(b_val)
            
    return result
```
GLM-5.2 · FAIL
$0.0176 · 124s · 8000 tok
(no output returned)
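The "neither input is mutated" requirement is the part of this task that a plain equality test would miss. A minimal sketch of one way to check it: snapshot both inputs with `copy.deepcopy`, call the function, and compare afterwards. The function is the Opus 4.8 answer; the sample dicts are assumptions.

```python
import copy


def deep_merge(a, b):
    """Shallow-copy a, then merge b in, recursing where both sides are dicts."""
    result = dict(a)
    for key, b_val in b.items():
        if key in result and isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = deep_merge(result[key], b_val)
        else:
            result[key] = b_val
    return result


# Hypothetical inputs covering a nested merge and a list replacing a scalar.
a = {"x": {"y": 1, "z": [1, 2]}, "k": 1}
b = {"x": {"y": 2}, "k": [3]}
a_before, b_before = copy.deepcopy(a), copy.deepcopy(b)

merged = deep_merge(a, b)

print(merged == {"x": {"y": 2, "z": [1, 2]}, "k": [3]})  # values from b win
print(a == a_before and b == b_before)                   # inputs unchanged
print(merged is not a and merged["x"] is not a["x"])     # new dicts at merged levels
```

This check only detects mutation during the call. With the shallow-copy approach, values that are carried over untouched (such as the list under `"z"`) are still shared between the result and the input, whereas the GLM-4.6 answer pays for `deepcopy` to avoid that sharing.
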

What to distrust: 9 tasks with objectively checkable answers is a small, coding-only sample — not a verdict on general intelligence. Costs are estimates from published per-token rates. Z.ai reasoning is on by default here; a non-reasoning config could be faster and cheaper, and GLM-5.2 needs a large token budget or it silently returns nothing. One aside worth its own line: GLM-5.2 insisted "Claude Opus 4.8 doesn't exist." It's right that it can't see it — 4.8 is newer than its training. Leaderboards go stale; live evals don't.

Stop guessing which model to use.

Run this exact bake-off — or your own tasks, on your own codebase — across any model through one interface. cerver hands you the transcript, the cost, and the verdict for each.
