
Copy.Fail: When the Kernel Trusts Too Much



Sometimes you hit a vulnerability that isn’t just “bad”.

It’s clean.

Not elegant. Not pretty.

But clean in the way it slices straight through assumptions we’ve quietly depended on for years.

CVE-2026-31431 is one of those.


The shape of the problem

At a high level:

A logic flaw in the Linux kernel enables arbitrary page cache writes from an unprivileged context.

That sentence alone should make you pause.

Because page cache writes mean:

  • modifying files without write permissions
  • altering binaries owned by root
  • bypassing normal filesystem integrity expectations

Now layer in:

  • no race condition
  • no offset dependency
  • works inside containers
  • tiny, reliable primitive

…and this stops being “just another LPE”.

This becomes a primitive.


The chain

This vulnerability isn’t a single mistake.

It’s a chain of perfectly valid components:

  • authenc (crypto subsystem)
  • AF_ALG (crypto via sockets)
  • splice() (zero-copy memory transfer)

Individually: fine.

Together:

user-controlled data ends up in the page cache of arbitrary files.
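The plumbing of that chain can be probed safely. The sketch below is an assumption-laden illustration, not the exploit: it only checks whether an unprivileged process can instantiate the authenc template through algif_aead by binding an AF_ALG socket. The transform name is the standard `authenc(hmac(sha256),cbc(aes))` template; no data is sent and no splice() is performed.

```python
import socket

def authenc_reachable() -> str:
    """Probe whether the AF_ALG -> algif_aead -> authenc path is reachable.

    This only binds the socket; it performs no crypto and no splice().
    """
    try:
        s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError as e:
        return f"socket: blocked ({e})"
    try:
        # Binding ("aead", name) loads algif_aead and instantiates
        # the named authenc template inside the kernel crypto API.
        s.bind(("aead", "authenc(hmac(sha256),cbc(aes))"))
        return "bind: allowed (authenc reachable via algif_aead)"
    except OSError as e:
        return f"bind: failed ({e})"
    finally:
        s.close()

print(authenc_reachable())
```

If the bind succeeds, the first two links of the chain are reachable from your context; that is the condition the mitigations below are designed to break.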


The trust boundary failure

We think in layers:

flowchart LR
    A["Container Process"] --> B["Kernel"]
    B --> C["Filesystem"]

We assume:

  • kernel enforces isolation
  • filesystem enforces permissions

Reality here:

flowchart LR
    A["Container"] --> B["AF_ALG"]
    B --> C["authenc flaw"]
    C --> D["splice()"]
    D --> E["Page Cache"]
    E --> F["Host Files Modified"]

The kernel becomes the confused deputy.


Containers don’t save you

Typical container:

  • non-root
  • restricted capabilities
  • shared kernel

Observed:

uid/euid: 1000 1000
AF_ALG: allowed
modules: authenc, algif_aead

That’s default behaviour.

Which means:

An unprivileged container can reach a vulnerable kernel path affecting host files.


Testing safely

Python check

import os, socket, platform

print("uid/euid:", os.getuid(), os.geteuid())
print("kernel:", platform.release())
print("machine:", platform.machine())

try:
    s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0)
    print("AF_ALG: allowed")
    s.close()
except Exception as e:
    print("AF_ALG: blocked/unavailable:", repr(e))

print("\nmodules:")
os.system("grep -E 'algif|authenc|aead' /proc/modules || true")

Docker harness

sudo docker run -it --rm \
  --user 1000:1000 \
  -v "$PWD":/app \
  -w /app \
  python:3 \
  python3 safe-check.py

Fast mitigation: seccomp

Create profile

cat > seccomp-block-af_alg.json <<'JSON'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_AARCH64",
    "SCMP_ARCH_ARM",
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86"
  ],
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "args": [
        {
          "index": 0,
          "value": 38,
          "op": "SCMP_CMP_EQ"
        }
      ]
    }
  ]
}
JSON
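The `"value": 38` in the rule is not arbitrary: it is the numeric value of the `AF_ALG` address family, which arrives as the first argument (`index: 0`) to `socket(2)`. You can confirm the constant from Python before rolling the profile out:

```python
import socket

# AF_ALG is address family 38 on Linux (include/linux/socket.h),
# which is what the seccomp rule compares the first socket() argument against.
print(socket.AF_ALG)  # prints 38 on Linux
```

With the rule in place, any `socket(AF_ALG, ...)` call returns an error instead of reaching the kernel crypto API, while every other address family is untouched.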

Run with it

sudo docker run -it --rm \
  --user 1000:1000 \
  -v "$PWD":/app \
  -w /app \
  --security-opt seccomp="$PWD/seccomp-block-af_alg.json" \
  python:3 \
  python3 safe-check.py

Expected:

AF_ALG: blocked/unavailable

Why this matters

This is a perfect example of:

Remove the entry point → kill the exploit path

No reboot. No patch yet.

Just removing reachability.


But don’t stop there

Seccomp is containment, not remediation.

You still need:

  • kernel patching
  • module review (algif_aead, authenc)
  • runtime hardening
  • removal of privileged containers

The pattern (this is the real lesson)

This isn’t about crypto.

It’s about trusted component chaining.

Same pattern shows up everywhere:

  • identity systems trusting client signals
  • APIs exposing control planes
  • SaaS platforms enabling unintended flows
  • kernels assuming “safe paths”

Different layer.

Same failure.


TL;DR

  • Arbitrary page cache write primitive
  • Works from unprivileged containers
  • Default environments exposed
  • Seccomp provides fast mitigation
  • Kernel patch required

Final thought

We talk about “container escape”.

But more often than not:

The container never contained anything
because the boundary wasn’t as strong as we thought


Kubernetes seccomp example

For Kubernetes, the same mitigation can be applied by shipping the seccomp profile to each node and referencing it from the pod security context.

Example profile path on the node:

/var/lib/kubelet/seccomp/profiles/seccomp-block-af_alg.json

Example pod:

apiVersion: v1
kind: Pod
metadata:
  name: copyfail-check
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/seccomp-block-af_alg.json
  containers:
    - name: check
      image: python:3
      command: ["python3", "/app/safe-check.py"]
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
      volumeMounts:
        - name: app
          mountPath: /app
  volumes:
    - name: app
      configMap:
        name: copyfail-check-script
  restartPolicy: Never

And the script as a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: copyfail-check-script
data:
  safe-check.py: |
    import os, socket, platform

    print("uid/euid:", os.getuid(), os.geteuid())
    print("kernel:", platform.release())
    print("machine:", platform.machine())

    try:
        s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0)
        print("AF_ALG: allowed")
        s.close()
    except Exception as e:
        print("AF_ALG: blocked/unavailable:", repr(e))

    print("\nmodules:")
    os.system("grep -E 'algif|authenc|aead' /proc/modules || true")

Apply and check logs:

kubectl apply -f copyfail-check-configmap.yaml
kubectl apply -f copyfail-check-pod.yaml
kubectl logs pod/copyfail-check

Expected mitigated result:

AF_ALG: blocked/unavailable

Important operational note: Localhost seccomp profiles are node-local. That means the JSON profile must exist on every node that may schedule the workload. In production, you would normally distribute it with your node image, bootstrap process, DaemonSet, or node management tooling.

Also, this protects workloads that actually use the profile. It will not protect privileged pods, host processes, or pods scheduled without the seccomp profile unless you enforce it through admission control or policy.


Enforcing seccomp with policy (Pod Security Admission / Kyverno)

You can move from “best effort” to enforced by requiring a seccomp profile at admission time.

Option 1 – Pod Security Admission (PSA)

Use the restricted profile (Kubernetes v1.25+), which requires seccomp:

apiVersion: v1
kind: Namespace
metadata:
  name: workloads-secure
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

This doesn’t force your specific profile, but it forces the use of seccomp (no more “unset”).


Option 2 – Kyverno (enforce your exact profile)

Require your AF_ALG-blocking profile on all pods:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-seccomp-af-alg-block
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-seccomp-profile
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Pods must use the AF_ALG-blocking seccomp profile"
        pattern:
          spec:
            securityContext:
              seccompProfile:
                type: "Localhost"
                localhostProfile: "profiles/seccomp-block-af_alg.json"

You can extend this to also block privileged: true and require allowPrivilegeEscalation: false.


Distributing the seccomp profile (DaemonSet)

Because Localhost profiles are node-local, you need to ensure the JSON exists on every node.

A simple approach is a DaemonSet that writes the profile file to the kubelet seccomp directory.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-profile-distributor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: seccomp-profile-distributor
  template:
    metadata:
      labels:
        app: seccomp-profile-distributor
    spec:
      hostPID: true
      containers:
        - name: installer
          image: busybox
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              mkdir -p /host/seccomp/profiles
              cat <<'EOF' > /host/seccomp/profiles/seccomp-block-af_alg.json
              {
                "defaultAction": "SCMP_ACT_ALLOW",
                "architectures": [
                  "SCMP_ARCH_AARCH64",
                  "SCMP_ARCH_ARM",
                  "SCMP_ARCH_X86_64",
                  "SCMP_ARCH_X86"
                ],
                "syscalls": [
                  {
                    "names": ["socket"],
                    "action": "SCMP_ACT_ERRNO",
                    "args": [
                      {
                        "index": 0,
                        "value": 38,
                        "op": "SCMP_CMP_EQ"
                      }
                    ]
                  }
                ]
              }
              EOF
              # busybox sleep does not support "infinity"; loop instead
              while true; do sleep 3600; done
          volumeMounts:
            - name: seccomp
              mountPath: /host/seccomp
      volumes:
        - name: seccomp
          hostPath:
            path: /var/lib/kubelet/seccomp
            type: DirectoryOrCreate

This ensures every node has:

/var/lib/kubelet/seccomp/profiles/seccomp-block-af_alg.json

Putting it together

  • DaemonSet → ensures profile exists on every node
  • Kyverno / PSA → ensures pods must use seccomp
  • Pod spec → references your AF_ALG-blocking profile

Result:

The vulnerable kernel path becomes unreachable from workloads, even before patching.


Final note

This is the kind of control that scales:

  • Fast to deploy
  • Low disruption
  • High impact

But still:

Patch the kernel.

Always patch the kernel.
