| CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
| In the Linux kernel, the following vulnerability has been resolved:
net: hns3: using the num_tqps in the vf driver to apply for resources
Currently, hdev->htqp is allocated using hdev->num_tqps, and kinfo->tqp
is allocated using kinfo->num_tqps. However, kinfo->num_tqps is set to
min(new_tqps, hdev->num_tqps); Therefore, kinfo->num_tqps may be smaller
than hdev->num_tqps, which causes some hdev->htqp[i] to remain
uninitialized in hclgevf_knic_setup().
Thus, this patch allocates hdev->htqp and kinfo->tqp using hdev->num_tqps,
ensuring that the lengths of hdev->htqp and kinfo->tqp are consistent
and that all elements are properly initialized. |
| In the Linux kernel, the following vulnerability has been resolved:
net/sched: ets: Always remove class from active list before deleting in ets_qdisc_change
zdi-disclosures@trendmicro.com says:
The vulnerability is a race condition between `ets_qdisc_dequeue` and
`ets_qdisc_change`. It leads to UAF on `struct Qdisc` object.
Attacker requires the capability to create new user and network namespace
in order to trigger the bug.
See my additional commentary at the end of the analysis.
Analysis:
static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt,
struct netlink_ext_ack *extack)
{
...
// (1) this lock is preventing .change handler (`ets_qdisc_change`)
//to race with .dequeue handler (`ets_qdisc_dequeue`)
sch_tree_lock(sch);
for (i = nbands; i < oldbands; i++) {
if (i >= q->nstrict && q->classes[i].qdisc->q.qlen)
list_del_init(&q->classes[i].alist);
qdisc_purge_queue(q->classes[i].qdisc);
}
WRITE_ONCE(q->nbands, nbands);
for (i = nstrict; i < q->nstrict; i++) {
if (q->classes[i].qdisc->q.qlen) {
// (2) the class is added to the q->active
list_add_tail(&q->classes[i].alist, &q->active);
q->classes[i].deficit = quanta[i];
}
}
WRITE_ONCE(q->nstrict, nstrict);
memcpy(q->prio2band, priomap, sizeof(priomap));
for (i = 0; i < q->nbands; i++)
WRITE_ONCE(q->classes[i].quantum, quanta[i]);
for (i = oldbands; i < q->nbands; i++) {
q->classes[i].qdisc = queues[i];
if (q->classes[i].qdisc != &noop_qdisc)
qdisc_hash_add(q->classes[i].qdisc, true);
}
// (3) the qdisc is unlocked, now dequeue can be called in parallel
// to the rest of .change handler
sch_tree_unlock(sch);
ets_offload_change(sch);
for (i = q->nbands; i < oldbands; i++) {
// (4) we're reducing the refcount for our class's qdisc and
// freeing it
qdisc_put(q->classes[i].qdisc);
// (5) If we call .dequeue between (4) and (5), we will have
// a strong UAF and we can control RIP
q->classes[i].qdisc = NULL;
WRITE_ONCE(q->classes[i].quantum, 0);
q->classes[i].deficit = 0;
gnet_stats_basic_sync_init(&q->classes[i].bstats);
memset(&q->classes[i].qstats, 0, sizeof(q->classes[i].qstats));
}
return 0;
}
Comment:
This happens because some of the classes have their qdiscs assigned to
NULL, but remain in the active list. This commit fixes this issue by always
removing the class from the active list before deleting and freeing its
associated qdisc
Reproducer Steps
(trimmed version of what was sent by zdi-disclosures@trendmicro.com)
```
DEV="${DEV:-lo}"
ROOT_HANDLE="${ROOT_HANDLE:-1:}"
BAND2_HANDLE="${BAND2_HANDLE:-20:}" # child under 1:2
PING_BYTES="${PING_BYTES:-48}"
PING_COUNT="${PING_COUNT:-200000}"
PING_DST="${PING_DST:-127.0.0.1}"
SLOW_TBF_RATE="${SLOW_TBF_RATE:-8bit}"
SLOW_TBF_BURST="${SLOW_TBF_BURST:-100b}"
SLOW_TBF_LAT="${SLOW_TBF_LAT:-1s}"
cleanup() {
tc qdisc del dev "$DEV" root 2>/dev/null
}
trap cleanup EXIT
ip link set "$DEV" up
tc qdisc del dev "$DEV" root 2>/dev/null || true
tc qdisc add dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2
tc qdisc add dev "$DEV" parent 1:2 handle "$BAND2_HANDLE" \
tbf rate "$SLOW_TBF_RATE" burst "$SLOW_TBF_BURST" latency "$SLOW_TBF_LAT"
tc filter add dev "$DEV" parent 1: protocol all prio 1 u32 match u32 0 0 flowid 1:2
tc -s qdisc ls dev $DEV
ping -I "$DEV" -f -c "$PING_COUNT" -s "$PING_BYTES" -W 0.001 "$PING_DST" \
>/dev/null 2>&1 &
tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 0
tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2
tc -s qdisc ls dev $DEV
tc qdisc del dev "$DEV" parent
---truncated--- |
| In the Linux kernel, the following vulnerability has been resolved:
ntfs: set dummy blocksize to read boot_block when mounting
When mounting, sb->s_blocksize is used to read the boot_block without
being defined or validated. Set a dummy blocksize before attempting to
read the boot_block.
The issue can be triggered with the following syz reproducer:
mkdirat(0xffffffffffffff9c, &(0x7f0000000080)='./file1\x00', 0x0)
r4 = openat$nullb(0xffffffffffffff9c, &(0x7f0000000040), 0x121403, 0x0)
ioctl$FS_IOC_SETFLAGS(r4, 0x40081271, &(0x7f0000000980)=0x4000)
mount(&(0x7f0000000140)=@nullb, &(0x7f0000000040)='./cgroup\x00',
&(0x7f0000000000)='ntfs3\x00', 0x2208004, 0x0)
syz_clone(0x88200200, 0x0, 0x0, 0x0, 0x0, 0x0)
Here, the ioctl sets the bdev block size to 16384. During mount,
get_tree_bdev_flags() calls sb_set_blocksize(sb, block_size(bdev)),
but since block_size(bdev) > PAGE_SIZE, sb_set_blocksize() leaves
sb->s_blocksize at zero.
Later, ntfs_init_from_boot() attempts to read the boot_block while
sb->s_blocksize is still zero, which triggers the bug.
[almaz.alexandrovich@paragon-software.com: changed comment style, added
return value handling] |
| In the Linux kernel, the following vulnerability has been resolved:
f2fs: invalidate dentry cache on failed whiteout creation
F2FS can mount filesystems with corrupted directory depth values that
get runtime-clamped to MAX_DIR_HASH_DEPTH. When RENAME_WHITEOUT
operations are performed on such directories, f2fs_rename performs
directory modifications (updating target entry and deleting source
entry) before attempting to add the whiteout entry via f2fs_add_link.
If f2fs_add_link fails due to the corrupted directory structure, the
function returns an error to VFS, but the partial directory
modifications have already been committed to disk. VFS assumes the
entire rename operation failed and does not update the dentry cache,
leaving stale mappings.
In the error path, VFS does not call d_move() to update the dentry
cache. This results in new_dentry still pointing to the old inode
(new_inode) which has already had its i_nlink decremented to zero.
The stale cache causes subsequent operations to incorrectly reference
the freed inode.
This causes subsequent operations to use cached dentry information that
no longer matches the on-disk state. When a second rename targets the
same entry, VFS attempts to decrement i_nlink on the stale inode, which
may already have i_nlink=0, triggering a WARNING in drop_nlink().
Example sequence:
1. First rename (RENAME_WHITEOUT): file2 → file1
- f2fs updates file1 entry on disk (points to inode 8)
- f2fs deletes file2 entry on disk
- f2fs_add_link(whiteout) fails (corrupted directory)
- Returns error to VFS
- VFS does not call d_move() due to error
- VFS cache still has: file1 → inode 7 (stale!)
- inode 7 has i_nlink=0 (already decremented)
2. Second rename: file3 → file1
- VFS uses stale cache: file1 → inode 7
- Tries to drop_nlink on inode 7 (i_nlink already 0)
- WARNING in drop_nlink()
Fix this by explicitly invalidating old_dentry and new_dentry when
f2fs_add_link fails during whiteout creation. This forces VFS to
refresh from disk on subsequent operations, ensuring cache consistency
even when the rename partially succeeds.
Reproducer:
1. Mount F2FS image with corrupted i_current_depth
2. renameat2(file2, file1, RENAME_WHITEOUT)
3. renameat2(file3, file1, 0)
4. System triggers WARNING in drop_nlink() |
| In the Linux kernel, the following vulnerability has been resolved:
ublk: clean up user copy references on ublk server exit
If a ublk server process releases a ublk char device file, any requests
dispatched to the ublk server but not yet completed will retain a ref
value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef83 ("ublk: simplify
aborting ublk request"), __ublk_fail_req() would decrement the reference
count before completing the failed request. However, that commit
optimized __ublk_fail_req() to call __ublk_complete_rq() directly
without decrementing the request reference count.
The leaked reference count incorrectly allows user copy and zero copy
operations on the completed ublk request. It also triggers the
WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit()
and ublk_deinit_queue().
Commit c5c5eb24ed61 ("ublk: avoid ublk_io_release() called after ublk
char dev is closed") already fixed the issue for ublk devices using
UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference
count leak also affects UBLK_F_USER_COPY, the other reference-counted
data copy mode. Fix the condition in ublk_check_and_reset_active_ref()
to include all reference-counted data copy modes. This ensures that any
ublk requests still owned by the ublk server when it exits have their
reference counts reset to 0. |
| In the Linux kernel, the following vulnerability has been resolved:
shmem: fix recovery on rename failures
maple_tree insertions can fail if we are seriously short on memory;
simple_offset_rename() does not recover well if it runs into that.
The same goes for simple_offset_rename_exchange().
Moreover, shmem_whiteout() expects that if it succeeds, the caller will
progress to d_move(), i.e. that shmem_rename2() won't fail past the
successful call of shmem_whiteout().
Not hard to fix, fortunately - mtree_store() can't fail if the index we
are trying to store into is already present in the tree as a singleton.
For simple_offset_rename_exchange() that's enough - we just need to be
careful about the order of operations.
For simple_offset_rename() solution is to preinsert the target into the
tree for new_dir; the rest can be done without any potentially failing
operations.
That preinsertion has to be done in shmem_rename2() rather than in
simple_offset_rename() itself - otherwise we'd need to deal with the
possibility of failure after successful shmem_whiteout(). |
| In the Linux kernel, the following vulnerability has been resolved:
Input: lkkbd - disable pending work before freeing device
lkkbd_interrupt() schedules lk->tq via schedule_work(), and the work
handler lkkbd_reinit() dereferences the lkkbd structure and its
serio/input_dev fields.
lkkbd_disconnect() and error paths in lkkbd_connect() free the lkkbd
structure without preventing the reinit work from being queued again
until serio_close() returns. This can allow the work handler to run
after the structure has been freed, leading to a potential use-after-free.
Use disable_work_sync() instead of cancel_work_sync() to ensure the
reinit work cannot be re-queued, and call it both in lkkbd_disconnect()
and in lkkbd_connect() error paths after serio_open(). |
| In the Linux kernel, the following vulnerability has been resolved:
functionfs: fix the open/removal races
ffs_epfile_open() can race with removal, ending up with file->private_data
pointing to freed object.
There is a total count of opened files on functionfs (both ep0 and
dynamic ones) and when it hits zero, dynamic files get removed.
Unfortunately, that removal can happen while another thread is
in ffs_epfile_open(), but has not incremented the count yet.
In that case open will succeed, leaving us with UAF on any subsequent
read() or write().
The root cause is that ffs->opened is misused; atomic_dec_and_test() vs.
atomic_add_return() is not a good idea, when object remains visible all
along.
To untangle that
* serialize openers on ffs->mutex (both for ep0 and for dynamic files)
* have dynamic ones use atomic_inc_not_zero() and fail if we had
zero ->opened; in that case the file we are opening is doomed.
* have the inodes of dynamic files marked on removal (from the
callback of simple_recursive_removal()) - clear ->i_private there.
* have open of dynamic ones verify they hadn't been already removed,
along with checking that state is FFS_ACTIVE. |
| In the Linux kernel, the following vulnerability has been resolved:
drm/xe/oa: Limit num_syncs to prevent oversized allocations
The OA open parameters did not validate num_syncs, allowing
userspace to pass arbitrarily large values, potentially
leading to excessive allocations.
Add check to ensure that num_syncs does not exceed DRM_XE_MAX_SYNCS,
returning -EINVAL when the limit is violated.
v2: use XE_IOCTL_DBG() and drop duplicated check. (Ashutosh)
(cherry picked from commit e057b2d2b8d815df3858a87dffafa2af37e5945b) |
| In the Linux kernel, the following vulnerability has been resolved:
tpm: Cap the number of PCR banks
tpm2_get_pcr_allocation() does not cap any upper limit for the number of
banks. Cap the limit to eight banks so that out of bounds values coming
from external I/O cause on only limited harm. |
| In the Linux kernel, the following vulnerability has been resolved:
net: nfc: fix deadlock between nfc_unregister_device and rfkill_fop_write
A deadlock can occur between nfc_unregister_device() and rfkill_fop_write()
due to lock ordering inversion between device_lock and rfkill_global_mutex.
The problematic lock order is:
Thread A (rfkill_fop_write):
rfkill_fop_write()
mutex_lock(&rfkill_global_mutex)
rfkill_set_block()
nfc_rfkill_set_block()
nfc_dev_down()
device_lock(&dev->dev) <- waits for device_lock
Thread B (nfc_unregister_device):
nfc_unregister_device()
device_lock(&dev->dev)
rfkill_unregister()
mutex_lock(&rfkill_global_mutex) <- waits for rfkill_global_mutex
This creates a classic ABBA deadlock scenario.
Fix this by moving rfkill_unregister() and rfkill_destroy() outside the
device_lock critical section. Store the rfkill pointer in a local variable
before releasing the lock, then call rfkill_unregister() after releasing
device_lock.
This change is safe because rfkill_fop_write() holds rfkill_global_mutex
while calling the rfkill callbacks, and rfkill_unregister() also acquires
rfkill_global_mutex before cleanup. Therefore, rfkill_unregister() will
wait for any ongoing callback to complete before proceeding, and
device_del() is only called after rfkill_unregister() returns, preventing
any use-after-free.
The similar lock ordering in nfc_register_device() (device_lock ->
rfkill_global_mutex via rfkill_register) is safe because during
registration the device is not yet in rfkill_list, so no concurrent
rfkill operations can occur on this device. |
| In the Linux kernel, the following vulnerability has been resolved:
ipv6: fix a BUG in rt6_get_pcpu_route() under PREEMPT_RT
On PREEMPT_RT kernels, after rt6_get_pcpu_route() returns NULL, the
current task can be preempted. Another task running on the same CPU
may then execute rt6_make_pcpu_route() and successfully install a
pcpu_rt entry. When the first task resumes execution, its cmpxchg()
in rt6_make_pcpu_route() will fail because rt6i_pcpu is no longer
NULL, triggering the BUG_ON(prev). It's easy to reproduce it by adding
mdelay() after rt6_get_pcpu_route().
Using preempt_disable/enable is not appropriate here because
ip6_rt_pcpu_alloc() may sleep.
Fix this by handling the cmpxchg() failure gracefully on PREEMPT_RT:
free our allocation and return the existing pcpu_rt installed by
another task. The BUG_ON is replaced by WARN_ON_ONCE for non-PREEMPT_RT
kernels where such races should not occur. |
| In the Linux kernel, the following vulnerability has been resolved:
ASoC: stm32: sai: fix OF node leak on probe
The reference taken to the sync provider OF node when probing the
platform device is currently only dropped if the set_sync() callback
fails during DAI probe.
Make sure to drop the reference on platform probe failures (e.g. probe
deferral) and on driver unbind.
This also avoids a potential use-after-free in case the DAI is ever
reprobed without first rebinding the platform driver. |
| In the Linux kernel, the following vulnerability has been resolved:
drm/ttm: Avoid NULL pointer deref for evicted BOs
It is possible for a BO to exist that is not currently associated with a
resource, e.g. because it has been evicted.
When devcoredump tries to read the contents of all BOs for dumping, we need
to expect this as well -- in this case, ENODATA is recorded instead of the
buffer contents. |
| In the Linux kernel, the following vulnerability has been resolved:
RDMA/cm: Fix leaking the multicast GID table reference
If the CM ID is destroyed while the CM event for multicast creating is
still queued the cancel_work_sync() will prevent the work from running
which also prevents destroying the ah_attr. This leaks a refcount and
triggers a WARN:
GID entry ref leak for dev syz1 index 2 ref=573
WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 release_gid_table drivers/infiniband/core/cache.c:806 [inline]
WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 gid_table_release_one+0x284/0x3cc drivers/infiniband/core/cache.c:886
Destroy the ah_attr after canceling the work, it is safe to call this
twice. |
| In the Linux kernel, the following vulnerability has been resolved:
ipv6: BUG() in pskb_expand_head() as part of calipso_skbuff_setattr()
There exists a kernel oops caused by a BUG_ON(nhead < 0) at
net/core/skbuff.c:2232 in pskb_expand_head().
This bug is triggered as part of the calipso_skbuff_setattr()
routine when skb_cow() is passed headroom > INT_MAX
(i.e. (int)(skb_headroom(skb) + len_delta) < 0).
The root cause of the bug is due to an implicit integer cast in
__skb_cow(). The check (headroom > skb_headroom(skb)) is meant to ensure
that delta = headroom - skb_headroom(skb) is never negative, otherwise
we will trigger a BUG_ON in pskb_expand_head(). However, if
headroom > INT_MAX and delta <= -NET_SKB_PAD, the check passes, delta
becomes negative, and pskb_expand_head() is passed a negative value for
nhead.
Fix the trigger condition in calipso_skbuff_setattr(). Avoid passing
"negative" headroom sizes to skb_cow() within calipso_skbuff_setattr()
by only using skb_cow() to grow headroom.
PoC:
Using `netlabelctl` tool:
netlabelctl map del default
netlabelctl calipso add pass doi:7
netlabelctl map add default address:0::1/128 protocol:calipso,7
Then run the following PoC:
int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);
// setup msghdr
int cmsg_size = 2;
int cmsg_len = 0x60;
struct msghdr msg;
struct sockaddr_in6 dest_addr;
struct cmsghdr * cmsg = (struct cmsghdr *) calloc(1,
sizeof(struct cmsghdr) + cmsg_len);
msg.msg_name = &dest_addr;
msg.msg_namelen = sizeof(dest_addr);
msg.msg_iov = NULL;
msg.msg_iovlen = 0;
msg.msg_control = cmsg;
msg.msg_controllen = cmsg_len;
msg.msg_flags = 0;
// setup sockaddr
dest_addr.sin6_family = AF_INET6;
dest_addr.sin6_port = htons(31337);
dest_addr.sin6_flowinfo = htonl(31337);
dest_addr.sin6_addr = in6addr_loopback;
dest_addr.sin6_scope_id = 31337;
// setup cmsghdr
cmsg->cmsg_len = cmsg_len;
cmsg->cmsg_level = IPPROTO_IPV6;
cmsg->cmsg_type = IPV6_HOPOPTS;
char * hop_hdr = (char *)cmsg + sizeof(struct cmsghdr);
hop_hdr[1] = 0x9; //set hop size - (0x9 + 1) * 8 = 80
sendmsg(fd, &msg, 0); |
| In the Linux kernel, the following vulnerability has been resolved:
net: rose: fix invalid array index in rose_kill_by_device()
rose_kill_by_device() collects sockets into a local array[] and then
iterates over them to disconnect sockets bound to a device being brought
down.
The loop mistakenly indexes array[cnt] instead of array[i]. For cnt <
ARRAY_SIZE(array), this reads an uninitialized entry; for cnt ==
ARRAY_SIZE(array), it is an out-of-bounds read. Either case can lead to
an invalid socket pointer dereference and also leaks references taken
via sock_hold().
Fix the index to use i. |
| In the Linux kernel, the following vulnerability has been resolved:
mptcp: fallback earlier on simult connection
Syzkaller reports a simult-connect race leading to inconsistent fallback
status:
WARNING: CPU: 3 PID: 33 at net/mptcp/subflow.c:1515 subflow_data_ready+0x40b/0x7c0 net/mptcp/subflow.c:1515
Modules linked in:
CPU: 3 UID: 0 PID: 33 Comm: ksoftirqd/3 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:subflow_data_ready+0x40b/0x7c0 net/mptcp/subflow.c:1515
Code: 89 ee e8 78 61 3c f6 40 84 ed 75 21 e8 8e 66 3c f6 44 89 fe bf 07 00 00 00 e8 c1 61 3c f6 41 83 ff 07 74 09 e8 76 66 3c f6 90 <0f> 0b 90 e8 6d 66 3c f6 48 89 df e8 e5 ad ff ff 31 ff 89 c5 89 c6
RSP: 0018:ffffc900006cf338 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888031acd100 RCX: ffffffff8b7f2abf
RDX: ffff88801e6ea440 RSI: ffffffff8b7f2aca RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000007
R10: 0000000000000004 R11: 0000000000002c10 R12: ffff88802ba69900
R13: 1ffff920000d9e67 R14: ffff888046f81800 R15: 0000000000000004
FS: 0000000000000000(0000) GS:ffff8880d69bc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560fc0ca1670 CR3: 0000000032c3a000 CR4: 0000000000352ef0
Call Trace:
<TASK>
tcp_data_queue+0x13b0/0x4f90 net/ipv4/tcp_input.c:5197
tcp_rcv_state_process+0xfdf/0x4ec0 net/ipv4/tcp_input.c:6922
tcp_v6_do_rcv+0x492/0x1740 net/ipv6/tcp_ipv6.c:1672
tcp_v6_rcv+0x2976/0x41e0 net/ipv6/tcp_ipv6.c:1918
ip6_protocol_deliver_rcu+0x188/0x1520 net/ipv6/ip6_input.c:438
ip6_input_finish+0x1e4/0x4b0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
NF_HOOK include/linux/netfilter.h:312 [inline]
ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:500
dst_input include/net/dst.h:471 [inline]
ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline]
NF_HOOK include/linux/netfilter.h:318 [inline]
NF_HOOK include/linux/netfilter.h:312 [inline]
ipv6_rcv+0x264/0x650 net/ipv6/ip6_input.c:311
__netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:5979
__netif_receive_skb+0x1d/0x160 net/core/dev.c:6092
process_backlog+0x442/0x15e0 net/core/dev.c:6444
__napi_poll.constprop.0+0xba/0x550 net/core/dev.c:7494
napi_poll net/core/dev.c:7557 [inline]
net_rx_action+0xa9f/0xfe0 net/core/dev.c:7684
handle_softirqs+0x216/0x8e0 kernel/softirq.c:579
run_ksoftirqd kernel/softirq.c:968 [inline]
run_ksoftirqd+0x3a/0x60 kernel/softirq.c:960
smpboot_thread_fn+0x3f7/0xae0 kernel/smpboot.c:160
kthread+0x3c2/0x780 kernel/kthread.c:463
ret_from_fork+0x5d7/0x6f0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
The TCP subflow can process the simult-connect syn-ack packet after
transitioning to TCP_FIN1 state, bypassing the MPTCP fallback check,
as the sk_state_change() callback is not invoked for * -> FIN_WAIT1
transitions.
That will move the msk socket to an inconsistent status and the next
incoming data will hit the reported splat.
Close the race moving the simult-fallback check at the earliest possible
stage - that is at syn-ack generation time.
About the fixes tags: [2] was supposed to also fix this issue introduced
by [3]. [1] is required as a dependence: it was not explicitly marked as
a fix, but it is one and it has already been backported before [3]. In
other words, this commit should be backported up to [3], including [2]
and [1] if that's not already there. |
| In the Linux kernel, the following vulnerability has been resolved:
iommu: disable SVA when CONFIG_X86 is set
Patch series "Fix stale IOTLB entries for kernel address space", v7.
This proposes a fix for a security vulnerability related to IOMMU Shared
Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
page table entries. When a kernel page table page is freed and
reallocated for another purpose, the IOMMU might still hold stale,
incorrect entries. This can be exploited to cause a use-after-free or
write-after-free condition, potentially leading to privilege escalation or
data corruption.
This solution introduces a deferred freeing mechanism for kernel page
table pages, which provides a safe window to notify the IOMMU to
invalidate its caches before the page is reused.
This patch (of 8):
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.
Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a
Write-After-Free issue, as the IOMMU will potentially continue to write
Accessed and Dirty bits to the freed memory while attempting to walk the
stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables all
the way down to the leaf entries, where it realizes the mapping is for the
kernel and errors out. This means the IOMMU still caches these
intermediate page table entries, making the described vulnerability a real
concern.
Disable SVA on x86 architecture until the IOMMU can receive notification
to flush the paging cache before freeing the CPU kernel page table pages. |
| In the Linux kernel, the following vulnerability has been resolved:
nfsd: fix nfsd_file reference leak in nfsd4_add_rdaccess_to_wrdeleg()
nfsd4_add_rdaccess_to_wrdeleg() unconditionally overwrites
fp->fi_fds[O_RDONLY] with a newly acquired nfsd_file. However, if
the client already has a SHARE_ACCESS_READ open from a previous OPEN
operation, this action overwrites the existing pointer without
releasing its reference, orphaning the previous reference.
Additionally, the function originally stored the same nfsd_file
pointer in both fp->fi_fds[O_RDONLY] and fp->fi_rdeleg_file with
only a single reference. When put_deleg_file() runs, it clears
fi_rdeleg_file and calls nfs4_file_put_access() to release the file.
However, nfs4_file_put_access() only releases fi_fds[O_RDONLY] when
the fi_access[O_RDONLY] counter drops to zero. If another READ open
exists on the file, the counter remains elevated and the nfsd_file
reference from the delegation is never released. This potentially
causes open conflicts on that file.
Then, on server shutdown, these leaks cause __nfsd_file_cache_purge()
to encounter files with an elevated reference count that cannot be
cleaned up, ultimately triggering a BUG() in kmem_cache_destroy()
because there are still nfsd_file objects allocated in that cache. |