Please note: This project is no longer active. The website is kept online for historic purposes only.
If you´re looking for a Linux driver for your Atheros WLAN device, you should continue here .

Ticket #1621 (closed defect: fixed)

Opened 12 years ago

Last modified 12 years ago

Kernel oopses in ath_draintxq, and other ath_node unreferences due to refcount bug

Reported by: anonymous Assigned to: mtaylor
Priority: critical Milestone: version 0.9.5
Component: madwifi: 802.11 stack Version: trunk
Keywords: oops crash Cc:
Patch is attached: 0 Pending:

Description (Last modified by mrenzmann)

Unable to handle kernel NULL pointer dereference at 000000000000014b RIP:
 [<ffffffff881617d8>] :ath_pci:ath_tx_draintxq+0x128/0x270
PGD 253fd067 PUD 253f3067 PMD 0
Oops: 0000 [1] SMP
Modules linked in: wlan_wep ipt_LOG ipt_REJECT xt_tcpudp nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable
_filter ip_tables x_tables wlan_scan_sta fglrx(P) binfmt_misc fan microcode deflate zlib_deflate zlib_inflate twof
ish twofish_common camellia serpent blowfish des cbc ecb blkcipher aes xcbc sha256 sha1 md5 crypto_null hmac crypt
o_hash cryptomgr crypto_algapi af_key fuse dm_snapshot dm_mirror dm_mod cpufreq_ondemand cpufreq_conservative cpuf
req_powersave acpi_cpufreq freq_table kvm_intel kvm joydev hdaps tp_smapi thinkpad_ec loop sg sr_mod ath_rate_samp
le cdrom irtty_sir tsdev sir_dev snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss pcmcia
firmware_class snd_seq_midi snd_rawmidi snd_seq_midi_event nsc_ircc snd_seq snd_timer irda snd_seq_device snd crc_
ccitt ath_pci wlan psmouse thermal ath_hal(P) ata_piix ac serio_raw yenta_socket soundcore snd_page_alloc i2c_i801
 video output battery processor rsrc_nonstatic pcmcia_core iTCO_wdt iTCO_vendor_support button ehci_hcd evdev thin
kpad_acpi intel_agp e1000 ata_generic nvram uhci_hcd pcspkr rtc
Pid: 4441, comm: ifconfig Tainted: P        2.6.23-1 #1
RIP: 0010:[<ffffffff881617d8>]  [<ffffffff881617d8>] :ath_pci:ath_tx_draintxq+0x128/0x270
RSP: 0018:ffff81002d54fc78  EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffffffff88161772
RDX: ffff810000e67fb8 RSI: ffff810000e67f90 RDI: ffffffff80587e58
RBP: ffff81002c475000 R08: 0000000000000000 R09: ffff81003e7d8000
R10: 0000000040001071 R11: 00000000ffffffff R12: 0000000000000000
R13: ffff81003e01beb0 R14: ffff81003eae4700 R15: ffff81003eae6158
FS:  00002b1b691a76e0(0000) GS:ffff81003fec1100(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000014b CR3: 00000000253e8000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ifconfig (pid: 4441, threadinfo ffff81002d54e000, task ffff810007bc3560)
Stack:  ffff81003fe06000 000000010000f7f7 0000000000000003 ffff81003eae60f0
 ffff81003eae6168 ffff81003eae6178 ffff81003e7d8000 0000000000000002
 ffff81003eae4700 0000000000000282 ffff81003e7d8000 ffff81003e7d8000
Call Trace:
 [<ffffffff8816199e>] :ath_pci:ath_draintxq+0x7e/0x220
 [<ffffffff88161f73>] :ath_pci:ath_stop_locked+0x63/0x2f0
 [<ffffffff88162291>] :ath_pci:ath_stop+0x91/0xc0
 [<ffffffff804016d2>] dev_close+0x62/0x90
 [<ffffffff8813803b>] :wlan:ieee80211_stop+0xfb/0x120
 [<ffffffff804016d2>] dev_close+0x62/0x90
 [<ffffffff804007f2>] dev_change_flags+0x92/0x1c0
 [<ffffffff8044f960>] devinet_ioctl+0x590/0x740
 [<ffffffff803f39ff>] sock_ioctl+0xbf/0x240
 [<ffffffff802a6aaf>] do_ioctl+0x2f/0xa0
 [<ffffffff802a6d40>] vfs_ioctl+0x220/0x2c0
 [<ffffffff802a6e75>] sys_ioctl+0x95/0xb0
 [<ffffffff8020c31e>] system_call+0x7e/0x83

Code: f6 80 4b 01 00 00 01 0f 85 b6 00 00 00 f0 ff 4d 38 0f 94 c0
RIP  [<ffffffff881617d8>] :ath_pci:ath_tx_draintxq+0x128/0x270
 RSP <ffff81002d54fc78>
CR2: 000000000000014b

Change History

11/03/07 08:11:21 changed by anonymous

Aditional information: OS: Debian unstable amd64 kernel: Custom, 64 bit, with Tickless patch ( hardware: Thinkpad T60

11/03/07 08:27:33 changed by mtaylor

I believe this is a node refcount bug that happens when a node has already been freed, but the pointer in the buffer still exists as we are sweeping through the queue to drain it. I'm still trying to narrow this down, but it appears to have been exposed by a recent fix where a reference count was being bumped five times inadvertently due to a macro in r2792 which was causing nodes to stick around beyond ath_draintxq.

I've found a few places where we aren't unreferencing enough, but this bug is from unreferencing without zeroing out a pointer. I've found only one or two possible cases of this and I'm testing a patch for it now.

11/03/07 08:27:40 changed by mtaylor

  • status changed from new to assigned.
  • owner set to mtaylor.

(follow-up: ↓ 5 ) 11/03/07 08:28:46 changed by mtaylor

Can you reproduce this reliably? What version of madwifi are you on?

(in reply to: ↑ 4 ) 11/03/07 11:17:20 changed by anonymous

Replying to mtaylor:

Can you reproduce this reliably?

Unfortunately too reliably :( sometimes it ends even with kernel panic.

What version of madwifi are you on?

r2824 from trunk but it started about ~10 revisions back

If it would be helpful I can build the source with debug info and find out exact lines of code.

11/03/07 18:30:51 changed by

reproduced here too, interesting is, that i has been only able to reproduce it - yet - on non-SMP and SLUB kernel. I have tested two kernels yet, one optimized for i486, SLUB, no-SMP, the second optimized for Core Duo, SLAB, SMP enabled. Second one had no problem either on Core Duo HW and on Sempron 64 box, the first one crashed on both on wlanconfig destroy, also on ifconfig down and even randomly. Trace follows :

[  113.140198] BUG: unable to handle kernel NULL pointer dereference at virtual address 000000bb
[  113.148920]  printing eip:
[  113.151711] *pde = 00000000
[  113.154593] Oops: 0000 [#1]
[  113.157472] Modules linked in: wlan_scan_sta wlan_scan_ap snd_pcm_oss snd_mixer_oss nfsd exportfs ipv6 ohci_hcd rtc_cmos rtc_core rtc_lib amd64_agp agpgart e1000 hostap_pci hostap ieee80211_crypt ath_rate_sample ath_pci wlan ath_hal(P) ub shpchp pci_hotplug usb_storage uhci_hcd ehci_hcd usbcore i2c_viapro snd_via82xx snd_ac97_codec ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd via_rhine k8temp eeprom w83781d hwmon_vid i2c_amd756 i2c_core ide_scsi
[  113.203838] CPU:    0
[  113.203839] EIP:    0060:[<e0c8e6eb>]    Tainted: P       VLI
[  113.203840] EFLAGS: 00010286   ( #13)
[  113.220773] EIP is at ieee80211_node_vdetach+0x2b/0xd0 [wlan]
[  113.226597] eax: 00000000   ebx: da75f000   ecx: dff2a3a0   edx: 00000082
[  113.233459] esi: da4e43a0   edi: da4e4000   ebp: dff2a3a0   esp: da0cfe84
[  113.240321] ds: 007b   es: 007b   fs: 0000  gs: 0033  ss: 0068
[  113.246229] Process wlanconfig (pid: 9128, ti=da0ce000 task=da16a500 task.ti=da0ce000)
[  113.254043] Stack: c13f7d80 00000287 c0471cb0 00000287 e0c95715 da4e43a0 dff2a3a0 da4e4000 
[  113.263038]        da4e43a0 dff2a3a0 e0c83c6a 00000000 da4e43a0 00000001 e0cb8069 00000000 
[  113.272036]        c0301116 da4e43a0 e0c971f1 dfe1c79c dffb8000 dff2a000 da0cff18 da4e43a0 
[  113.281035] Call Trace:
[  113.283847]  [<e0c95715>] ieee80211_power_vdetach+0x15/0x20 [wlan]
[  113.290183]  [<e0c83c6a>] ieee80211_vap_detach+0xba/0x100 [wlan]
[  113.296340]  [<e0cb8069>] ath_vap_delete+0xc9/0x350 [ath_pci]
[  113.302230]  [<c0301116>] dev_close+0x56/0x80
[  113.306737]  [<e0c971f1>] ieee80211_stop+0xd1/0x100 [wlan]
[  113.312374]  [<e0ca06d0>] ieee80211_ioctl+0x0/0x100 [wlan]
[  113.318010]  [<e0ca07a8>] ieee80211_ioctl+0xd8/0x100 [wlan]
[  113.323739]  [<c030342a>] dev_ifsioc+0x2fa/0x360
[  113.328507]  [<c030360e>] dev_ioctl+0x17e/0x200
[  113.333186]  [<c02f6ea9>] sock_ioctl+0xb9/0x200
[  113.337857]  [<c02f6df0>] sock_ioctl+0x0/0x200
[  113.342442]  [<c016ebf4>] do_ioctl+0x54/0x70
[  113.346863]  [<c02f74be>] sys_socket+0x2e/0x50
[  113.351453]  [<c016ed3e>] vfs_ioctl+0x5e/0x1c0
[  113.356038]  [<c016eefb>] sys_ioctl+0x5b/0x90
[  113.360538]  [<c0103fd6>] syscall_call+0x7/0xb
[  113.365132]  =======================

Also, it sometimes crashed in the way creator of this ticket posted, will capture it later too.

11/04/07 09:33:14 changed by mrenzmann

  • description changed.

11/05/07 15:59:59 changed by

compiled two exactly same kernels, one with SLAB and the other one with SLUB used- with SLAB allocator, all is ok, no oops on wlanconfig destroy - the panic i posted above, or ifconfig down - trace from it follows :

[  164.951756] BUG: unable to handle kernel NULL pointer dereference at virtual address 000000bb
[  164.960494]  printing eip:
[  164.963299] *pde = 00000000
[  164.966193] Oops: 0000 [#1]
[  164.969081] Modules linked in: wlan_scan_sta wlan_scan_ap nfsd exportfs ipv6 ohci_hcd rtc_cmos rtc_core rtc_lib usbvision videodev v4l2_common v4l1_compat uhci_hcd ehci_hcd usbcore iTCO_wdt i2c_i801 e1000 shpchp pci_hotplug ath_rate_sample ath_pci ath_hal(P) wlan_acl wlan sky2 3c59x skge eeprom w83781d hwmon_vid i2c_amd756 i2c_core ide_scsi
[  165.001353] CPU:    0
[  165.001354] EIP:    0060:[<f8bbfd78>]    Tainted: P       VLI
[  165.001355] EFLAGS: 00010286   ( #9)
[  165.017464] EIP is at ath_tx_draintxq+0xa8/0x1d0 [ath_pci]
[  165.023043] eax: 00000000   ebx: f1ffa000   ecx: f6904600   edx: 00000000
[  165.029917] esi: 00000000   edi: f50e05dc   ebp: f4a63820   esp: f1f29dfc
[  165.036789] ds: 007b   es: 007b   fs: 0000  gs: 0033  ss: 0068
[  165.042714] Process ifconfig (pid: 8654, ti=f1f28000 task=f7c6a000 task.ti=f1f28000)
[  165.050372] Stack: 00000282 f50b8000 f4a623a0 f4a63910 00000282 f50b8000 00000008 00000000 
[  165.059320]        f50b8000 f4a623a0 f4a63820 00000003 f4a623a0 00000001 f8bbffe9 f50b8000 
[  165.069857]        00000009 00000009 00000246 f4a623a0 f50b8000 f4a62000 f8bb55a8 f50b8000 
[  165.078808] Call Trace:
[  165.081640]  [<f8bbffe9>] ath_draintxq+0x79/0x150 [ath_pci]
[  165.087361]  [<f8bb55a8>] ath_stop_locked+0x58/0x1f0 [ath_pci]
[  165.093342]  [<c012ccb0>] wakeme_after_rcu+0x0/0x10
[  165.098361]  [<f8bb577a>] ath_stop+0x3a/0xb0 [ath_pci]
[  165.103650]  [<c030079c>] dev_close+0x5c/0x80
[  165.108151]  [<f8b6e241>] ieee80211_stop+0xd1/0x100 [wlan]
[  165.113798]  [<c030079c>] dev_close+0x5c/0x80
[  165.118295]  [<c030257c>] dev_change_flags+0x4c/0x130
[  165.123492]  [<c03475d9>] devinet_ioctl+0x239/0x550
[  165.128516]  [<c03497ce>] inet_ioctl+0x5e/0x90
[  165.133100]  [<c02f64c4>] sock_ioctl+0xa4/0x200
[  165.137776]  [<c02f6420>] sock_ioctl+0x0/0x200
[  165.142363]  [<c016eb74>] do_ioctl+0x54/0x70
[  165.146778]  [<c016ecbe>] vfs_ioctl+0x5e/0x1b0
[  165.151365]  [<c016ee6b>] sys_ioctl+0x5b/0x90
[  165.155866]  [<c0104012>] syscall_call+0x7/0xb
[  165.160457]  =======================
[  165.164125] Code: 00 00 00 00 85 f6 74 1a 8d 74 26 00 8d bc 27 00 00 00 00 8b 1e 89 f0 89 de e8 25 0d 74 c7 85 db 75 f1 8b 5f 44 85 db 74 1c 8b 03 <f6> 80 bb 00 00 00 01 75 59 ff 4b 1c 0f 94 c0 84 c0 75 46 c7 47 
[  165.186527] EIP: [<f8bbfd78>] ath_tx_draintxq+0xa8/0x1d0 [ath_pci] SS:ESP 0068:f1f29dfc

11/05/07 17:28:53 changed by mtaylor

This is a reference counting problem where references exist after the reference counter reaches zero (i.e. a bug where the reference count was not maintained correctly by code holding these references).

The allocator could change the symptoms, easily. The allocator that does NOT fail is masking the problem. After the ath_node has been freed, subsequent attempts to free the node should cause an oops. Seems that you are getting bad behavior if you got it to work with a different allocator.

11/05/07 17:34:28 changed by mtaylor

  • summary changed from kernel oopses with wlanconfig ath0 destroy, almost always reproducable when wep is used and station is assosiated. to Kernel oopses in ath_draintxq, and other cases where.

Sorry, above I meant ath_buf* not ath_node.

When releasing an ath_buf* the second time, at least one duplicate invocation of dev_kfree_skb_* will happen on the same memory address. The kernel memory allocator is asked to reclaim the same address. Depending on the allocator this can cause various pathology, but do not be fooled into thinking it's ok. I'm working on a proper fix for the problem, and a quick and dirty fix.

11/05/07 17:39:23 changed by mtaylor

r2792 is "Temporary workaround to refcount bugs exposed by r2792"

The patch basically puts a big nasty message about this being a quick and dirty hack, and then bumps the reference count on the node.

If the node is reclaimed more than once, kfree is called on the same address. Behavior is allocator specific. All the SKB may still also point to the node, resulting in use of the node after free.

--- net80211/ieee80211_node.c (revision 2827) +++ net80211/ieee80211_node.c (working copy) @@ -738,6 +738,21 @@

ni->ni_table = nt; tni = ieee80211_ref_node(ni);

+ /* MT: BEGIN HACK + * XXX: r2792 fixed a bug where ieee80211_ref_node was being + * invoked five times on ni because it was passed to TAILQ_INSERT_TAIL + * which happens to evaluate it's value five times. This fixed a problem + * where we were gaining four extra references, but unfortunately exposed + * other bugs which cause kernel panics. I am working on a fix for this as part + * of ticket #1621. In the meantime, adding these references back is 'wrong' + * but better than getting five new critical defects per day while debugging. + */ + ieee80211_ref_node(ni);ieee80211_ref_node(ni); + ieee80211_ref_node(ni);ieee80211_ref_node(ni); + /* + * END HACK + */ +

11/05/07 17:45:31 changed by mtaylor

I am working up a patch that introduces:

A) a set of helper functions for ath_node, ath_buf, skb cleanup B) a little more diagnostics for population statistics on these data structures C) simple, consistent rules for when to take/release references

11/05/07 17:54:41 changed by mtaylor

Ahem, r2828. ;-)

11/05/07 17:55:01 changed by mtaylor

  • summary changed from Kernel oopses in ath_draintxq, and other cases where to Kernel oopses in ath_draintxq, and other ath_node unreferences due to refcount bug.

11/12/07 20:30:16 changed by mentor

  • priority changed from blocker to critical.
  • component changed from madwifi: driver to madwifi: 802.11 stack.
  • status changed from assigned to closed.
  • resolution set to fixed.
  • milestone set to version 0.9.4.

I believe recent fixes makes this code correct. It is, however, still ugly. Documentation and tidying won't go unnoticed.

02/11/08 06:23:08 changed by mrenzmann

  • milestone changed from version 0.9.4 to version 0.9.5.