Ticket #472 (closed defect: fixed)

Opened 14 years ago

Last modified 14 years ago

Device Driver Crashes, Kernel Panics and Locking Problems

Reported by: dyqith@gmail.com
Assigned to: dyqith
Priority: major
Milestone: version 0.9.0 - move to new codebase
Component: madwifi: driver
Version: trunk
Keywords:
Cc: dimitris@gmail.com,daniel.blueman@gmail.com
Patch is attached: 1
Pending:

Description

As many people know, there are lots of open tickets about kernel panics, driver crashes and other hard-to-reproduce problems.

I've spent some time re-doing the driver locking calls to avoid some of these problems. I'll put up the patch here in a few hours (after stress testing it myself first).

Here is a list of all active tickets that the patch may help solve:

http://madwifi.org/ticket/74

http://madwifi.org/ticket/97

http://madwifi.org/ticket/182

http://madwifi.org/ticket/191

http://madwifi.org/ticket/228

http://madwifi.org/ticket/279

http://madwifi.org/ticket/287

http://madwifi.org/ticket/292

http://madwifi.org/ticket/318

http://madwifi.org/ticket/319

http://madwifi.org/ticket/321

http://madwifi.org/ticket/377

http://madwifi.org/ticket/378

http://madwifi.org/ticket/397

http://madwifi.org/ticket/402

http://madwifi.org/ticket/420

http://madwifi.org/ticket/437

http://madwifi.org/ticket/456

http://madwifi.org/ticket/464

Note: I'm not sure of the stability of this patch, so use at your own risk.

Attachments

madwifi-locks.diff (42.2 kB) - added by dyqith@gmail.com on 03/17/06 08:50:00.
redone locking mechanisms to avoid double locking
madwifi-locks2.diff (35.7 kB) - added by dyqith@gmail.com on 03/24/06 17:54:42.
Better locking for the driver
spin_locks-20050328b.diff (42.9 kB) - added by dyqith@gmail.com on 03/29/06 00:09:03.
Even better locking support
spin_locks-20050331.diff (48.7 kB) - added by dyqith@gmail.com on 04/01/06 02:36:28.
added more stability stuff (i.e. changed some dev_kfree_skb's to dev_kfree_skb_any)
spinlocks-20060406.diff (49.1 kB) - added by daniel.blueman@gmail.com on 04/05/06 15:00:09.
As previous patch, but fix spinlock recursion when unloading modules
spinlocks-20060406-2.diff (50.1 kB) - added by daniel.blueman@gmail.com on 04/06/06 08:16:08.
As previous, with further corrections to ACL locking
acl-locking-20060409.diff (1.4 kB) - added by daniel.blueman@gmail.com on 04/09/06 18:41:46.
Fix ACL lock usage - tested
spinlocks-20060419.diff (50.1 kB) - added by dyqith on 04/20/06 00:12:47.
updated spinlock patch to r1518

Change History

03/17/06 08:50:00 changed by dyqith@gmail.com

  • attachment madwifi-locks.diff added.

redone locking mechanisms to avoid double locking

03/17/06 08:51:30 changed by dyqith@gmail.com

Hopefully this patch fixes all those random crashing/kernel panic errors.

Signed-off-by: Daniel Wu <dyqith@gmail.com>

03/17/06 11:34:53 changed by svens

At least on my ThinkPad, this patch introduces several new random crashes. Unfortunately even LKCD doesn't work after a crash, so I have no way to present you a kernel oops output. I suggest making such big code changes only for a specific reason, instead of rewriting the whole code and hoping that this fixes the open bugs. There are several bugs in your 'hopefully fixed' list that you cannot fix with a 'global patch'.

03/17/06 14:42:23 changed by mrenzmann

  • patch_attached set to 1.

03/17/06 16:28:39 changed by jsd@av8n.com

1) Thanks for working on this ....

2) However, this patch is very unhelpful on my system. Immediate panic:

fatal exception in interrupt in ieee80211_find_node called from ieee80211_find_rxnode called from ath_rx_tasklet called from handle_irq_event

I am using svn 1475, which works OK in simple cases. With this patch applied to 1475, I can load the modules, but the command init.d/network start never finishes.

Debian sarge, kernel rev 2.6.15.4

03/17/06 18:32:12 changed by dyqith@gmail.com

to svens: Can you elaborate on the commands you used to cause the crashes?

to jsd@av8n.com: Can you provide the scripts used for startup?

03/17/06 18:44:51 changed by svens

command was ifup ath0, ath0 section in /etc/network/interfaces:

iface ath0 inet dhcp
        pre-up /etc/init.d/wpasupplicant restart
        post-up /usr/bin/wlan.sh
        pre-down /etc/init.d/wpasupplicant stop
        post-down /usr/bin/wlan.sh

(/usr/bin/wlan.sh starts openvpn with a config file depending on the joined wireless network) Distribution is Debian/unstable with a 2.6.15.1 kernel (with LKCD patched).

03/17/06 19:01:04 changed by dyqith@gmail.com

to svens: Can you provide the /usr/bin/wlan.sh script? It'll be helpful to see the commands you run for the madwifi driver.

03/17/06 19:17:58 changed by svens

#!/bin/sh

# IFACE and MODE are set by ifupdown when it runs this hook
if [ "$IFACE" = "ath0" ]; then
        case "$MODE" in
                start)
                        # Pull the ESSID of the joined network out of the iwconfig output
                        ESSID=$(expr "$(iwconfig $IFACE)" : ".*ESSID:\"\([^\"]*\)")
                        if [ "$ESSID" = "neta" -o "$ESSID" = "netc" ]; then
                                openvpn --config /etc/openvpn/openvpn.conf
                        else
                                pkill -f 'openvpn --config /etc/openvpn/openvpn.conf'
                        fi
                        if [ "$ESSID" = "netb" ]; then
                                openvpn --config /etc/openvpn/openvpn-ext.conf
                        else
                                pkill -f 'openvpn --config /etc/openvpn/openvpn-ext.conf'
                        fi

                        touch /tmp/.wlanrunning
                        ;;
                stop)
                        pkill -f 'openvpn --config /etc/openvpn/openvpn.conf'
                        rm /tmp/.wlanrunning
                        ;;
        esac
fi

03/20/06 11:40:27 changed by svens

  • priority changed from blocker to major.

03/24/06 17:54:42 changed by dyqith@gmail.com

  • attachment madwifi-locks2.diff added.

Better locking for the driver

03/24/06 17:57:28 changed by dyqith@gmail.com

The new patch should work better than the old one.

Anyone care to try it?

03/24/06 18:12:54 changed by dyqith@gmail.com

Changes made in the patch:

- Fix some double locking issues from process context/tasklet to interrupt
- Convert IEEE80211_NODE_LOCK to a spinlock instead of a rw_spinlock
- Convert IEEE80211_BEACON_LOCK/IEEE80211_UAPSD_LOCK to IEEE80211_LOCK (locks same thing, less confusing)
- Redefined #defines to avoid ambiguity and standardize them properly
- Used lockflags in local context only
- ath_tx_timeout calls ath_reset instead of ath_init to avoid semaphore locking
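
The "lockflags in local context only" item can be illustrated outside the kernel. What follows is a single-threaded userspace model, not code from the patch: irq_enabled, model_lock_irqsave and model_unlock_irqrestore are hypothetical stand-ins for the CPU interrupt state and for the kernel's spin_lock_irqsave/spin_unlock_irqrestore. It assumes the point of the change is that every caller saves the previous interrupt state in its own stack variable, so that nested critical sections restore the right state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the CPU's interrupt-enable state. */
static bool irq_enabled = true;

/* Save the current state in the caller's variable, then "disable interrupts". */
static void model_lock_irqsave(bool *flags)
{
    assert(flags != NULL);
    *flags = irq_enabled;
    irq_enabled = false;
}

/* Restore whatever state the matching lock call saved. */
static void model_unlock_irqrestore(const bool *flags)
{
    assert(flags != NULL);
    irq_enabled = *flags;
}

/* Each nesting level keeps its own flags on the stack. */
static bool nested_with_local_flags(void)
{
    bool outer, inner;

    irq_enabled = true;
    model_lock_irqsave(&outer);       /* saves "enabled" */
    model_lock_irqsave(&inner);       /* saves "disabled" */
    model_unlock_irqrestore(&inner);  /* still disabled inside the outer section */
    model_unlock_irqrestore(&outer);  /* back to enabled */
    return irq_enabled;
}

/* Both levels share one slot, so the inner save clobbers the outer one. */
static bool nested_with_shared_flags(void)
{
    static bool shared;

    irq_enabled = true;
    model_lock_irqsave(&shared);      /* saves "enabled" */
    model_lock_irqsave(&shared);      /* overwrites it with "disabled" */
    model_unlock_irqrestore(&shared);
    model_unlock_irqrestore(&shared); /* restores "disabled" */
    return irq_enabled;
}
```

In this model nested_with_local_flags() ends with interrupts enabled again, while nested_with_shared_flags() leaves them disabled, which is the kind of state corruption that shared flag storage invites.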

Signed-off-by: Daniel Wu <dyqith@gmail.com>

03/29/06 00:09:03 changed by dyqith@gmail.com

  • attachment spin_locks-20050328b.diff added.

Even better locking support

03/31/06 07:33:00 changed by mrenzmann

  • status changed from new to assigned.
  • owner set to mrenzmann.

Report of success with this patch in ticket #503.

04/01/06 02:36:28 changed by dyqith@gmail.com

  • attachment spin_locks-20050331.diff added.

added more stability stuff (i.e. changed some dev_kfree_skb's to dev_kfree_skb_any)

04/01/06 02:38:50 changed by dyqith@gmail.com

A new version of the patch is up. Following in the footsteps of ticket http://madwifi.org/ticket/480, I went through and changed all the places where dev_kfree_skb was used in hwIRQ context, and also added some notes so developers know which of the major functions run in what context.
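
Why dev_kfree_skb_any is the safer call can be sketched in userspace. In the kernel, dev_kfree_skb_any checks the calling context and, when called from hard-IRQ context or with interrupts disabled, queues the buffer for a later free instead of freeing it directly. The model below is not kernel code and not part of the patch: struct buf, in_hard_irq, free_buf_any, drain_deferred and alloc_buf are hypothetical names, and the deferred list stands in for the queue the kernel processes later in softirq context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct buf {
    struct buf *next;
};

static bool in_hard_irq;       /* hypothetical "are we in hard-IRQ context" flag */
static struct buf *deferred;   /* buffers queued for a later free */
static int freed_now;          /* buffers freed immediately */
static int freed_later;        /* buffers freed by drain_deferred() */

/* Free immediately when it is safe, otherwise queue the buffer. */
static void free_buf_any(struct buf *b)
{
    assert(b != NULL);
    if (in_hard_irq) {
        b->next = deferred;
        deferred = b;
    } else {
        free(b);
        freed_now++;
    }
}

/* Runs later, outside hard-IRQ context, and frees everything queued. */
static void drain_deferred(void)
{
    assert(!in_hard_irq);
    while (deferred != NULL) {
        struct buf *b = deferred;
        deferred = b->next;
        free(b);
        freed_later++;
    }
}

/* Allocate a zeroed buffer for the demonstration. */
static struct buf *alloc_buf(void)
{
    struct buf *b = calloc(1, sizeof(*b));
    assert(b != NULL);
    return b;
}
```

A caller that does not know its context can always use free_buf_any(); only code that is certain it never runs in hard-IRQ context could call the direct free.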

Signed-off-by: Daniel Wu <dyqith@gmail.com>

Can people test this out? I haven't heard much back from the last round, so I'm hoping that's a good thing.

04/01/06 18:39:59 changed by Stijn Tintel <stijn@linux-ipv6.be>

I have applied spin_locks-20050331.diff to madwifi-ng r1488, and I'm using the patched driver on both my AP and STA. As far as I have tested it, it doesn't _work_ better. I think I even notice a speed decrease...

FYI: I haven't had any kernel panics with madwifi-ng for a while, except for the one reported in #390.

If you need more info or like me to do some more testing, feel free to mail me about it - using this ticket for those things will load it too much imo.

04/03/06 01:39:00 changed by dyqith@gmail.com

to Stijn: I looked through the patches to see what's wrong. It may be that some overly strict IRQ-disabling lock calls are slowing the device driver down. I'll fix that soon, but I would like to make sure everything is stable first.

to svens and jsd@av8n.com: Does the latest version still crash/hang your systems?

I think this patch should help the SMP/PREEMPT folks with single cards/single vaps.

There may be a problem or two with multiple vaps/multiple cards. Anybody willing to test in this area?

04/03/06 08:02:14 changed by svens

The latest version of your patch doesn't crash my machine, and it seems to be working OK. I'll test the patch this week and let you know how things are working.

04/04/06 03:10:40 changed by dimitris@gmail.com

I've had several crashes with IRQ handling in the stack, so I'll try the latest patch, spin_locks-20050331.diff - BTW the filename is misleading :)

My system is an SMP (P4/hyperthreading) machine with one card and two AP vaps.

Would my kernel .config be useful to attach to this bug?

04/04/06 03:13:35 changed by anonymous

  • cc set to dimitris@gmail.com.

04/04/06 03:14:46 changed by anonymous

Yeah, I was still thinking about 2005 when I wrote the patches...

If/when it crashes with the patch, please provide the stack trace/kernel panic message, and either the .config or just let me know if it's SMP/preemptible.

Thanks for testing.

04/04/06 03:17:45 changed by anonymous

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

# Linux kernel version: 2.6.16.1-SMP
# CONFIG_X86_BIGSMP is not set
CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_SMP=y

CONFIG_SCHED_SMT=y

04/04/06 11:27:23 changed by daniel.blueman@gmail.com

  • cc changed from dimitris@gmail.com to dimitris@gmail.com,daniel.blueman@gmail.com.

I'm running an IA32 system with 2.6.16 and desktop (ie voluntary) preemption.

I have found that the changes in spin_locks-20050331.diff eliminate the oops I was seeing when the interface is up. I have seen the AR5212 chip lock up and reset (below) a number of times, but the host system is rock-solid:

wifi0: 11a rates: 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
wifi0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps
wifi0: 11g rates: 1Mbps 2Mbps 5.5Mbps 11Mbps 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
wifi0: turboG rates: 6Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
wifi0: H/W encryption support: WEP AES AES_CCM TKIP
wifi0: mac 10.5 phy 6.1 radio 6.3
wifi0: Use hw queue 1 for WME_AC_BE traffic
wifi0: Use hw queue 0 for WME_AC_BK traffic
wifi0: Use hw queue 2 for WME_AC_VI traffic
wifi0: Use hw queue 3 for WME_AC_VO traffic
wifi0: Use hw queue 8 for CAB traffic
wifi0: Use hw queue 9 for beacons
wifi0: Atheros 5212: mem=0xdfae0000, irq=201
wlan: mac acl policy registered
wifi0: hardware error; reseting
wifi0: hardware error; reseting
wifi0: hardware error; reseting
wifi0: hardware error; reseting
wifi0: hardware error; reseting

My kernel has debug features enabled and I have a serial connection for harvesting a clean backtrace. I'll test for the bringup/teardown oops I was seeing in the unpatched code and post my results within the next day. I'll also jack up the SSID broadcast rate to check for another race in the unpatched code.

04/04/06 22:31:28 changed by dimitris@gmail.com

So far, so good. I've had low to moderate traffic on both vaps, and no crashes or even errors in the logs.

The card is a 3CRDAG675B:

Apr  3 18:12:39 greebo kernel: wifi0: 11a rates: 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
Apr  3 18:12:39 greebo kernel: wifi0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps
Apr  3 18:12:39 greebo kernel: wifi0: 11g rates: 1Mbps 2Mbps 5.5Mbps 11Mbps 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
Apr  3 18:12:39 greebo kernel: wifi0: turboA rates: 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
Apr  3 18:12:39 greebo kernel: wifi0: turboG rates: 6Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
Apr  3 18:12:39 greebo kernel: wifi0: H/W encryption support: WEP AES AES_CCM TKIP
Apr  3 18:12:39 greebo kernel: wifi0: mac 10.5 phy 6.1 radio 6.3
Apr  3 18:12:39 greebo kernel: wifi0: Use hw queue 1 for WME_AC_BE traffic
Apr  3 18:12:39 greebo kernel: wifi0: Use hw queue 0 for WME_AC_BK traffic
Apr  3 18:12:39 greebo kernel: wifi0: Use hw queue 2 for WME_AC_VI traffic
Apr  3 18:12:39 greebo kernel: wifi0: Use hw queue 3 for WME_AC_VO traffic
Apr  3 18:12:39 greebo kernel: wifi0: Use hw queue 8 for CAB traffic
Apr  3 18:12:39 greebo kernel: wifi0: Use hw queue 9 for beacons
Apr  3 18:12:39 greebo kernel: wifi0: Atheros 5212: mem=0xfbe90000, irq=19

04/05/06 15:00:09 changed by daniel.blueman@gmail.com

  • attachment spinlocks-20060406.diff added.

As previous patch, but fix spinlock recursion when unloading modules

04/05/06 15:11:35 changed by daniel.blueman@gmail.com

Sending wireless traffic through the interface is solid with Daniel Wu's 20050331 patch. My kernel is built with spinlock (and all else) debugging; I am seeing this failure when removing the modules:

# ifdown wlan0 && rmmod ath_pci ath_rate_sample ath_hal
BUG: spinlock recursion on CPU#0, wlanconfig/6522
 lock: f7b2eb94, .magic: dead4ead, .owner: wlanconfig/6522, .owner_cpu: 0
 [<c01041fd>] show_trace+0xd/0x10
 [<c0104217>] dump_stack+0x17/0x20
 [<c01d29ee>] spin_bug+0x9e/0xc0
 [<c01d2d64>] _raw_spin_lock+0x154/0x160
 [<c02cd568>] _spin_lock+0x8/0x10
 [<f88f104b>] acl_free_all_locked+0xb/0x40 [wlan_acl]
 [<f88f10a3>] acl_detach+0x23/0x60 [wlan_acl]
 [<f8959e76>] ieee80211_proto_vdetach+0x26/0x30 [wlan]
 [<f8947505>] ieee80211_vap_detach+0x95/0x130 [wlan]
 [<f8908f7a>] ath_vap_delete+0x10a/0x320 [ath_pci]
 [<f896245c>] ieee80211_ioctl+0x9c/0x550 [wlan]
 [<c0264368>] dev_ifsioc+0xe8/0x360
 [<c0264867>] dev_ioctl+0x227/0x3e0
 [<c025923a>] sock_ioctl+0x3a/0x270
 [<c0165330>] do_ioctl+0x20/0x70
 [<c01653d7>] vfs_ioctl+0x57/0x2b0
 [<c0165669>] sys_ioctl+0x39/0x60
 [<c0102e01>] syscall_call+0x7/0xb

From the code, it is clear that net80211/ieee80211_acl.c::acl_free_all_locked() is taking the ACL spinlock a second time; it is taken the first time in the call-site in net80211/ieee80211_acl.c::acl_detach().

I updated Daniel Wu's great patch with a fix for this - further testing has shown no load/unload and in-use problems on my 2.6.16 preempt+debug kernel.

Perhaps this can be committed for wider testing soon, since the current SVN head is really broken?

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>

04/05/06 19:21:10 changed by dyqith

Hey Daniel B, thanks for your patch. Can you also check the code path from acl_free_all()? I think that one needs to take a lock when it calls acl_free_all_locked(). I'm a little too busy at the moment to create/test a patch.

Thanks. Let me also ask the rest of the devel team and see when we'll merge this to trunk.

04/06/06 08:16:08 changed by daniel.blueman@gmail.com

  • attachment spinlocks-20060406-2.diff added.

As previous, with further corrections to ACL locking

04/06/06 08:17:49 changed by daniel.blueman@gmail.com

I've added the locking into acl_free_all() as it should be, and added the lock assertion in acl_free_all_locked() which should have caught the unlocked access. The patch fixes a couple of trivial typos too.
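
The convention described here, where a *_locked helper expects its caller to hold the lock and asserts it, can be sketched in plain C. This is a single-threaded userspace model and not the contents of the attached patch: every toy_* name and struct acl_state are hypothetical, and the boolean "lock" only imitates what spinlock debugging reports as recursion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct acl_entry {
    struct acl_entry *next;
};

struct acl_state {
    bool locked;              /* toy lock: true while held */
    struct acl_entry *head;
    int nentries;
};

static void toy_lock(struct acl_state *as)
{
    assert(!as->locked);      /* taking it twice is the "spinlock recursion" bug */
    as->locked = true;
}

static void toy_unlock(struct acl_state *as)
{
    assert(as->locked);
    as->locked = false;
}

/* Caller must already hold the lock; this helper never takes it itself. */
static void toy_acl_free_all_locked(struct acl_state *as)
{
    assert(as->locked);       /* lock assertion catches unlocked callers */
    while (as->head != NULL) {
        struct acl_entry *e = as->head;
        as->head = e->next;
        free(e);
        as->nentries--;
    }
}

/* Public entry point: takes the lock, then delegates to the helper. */
static void toy_acl_free_all(struct acl_state *as)
{
    toy_lock(as);
    toy_acl_free_all_locked(as);
    toy_unlock(as);
}

/* Add one entry under the lock so there is something to free. */
static void toy_acl_add(struct acl_state *as)
{
    struct acl_entry *e = calloc(1, sizeof(*e));
    assert(e != NULL);
    toy_lock(as);
    e->next = as->head;
    as->head = e;
    as->nentries++;
    toy_unlock(as);
}
```

If toy_acl_free_all_locked() called toy_lock() itself, the assertion in toy_lock() would fire whenever it was reached from toy_acl_free_all(), which mirrors the recursion reported in the backtrace earlier in this ticket.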

Tests out well with all kernel debugging enabled and voluntary preempt on IA32.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>

04/06/06 21:19:42 changed by dyqith

I just found out that I can't merge this patch into trunk unless it's 100% stable in what it's supposed to fix.

So, I'll try to spend more time going through every piece of the code to ensure all locks are where they should be.

Of course, if anyone wants to help out, they can.

04/09/06 18:40:19 changed by daniel.blueman@gmail.com

Perhaps a good approach is to break out the patch into fixes for different areas and verify independently?

We know the changes to the ACL code in ieee80211_acl.c correct lock usage, are tested and ready to commit.

I've broken this into a separate patch, acl-locking-20060409.diff (slightly updated from previous big patch).

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>

04/09/06 18:41:46 changed by daniel.blueman@gmail.com

  • attachment acl-locking-20060409.diff added.

Fix ACL lock usage - tested

04/09/06 20:09:32 changed by dyqith

I thought about breaking things into smaller pieces, but sometimes we won't be able to test some locking mechanisms when other locking mechanisms cause the recursions and deadlocks.

Also, I changed the #defines a lot, which would cause everything to change anyway.

From what I can tell, the crashes/panics I get now (after applying the patch) are not from spinlock recursions/deadlocks, but from memory freeing errors and from locks not being held in some places.

So, I guess it solves the original problem of deadlocks/recursions, but not these two new errors that only show up when the locking mechanisms are correct.

Any thoughts?

04/09/06 22:12:59 changed by daniel.blueman@gmail.com

With the big patch and the ACL locking patch, I have yet to see a crash running in AP mode. The kernel is compiled UP with voluntary preemption enabled, so spinlocks are not no-ops.

What are you doing when you get crashes? Can you post a clean backtrace? I can try to reproduce if you can supply the steps.

Can the other project developers do some testing/code review? Maybe we can get small and obvious changes merged, like the ACL locking and anything you can break out of the big patch? It would be good to move this forward a bit.

04/20/06 00:12:47 changed by dyqith

  • attachment spinlocks-20060419.diff added.

updated spinlock patch to r1518

04/20/06 09:32:14 changed by mrenzmann

  • status changed from assigned to new.
  • owner changed from mrenzmann to dyqith.

Reviewed the latest patch, and it looks good. Didn't find the time to test it, though, and most probably won't find it soon.

If there are any objections or not yet reported problems, please speak up now.

@dyqith: If no objections are raised during the next 24 hours, feel free to commit the patch.

04/21/06 20:24:55 changed by dyqith

  • status changed from new to closed.
  • resolution set to fixed.

Patch committed, closing ticket.

04/21/06 20:54:05 changed by anonymous

The patch with [1491] worked for days without problems. However, [1520] fails:

Apr 21 11:48:38 greebo kernel: ath_hal: module license 'Proprietary' taints kernel.
Apr 21 11:48:38 greebo kernel: ath_hal: dummy (dummy)
Apr 21 11:48:38 greebo kernel: wlan: 0.8.4.2 (svn 1520)
Apr 21 11:48:38 greebo kernel: ath_rate_sample: 1.2 (svn 1520)
Apr 21 11:48:38 greebo kernel: ath_pci: 0.9.4.5 (svn 1520)
Apr 21 11:48:38 greebo kernel: ACPI: PCI Interrupt 0000:02:03.0[A] -> GSI 19 (level, low) -> IRQ 19
Apr 21 11:48:38 greebo kernel: wifi%%d: unable to attach hardware: 'No hardware present or device not yet supported' (HAL status 1)
Apr 21 11:48:38 greebo kernel: ACPI: PCI interrupt for device 0000:02:03.0 disabled

04/21/06 22:00:10 changed by dyqith

Yeah, a change in 1519 caused that. I'll see if I can back it out.

04/21/06 22:42:39 changed by anonymous

It's also "fixable" by backing out a Makefile's changes, see #557.