Please note: This project is no longer active. The website is kept online for historic purposes only.
If you´re looking for a Linux driver for your Atheros WLAN device, you should continue here .

Ticket #228 (closed defect: fixed)

Opened 13 years ago

Last modified 13 years ago

[PATCH] Apparent race condition between mode changing and scanning.

Reported by: imr1@waikato.ac.nz Assigned to:
Priority: major Milestone: version 0.9.2
Component: madwifi: other Version: trunk
Keywords: Cc: scottraynel@gmail.com, matt@crc.net.nz
Patch is attached: 1 Pending:

Description

Under revision 1358 of the SVN there is an apparent race condition between scanning and changing mode. When active scanning changing the mode (iwpriv athX mode Y) will either work immediately, or will have no effect until a scan is manually prompted (iwlist athX scan) at which point it changes to the new mode prior to commencing the scan. If the scanning debug output is turned on (80211debug +scan) then the change in mode always occurs immediately.

If the interface is associated mode changes occur immediately. The debug output affecting the behaviour in this case suggests a race condition, and I am investigating the exact cause at this point.

Attachments

imr-delay-cancel-scan-patch.diff (435 bytes) - added by imr1@waikato.ac.nz on 12/18/05 22:20:21.
Delays completion of the ieee80211_cancel_scan() method until the scan is actually cancelled.
imr-setmode-delay-patch.diff (0.7 kB) - added by imr1@waikato.ac.nz on 12/18/05 23:05:39.
More focused fix, which is much less likely to break other areas.
setmode-delay-with-timeout.diff (1.2 kB) - added by scottraynel@gmail.com on 02/01/06 23:41:54.
Revised setmode patch which includes a timeout
setmode-delay-with-timeout2.diff (1.2 kB) - added by scottraynel@gmail.com on 02/01/06 23:50:34.
Revised setmode patch which includes a timeout (fixed)

Change History

12/14/05 22:40:51 changed by imr1@waikato.ac.nz

A slight correction. If a manual scan is required to force the mode change then the scan will return no scan results. Thus it cannot be determined what set of channels is actually being scanned, if any.

12/14/05 22:57:15 changed by imr1@waikato.ac.nz

When swapping between two modes where there should be scan results on both the manual scan which kick-starts the mode change returns no scan results. This implies that no scanning is actually occurring.

As a clarification, once the mode has actually changed (including through the prompting by a manual scan) the device will scan correctly.

12/14/05 23:22:52 changed by Matt Brown

  • cc changed from smr26@cs.waikato.ac.nz to smr26@cs.waikato.ac.nz, matt@crc.net.nz.

12/15/05 08:00:43 changed by mrenzmann

  • version set to trunk.
  • milestone set to version 0.9.0 - move to new codebase.

In addition to the comments to this ticket, there is a discussion about this issue on madwifi-devel. See: http://thread.gmane.org/gmane.linux.drivers.madwifi.devel/1802

12/18/05 22:20:21 changed by imr1@waikato.ac.nz

  • attachment imr-delay-cancel-scan-patch.diff added.

Delays completion of the ieee80211_cancel_scan() method until the scan is actually cancelled.

12/18/05 22:26:32 changed by imr1@waikato.ac.nz

The attached patch adds a call to mdelay() at the end of the ieee80211_cancel_scan() method in ieee80211_scan.c

This delays the return from ieee80211_cancel_scan() long enough that the timer it is waiting to fire before the scan is actually cancelled can fire before other code can execute. Prior to the patch the timer wasn't firing until after ieee80211_newstate() had tried to start a new active scan, which failed because a scan was already in progress. Thus, because mode changes are intimately tied to the beginning of scans, the mode wasn't changed and scanning stopped.

12/18/05 23:05:39 changed by imr1@waikato.ac.nz

  • attachment imr-setmode-delay-patch.diff added.

More focused fix, which is much less likely to break other areas.

12/18/05 23:10:13 changed by imr1@waikato.ac.nz

Added another patch (imr-setmode-delay-patch.diff), this one spins on the flag which indicates the scan has been cancelled inside ieee80211_ioctl_setmode() and delays a short time if the flag indicates the scan is still running.

This fixes this bug, and is much less likely to break other areas. On the other hand, it will also not fix similar race conditions in other areas. Is probably the preferable patch overall though.

Signed-off-by: Ian M. Rawley <imr1@waikato.ac.nz>

(Signature also applies to the previous patch.

01/26/06 22:32:30 changed by ttrimble@ucsd.edu

These patches did not work for me. I am using rev 1416 and tried both of these patches and I still get the same error: root-root> iwpriv ath0 mode 1 scheduling while atomic: iwpriv/0x00000100/7639

[<c027938c>] schedule+0x5ac/0x5c0 [<c010f785>] try_to_wake_up+0x35/0x90 [<c027940e>] wait_for_completion+0x6e/0xb0

...

Commands after modprobe: wlanconfig ath0 destroy wlanconfig ath0 create wlandev wifi0 wlanmode wds ifconfig ath0 up iwpriv ath0 mode 1 error...

lspci: 0000:00:10.0 Ethernet controller: Atheros Communications, Inc. AR5212 802.11abg NIC (rev 0)

modprobe ath_pci:

root-root> modprobe ath_pci ath_hal: 0.9.16.13 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413, DFS) ath_rate_sample: 1.2 ath_pci: 0.9.4.5 (Atheros/multi-bss) wifi0: 11a rates: 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps wifi0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps wifi0: 11g rates: 1Mbps 2Mbps 5.5Mbps 11Mbps 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps wifi0: turboA rates: 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps wifi0: turboG rates: 6Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps wifi0: H/W encryption support: WEP AES AES_CCM TKIP wifi0: mac 5.9 phy 4.3 radio 3.6 wifi0: Use hw queue 1 for WME_AC_BE traffic wifi0: Use hw queue 0 for WME_AC_BK traffic wifi0: Use hw queue 2 for WME_AC_VI traffic wifi0: Use hw queue 3 for WME_AC_VO traffic wifi0: Use hw queue 8 for CAB traffic wifi0: Use hw queue 9 for beacons wifi0: Atheros 5212: mem=0xa0000000, irq=10

01/27/06 03:26:14 changed by imr1@waikato.ac.nz

I've just reproduced your problem, and it only happens when swapping to wds mode. I think it is significantly more closely related to tickets #94, #183 and #237. At the bottom of the stack trace should be the message "unable to load wlan_scan_wds" if I am right, which means that it hasn't loaded the scanner module and thus can't start a scan.

01/31/06 22:26:33 changed by scottraynel@gmail.com

  • cc changed from smr26@cs.waikato.ac.nz, matt@crc.net.nz to scottraynel@gmail.com, matt@crc.net.nz.
  • summary changed from Apparent race condition between mode changing and scanning. to [PATCH] Apparent race condition between mode changing and scanning..

This patch works for me, and has been sitting here for quite some time - can it please be applied to trunk at some point soon. Cheers.

02/01/06 06:42:27 changed by kelmo

  • patch_attached set to 1.

02/01/06 07:47:02 changed by kelmo

any comments to an enhancement of the attached patch suggested by dplatt on #275 concerning an alternative to using mdelay?

02/01/06 23:41:15 changed by scottraynel@gmail.com

Hi,

I read the comments on #275, and added an iteration with a timeout to the current patch attached to this ticket, in order to catch the case where cancel_scan doesn't work. My solution returns ETIMEDOUT on a timeout, but it appears that the userspace tools (iwpriv) don't report errors, so if the cancel_scan does time out (I simulated this by making my timeout 1ms), the driver is left in an inconsistent state, and the user is not notified. I added a debug message to the scan debug, however enabling the scanning debug output causes this race condition to not occur, so a scan debug message doesn't help at all.

When the cancel_scan call does time out, there is no way to gracefully bail out using the current design of the drivers - in fact, I think it's more to do with the design of the net80211 stack, which seems inherently racy. In order to get the driver back to a known state, we must change to the scanning state (which forces all of the current settings), but if cancel_scan has timedout, then we cannot start a new one without locking up (the original problem this ticket tries to solve), so we have a problem.

At present, my current solution is this: Make the timeout value something like 100ms, which should be fine for 99.9% of the time (cancel_scan should, in theory, only take a maximum of 10ms to fire). However, if cancel_scan does time out (or another scan is started between cancel_scan firing and us noticing the cancellation of the scan), then there is a problem. There is no way at present to bail nicely, so I suggest using a straight printk to inform the user of the problem. Symptoms upon failure that I see are the interface going into ad-hoc mode (from sta mode), and jumping between two frequencies - not the desired result. Destroying and re-creating the interface does not fix the problem, only removing and reloading the driver.

Unfortunately, I think a complete fix to the problem is non-trivial. It may be that we comprimise, and be happy that 99.9% of the time it works, with the possibility that the driver may get into a completely unusable state, but without locking up the system. As a side note, we might want to declare the timeout value in ieee80211_vars.h, so that a solution for #275 can use the same value.

02/01/06 23:41:54 changed by scottraynel@gmail.com

  • attachment setmode-delay-with-timeout.diff added.

Revised setmode patch which includes a timeout

02/01/06 23:50:34 changed by scottraynel@gmail.com

  • attachment setmode-delay-with-timeout2.diff added.

Revised setmode patch which includes a timeout (fixed)

02/01/06 23:52:16 changed by scottraynel@gmail.com

Fixed the while loop condition of the setmode-delay-with-timeout.diff patch.

Signed-off-by: Scott Raynel <scottraynel@gmail.com>

04/19/06 07:54:22 changed by dyqith

Since this patch hasn't been applied for a long time, I was wondering if this race condition is still happening ?

04/19/06 08:05:33 changed by Matt Brown

We're currently running r1474 with this patch applied and haven't seen any problems for weeks.

I have no reason to believe that the problem has gone away, without the patch we were able to reliably reproduce this problem.

Do you have any specific commits that you believe would have affected this problem? Please don't interpret our silence as anything other than us being happy with the patch and patiently waiting for a developer to have enough time to commit it.

Cheers

04/19/06 09:57:52 changed by kelmo

This patch was queued because the proposed patch from #275 had the possibility to use a common timeout loop.

04/20/06 14:23:01 changed by kelmo

  • milestone changed from version 0.9.0 - move to new codebase to version 0.9.x - progressive release candidate phase.

05/15/06 01:40:38 changed by dyqith

related comment in ticket:572

05/28/06 18:27:03 changed by Mister_X

  • patch_attached set to 1.

Rechecked Patch Attached (because of spam)

07/04/06 12:21:12 changed by kelmo

  • status changed from new to closed.
  • resolution set to fixed.

setmode-delay-with-timeout2.diff was applied to r1665

07/04/06 12:26:21 changed by kelmo

  • milestone changed from version 0.9.x - progressive release candidate phase to version 0.9.2.