Please note: This project is no longer active. The website is kept online for historic purposes only.
If you're looking for a Linux driver for your Atheros WLAN device, you should continue here.

Ticket #1897 (new defect)

Opened 6 years ago

Last modified 3 years ago

kernel oops

Reported by: joerg@duck.franken.de Assigned to:
Priority: major Milestone:
Component: madwifi: other Version: trunk
Keywords: Cc:
Patch is attached: 0 Pending:

Description

I can reproduce a kernel crash in AP mode with madwifi revisions from at least r3314 through r3543, and probably with earlier and later ones as well. 0.9.4 is also affected by this bug. I tried this on a Gateworks IXP425 board and a PC-Engines WRAP board, with a CM9, a Senao NMP8602 and an NMP8602-PLUS card.

The software on the test box sends Ethernet frames to a client that is either not yet connected, or was connected, went out of range and is re-associating. The oops is preceded by a rate-control failure for the client (this happens with at least ath_rate_sample and ath_rate_onoe):

ath_rate_sample: no rates for 00:02:6f:47:f1:0a?

As a result of the "no rates" the code will run into an error condition in ath/if_ath.c:

        if (txrate == 0) {
                /* Drop frame, if the rate is 0.
                 * Otherwise this may lead to the continuous transmission of
                 * noise. */
                printk("%s: invalid TX rate %u (%s: %u)\n", dev->name,
                        txrate, __func__, __LINE__);
                return -EIO;
        }

After this, the cleanup function cleanup_ath_buf_debug() is called, and this is where the kernel oopses:

        if (bf->bf_skbaddr) {
                bus_unmap_single(
                        sc->sc_bdev,
                        bf->bf_skbaddr,
                        (direction == BUS_DMA_FROMDEVICE ?
                                sc->sc_rxbufsize : bf->bf_skb->len),
                        direction);
                bf->bf_skbaddr = 0;
                bf->bf_desc->ds_link = 0;
                bf->bf_desc->ds_data = 0;
        }

In my "no rates" case, when sending data to the client, the direction is BUS_DMA_TODEVICE but bf->bf_skb is NULL, hence the crash when dereferencing bf->bf_skb->len.
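To make the failing expression concrete, here is a minimal sketch of a NULL-safe version of the length computation. The struct definitions are simplified stand-ins for the driver types, the enum values are arbitrary, and unmap_len() is a hypothetical helper that does not exist in madwifi. Returning 0 only avoids the dereference; the length handed to an unmap call should normally match the length that was mapped, so this illustrates the condition rather than proposing a fix.

```c
#include <stddef.h>

/* Simplified stand-ins for the driver types; only the fields used in the
 * excerpt above are modelled. The enum values are arbitrary. */
enum { BUS_DMA_TODEVICE = 1, BUS_DMA_FROMDEVICE = 2 };

struct sk_buff { unsigned int len; };
struct ath_buf { unsigned long bf_skbaddr; struct sk_buff *bf_skb; };
struct ath_softc { unsigned int sc_rxbufsize; };

/* Hypothetical helper: the length argument for bus_unmap_single().
 * For RX buffers it is the fixed RX buffer size; for TX buffers it is the
 * skb length, or 0 if the buffer is still mapped but has no skb attached
 * (the situation described in this ticket). */
static unsigned int
unmap_len(const struct ath_softc *sc, const struct ath_buf *bf, int direction)
{
	if (direction == BUS_DMA_FROMDEVICE)
		return sc->sc_rxbufsize;
	return bf->bf_skb != NULL ? bf->bf_skb->len : 0;
}
```

With bf_skb set to NULL and direction BUS_DMA_TODEVICE, the original expression dereferences NULL, while the helper above returns 0.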

As a workaround I tried to simply handle the "no rates" case in ath_tx_start() by setting the txrate to either

                rix = sc->sc_minrateix;
                txrate = rt->info[rix].rateCode;

or

vap->iv_mcast_rate

But while this helps prevent the oops by avoiding the error path above, it exposes a memory leak. I am not sure whether the leak is caused by "my fix" or by the same problem that initially caused the "no rates".
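The first variant of the workaround can be sketched in isolation as follows. The rate table layout is a reduced stand-in for the HAL rate table, fallback_txrate() is a hypothetical helper, and the bounds check is an addition for the sketch. As noted above, this only sidesteps the error path and does not address the leak.

```c
#include <stdint.h>

/* Reduced stand-in for the HAL rate table; only rateCode is modelled. */
struct rate_info { uint8_t rateCode; };
struct rate_table { int rateCount; struct rate_info info[8]; };

/* Hypothetical helper mirroring the workaround: if rate control returned
 * rate code 0 ("no rates"), fall back to the code of the minimum-rate
 * index; otherwise keep the selected rate. Returns 0 if the fallback
 * index is outside the table, so the caller can still drop the frame. */
static uint8_t
fallback_txrate(const struct rate_table *rt, int minrateix, uint8_t txrate)
{
	if (txrate != 0)
		return txrate;
	if (minrateix < 0 || minrateix >= rt->rateCount)
		return 0;
	return rt->info[minrateix].rateCode;
}
```

In the driver, minrateix would correspond to sc->sc_minrateix and rt to the current rate table.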

I have tried to dig deeper but didn't get very far. I did notice the "ath_rate_sample: no rates for 00:02:6f:47:f1:0a?" seems to get triggered by an ath_tx_start() that is called before ieee80211_node_join() was done. Since I was looking for a memory leak I looked at the addresses of the ath_nodes when ath_rate_sample is called for the station I want to send data to and when the join() is done - but that is the same address, so there are NOT two nodes being created for the frame when the STA hasn't yet been fully associated at tx-time.

But I am rather helpless as to where to look from here.

I have attached my test-tool that I use to reproduce this. All it does is create a VAP in AP mode and send as many ethernet packets as possible to the client's Mac-Address. As soon as the client associates to the AP the Oops will happen (or with my workaround I will see the "ath_rate_sample: no rates for 00:02:6f:47:f1:0a?" and then have lost about 4k of memory).

I am calling my tool with "wlantest -c 11 -m 00:02:6f:47:f1:0a" and then associate a secondary box as a client to my AP on channel 11. One of my observations is that some brands of clients do not seem to be able to trigger this situation, but with all Atheros or Intel Cards/Drivers I can reproduce this 100%. While my tool seems to trigger this bug in an odd situation (sending to a MAC address that isn't yet known to the system), I have seen the exact same problem in the wild, just not reproducible so easily. I am also getting the "no rates" problem in another setup with a regular client connecting to the AP with WEP in "open" mode, but I am not done debugging this. All I know is that it is the same "no rates" error that later would cause the oops. In this setup I am not doing anything weird like sending directly to a not-yet-associated MAC; it is just a regular laptop trying to surf the Web.

I would love to help debug this further, but am out of ideas where to dig further right now.

joerg

Attachments

wlantest.tar.gz (4.3 kB) - added by joerg@duck.franken.de on 04/21/08 13:51:24.
my test tool to trigger the oops within 20 seconds
oops-avila-2.6.23.12-r3538.txt (5.4 kB) - added by joerg@duck.franken.de on 04/21/08 15:47:48.
decoded kernel oops on IXP425 box
oops-ixp425-2.6.21.4-r3624.txt (6.9 kB) - added by ahmet@thbluezone.com on 05/14/08 23:04:04.
17SEP_AE2_HE_T1_C1.txt (132.0 kB) - added by karthikg@deccantechnosoft.com on 09/17/08 18:40:11.
Sample RCA Crashes.

Change History

04/21/08 13:51:24 changed by joerg@duck.franken.de

  • attachment wlantest.tar.gz added.

my test tool to trigger the oops within 20 seconds

04/21/08 14:08:16 changed by mrenzmann

Could you please provide a dump of the (decoded) oops message for this issue? DevDocs/KernelOops might help.

04/21/08 15:47:48 changed by joerg@duck.franken.de

  • attachment oops-avila-2.6.23.12-r3538.txt added.

decoded kernel oops on IXP425 box

04/21/08 15:49:20 changed by joerg@duck.franken.de

I have just re-created the situation on the Gateworks IXP425 Box and have attached the oops. Kernel is "Linux avila 2.6.23.12 #2 Mon Apr 21 15:20:55 CEST 2008 armv5teb unknown", madwifi is svn r3538 without any patches.

But while trying to get the oops captured I noticed something that never occurred to me before, as I was quick with adding "my fix" to the sources. On the IXP425 (big endian) I get the oops, on my i386 test box I only get

ath_rate_sample: no rates for 00:0b:6b:37:ed:b8?
wifi0: ath_tx_start: Invalid transmission rate, 0.

but no oops - yet my memory gets tighter when this happens (at least I assume that is when the decrease happens; if I don't run the test, I do not observe a constant decrease in free mem over hours).

04/24/08 16:11:15 changed by madxray@gmx.net

I have the same behaviour on Meraki with r3348 and two patches from OpenWRT:

  • 100-kernel_cflags.patch (no influence on rates, only to make madwifi compile)
  • 111-minstrel_crash.patch (avoid bad rc1 panic: return if sn->num_rates < 0)

Oops msg needed?

05/14/08 22:59:45 changed by ahmet@thebluezone.com

Same thing happening to me with Madwifi r3624 and kernel 2.6.21.4 on IXP425 board. Any fixes for this issue? I am attaching the dump as well.

05/14/08 23:04:04 changed by ahmet@thbluezone.com

  • attachment oops-ixp425-2.6.21.4-r3624.txt added.

09/17/08 18:35:53 changed by karthikg@deccantechnosoft.com

Same thing happening to me with Madwifi r3314 with updates Kernel 2.6.26.3 on IXP425 board having 3 ubiquiti SR5 radios

09/17/08 18:40:11 changed by karthikg@deccantechnosoft.com

  • attachment 17SEP_AE2_HE_T1_C1.txt added.

Sample RCA Crashes.

(follow-up: ↓ 7 ) 09/18/08 22:05:38 changed by mtaylor@emeraldcave.net

The problem appears to be that the DMA mapping for the buffer still exists while the skb has been freed or not set. I'll dig into this a little and let you know what I find. Has anyone reproduced this with a board other than

(in reply to: ↑ 6 ) 12/17/08 01:36:23 changed by anonymous

Replying to mtaylor@emeraldcave.net:

The problem appears to be that the DMA mapping for the buffer still exists while the skb has been freed or not set. I'll dig into this a little and let you know what I find. Has anyone reproduced this with a board other than

Looks like it only happens on XScale processors. I'm using an IXP435 processor, kernel 2.6.24, madwifi trunk 3314 with OpenWRT patch, Atheros HAL provided by OpenWrt, DD-WRT and MakSat Technologies. I'm having the same problem here, kernel oops.

12/18/08 06:28:19 changed by anonymous

When txrate == 0, ath_hardstart() jumps to hardstart_fail, where it sets bf_skb to NULL and then calls the cleanup function, which eventually calls bus_unmap_single() and tries to pass bf->bf_skb->len as a parameter. Since bf_skb was set to NULL earlier, this dereferences a NULL pointer and causes the kernel oops.

ath_hardstart(struct sk_buff *skb, struct net_device *dev)
{
...
hardstart_fail:
	/* Clear all SKBs from the buffers, we will clear them separately IF
	 * we do not requeue them. */
	ATH_TXBUF_LOCK_IRQ(sc);
	STAILQ_FOREACH_SAFE(tbf, &bf_head, bf_list, tempbf) {
		tbf->bf_skb = NULL;
	}
	ATH_TXBUF_UNLOCK_IRQ(sc);
	/* Release the buffers, now that skbs are disconnected */
	ath_return_txbuf_list(sc, &bf_head);
...
}


cleanup_ath_buf(struct ath_softc *sc, struct ath_buf *bf, int direction)
{
...
		bus_unmap_single(
			sc->sc_bdev,
			bf->bf_skbaddr,
			(direction == BUS_DMA_FROMDEVICE ?
				sc->sc_rxbufsize : bf->bf_skb->len),
			direction);
...
}

How to fix this?
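One possible direction, shown only as a self-contained model and not as a tested patch: release the DMA mapping while the skb is still attached, and only then disconnect it, so that the later cleanup finds bf_skbaddr == 0 and skips the unmap. The types are reduced stand-ins, fake_unmap() replaces bus_unmap_single() so the ordering can be observed, and detach_skb() is a hypothetical helper that does not exist in madwifi. The sketch covers the TX direction only.

```c
#include <stddef.h>

/* Reduced stand-ins for the driver types. */
struct sk_buff { unsigned int len; };
struct ath_buf {
	unsigned long bf_skbaddr;   /* non-zero while a DMA mapping exists */
	struct sk_buff *bf_skb;
};

/* Stand-in for bus_unmap_single(): records how it was called. */
static unsigned int last_unmap_len;
static int unmap_calls;

static void
fake_unmap(unsigned long addr, unsigned int len)
{
	(void)addr;
	last_unmap_len = len;
	unmap_calls++;
}

/* Hypothetical helper for the hardstart_fail path: unmap using the skb
 * length while the skb pointer is still valid, clear bf_skbaddr, and only
 * then disconnect the skb from the buffer. */
static void
detach_skb(struct ath_buf *bf)
{
	if (bf->bf_skbaddr != 0 && bf->bf_skb != NULL) {
		fake_unmap(bf->bf_skbaddr, bf->bf_skb->len);
		bf->bf_skbaddr = 0;
	}
	bf->bf_skb = NULL;
}
```

Whether this ordering is safe under the driver's locking, and whether it interacts with the memory leak reported above, would still need to be checked against the real code.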

01/21/09 00:08:45 changed by anonymous

This bug is a big problem for me. I can very reliably reproduce this problem by making a VoIP call through wifi and when the call is connected, simply walk out of wifi range. Does anyone know how to fix this bug?

08/19/09 09:51:49 changed by anonymous

I am also facing this same problem.