I can reproduce a kernel crash with madwifi revisions from at least r3314 to r3543 in AP mode, and probably with earlier and later revisions as well; 0.9.4 is also affected by this bug. I tried this on a Gateworks IXP425 board and a PC Engines WRAP board with CM9, Senao NMP8602 and NMP8602-PLUS cards.
The software on the test box sends ethernet frames to a client that is either not yet associated, or was associated, went out of range and is re-associating. Right before the oops there is a problem with the rate control for that client (this happens with at least ath_rate_sample and ath_rate_onoe):
ath_rate_sample: no rates for 00:02:6f:47:f1:0a?
As a result of the "no rates" condition, the code runs into an error path in ath/if_ath.c:
if (txrate == 0) {
	/* Drop frame, if the rate is 0.
	 * Otherwise this may lead to the continuous transmission of
	 * noise. */
	printk("%s: invalid TX rate %u (%s: %u)\n", dev->name,
		txrate, __func__, __LINE__);
	return -EIO;
}
After this, the cleanup function cleanup_ath_buf_debug() is called, and this is where the kernel oopses:
if (bf->bf_skbaddr) {
	bus_unmap_single(
		sc->sc_bdev,
		bf->bf_skbaddr,
		(direction == BUS_DMA_FROMDEVICE ?
			sc->sc_rxbufsize : bf->bf_skb->len),
		direction);
	bf->bf_skbaddr = 0;
	bf->bf_desc->ds_link = 0;
	bf->bf_desc->ds_data = 0;
}
In my "no rates" case, when sending data to the client, the direction is BUS_DMA_TODEVICE but bf->bf_skb is NULL, hence the crash when dereferencing bf->bf_skb->len.
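Just to illustrate where the dereference could be avoided, here is a minimal sketch of a guard in that cleanup path. This is my own idea, not a tested fix, and I don't know whether skipping the unmap for a NULL bf_skb is actually correct or whether the real bug is that we get here with a DMA mapping but no skb at all:

if (bf->bf_skbaddr) {
	/* Sketch only: in the TODEVICE case seen here bf->bf_skb can be
	 * NULL, so don't dereference it to get the length. */
	if (direction == BUS_DMA_FROMDEVICE || bf->bf_skb != NULL) {
		bus_unmap_single(
			sc->sc_bdev,
			bf->bf_skbaddr,
			(direction == BUS_DMA_FROMDEVICE ?
				sc->sc_rxbufsize : bf->bf_skb->len),
			direction);
	}
	bf->bf_skbaddr = 0;
	bf->bf_desc->ds_link = 0;
	bf->bf_desc->ds_data = 0;
}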
As a workaround I tried to simply handle the "no rates" case in ath_tx_start() by setting the txrate either to

	rix = sc->sc_minrateix;
	txrate = rt->info[rix].rateCode;

or to

	vap->iv_mcast_rate

While this prevents the oops by not running into the error situation above, it exposes a memory leak. I am not sure whether the leak is caused by "my fix" or by the same problem that initially caused the "no rates".
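For reference, this is roughly how I wired the first variant into ath_tx_start(); the exact placement relative to the txrate == 0 check is my own guess and may well be related to the leak:

if (txrate == 0) {
	/* Workaround sketch: fall back to the lowest rate instead of
	 * dropping the frame with -EIO. This avoids the oops, but memory
	 * appears to leak as described below. */
	rix = sc->sc_minrateix;
	txrate = rt->info[rix].rateCode;
}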
I have tried to dig deeper but didn't get very far. I did notice that the "ath_rate_sample: no rates for 00:02:6f:47:f1:0a?" message seems to be triggered by an ath_tx_start() that is called before ieee80211_node_join() has run. Since I was looking for a memory leak, I compared the address of the ath_node when ath_rate_sample is called for the station I want to send data to with the address seen when the join() is done - it is the same address, so there are NOT two nodes being created for the frame when the STA hasn't yet been fully associated at tx time.
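(The comparison was nothing more sophisticated than a diagnostic printk of my own in both code paths - in the rate lookup of ath_rate_sample and in ieee80211_node_join() - roughly like this; the placement is mine, not anything already in the driver:)

	/* Hypothetical diagnostic: print the node pointer in both paths
	 * to see whether the same ieee80211_node is used. */
	printk("%s: ni=%p\n", __func__, ni);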
But I am rather at a loss as to where to look from here.
I have attached the test tool I use to reproduce this. All it does is create a VAP in AP mode and send as many ethernet packets as possible to the client's MAC address. As soon as the client associates to the AP, the oops happens (or, with my workaround, I see the "ath_rate_sample: no rates for 00:02:6f:47:f1:0a?" message and have lost about 4k of memory).
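In case it helps while the attachment is not at hand: the sending part of the tool is essentially a raw-socket packet flooder along these lines. This is a simplified sketch, not the attached wlantest itself - the real tool also creates the VAP, and the interface name and frame contents here are placeholders (it needs to run as root):

/* Simplified sketch: flood ethernet frames to a given destination MAC
 * on the AP's VAP interface, as fast as possible. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <net/ethernet.h>

int main(void)
{
	unsigned char dst[ETH_ALEN] = { 0x00, 0x02, 0x6f, 0x47, 0xf1, 0x0a };
	unsigned char frame[256];
	struct sockaddr_ll addr;
	int s, ifindex;

	s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (s < 0) {
		perror("socket");
		return 1;
	}
	ifindex = if_nametoindex("ath0");	/* placeholder VAP name */
	if (ifindex == 0) {
		perror("if_nametoindex");
		return 1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sll_family = AF_PACKET;
	addr.sll_ifindex = ifindex;
	addr.sll_halen = ETH_ALEN;
	memcpy(addr.sll_addr, dst, ETH_ALEN);

	/* Build a minimal ethernet frame: destination MAC, a dummy source
	 * MAC and a dummy ethertype/payload. */
	memset(frame, 0, sizeof(frame));
	memcpy(frame, dst, ETH_ALEN);
	memset(frame + ETH_ALEN, 0x42, ETH_ALEN);
	frame[12] = 0x08;
	frame[13] = 0x00;

	/* Send as fast as possible; the oops happens once the client
	 * (re-)associates while frames are already queued for it. */
	for (;;) {
		if (sendto(s, frame, sizeof(frame), 0,
			   (struct sockaddr *)&addr, sizeof(addr)) < 0)
			perror("sendto");
	}
	close(s);
	return 0;
}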
I call the tool with "wlantest -c 11 -m 00:02:6f:47:f1:0a" and then associate a second box as a client to my AP on channel 11. One of my observations: some brands of clients do not seem to trigger this situation, but with all Atheros or Intel cards/drivers I can reproduce it 100% of the time. While my tool triggers the bug in an odd situation (sending to a MAC address that isn't yet known to the system), I have seen the exact same problem in the wild, just not as easily reproducible.

I am also getting the "no rates" problem in another setup with a regular client connecting to the AP with WEP in "open" mode, but I am not done debugging that yet. All I know is that it is the same "no rates" error that would later cause the oops. In that setup I am not doing anything odd like sending directly to a not-yet-associated MAC; it is just a regular laptop trying to surf the web.
I would love to help debug this further, but I am out of ideas on where to dig right now.
joerg