
Ticket #2451 (new defect)

Opened 11 years ago

Last modified 11 years ago

[patch] Random madwifi AP crashes

Reported by: Przemek Bruski
Assigned to:
Priority: major
Milestone:
Component: madwifi: driver
Version: trunk
Keywords:
Cc:
Patch is attached: 1
Pending: 0

Description

Ever since I bought a new netbook, I have been experiencing crashes on my AP, which used to be quite stable. I've tracked the problem down to bus_unmap_single being called while bf->bf_skb is NULL. There are two cases where an encapsulation failure may leave bf_skb NULL. Additionally, in both cases the unencapsulated skb is lost, which would probably lead to a memory leak once the "unmap on NULL" problem is fixed. The attached patch fixed the issue for me, and my AP is rock solid again.
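The failure mode described above can be sketched outside the kernel. The block below is only a toy model: `toy_skb`, `toy_buf`, `toy_encap`, `encap_overwrite`, `encap_checked` and `toy_release` are hypothetical names, not madwifi or kernel symbols, and it assumes (as the description implies) that a failed encapsulation leaves the original buffer allocated and owned by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for sk_buff and ath_buf; these are NOT the kernel types. */
struct toy_skb { size_t len; };
struct toy_buf { struct toy_skb *skb; };

/* Hypothetical encapsulation step. Returns a new buffer, or NULL on failure.
 * Assumption: on failure the input is untouched and still owned by the caller. */
struct toy_skb *toy_encap(struct toy_skb *in, int fail)
{
    struct toy_skb *out;

    assert(in != NULL);
    if (fail)
        return NULL;
    out = malloc(sizeof(*out));
    if (out == NULL)
        return NULL;
    out->len = in->len + 24;    /* pretend a header was prepended */
    free(in);                   /* success: the input is consumed */
    return out;
}

/* Pattern before the patch: the result overwrites the only reference,
 * so a failure leaves bf->skb NULL and the original buffer unreachable. */
int encap_overwrite(struct toy_buf *bf, int fail)
{
    bf->skb = toy_encap(bf->skb, fail);
    return bf->skb != NULL ? 0 : -1;
}

/* Pattern after the patch: keep the result in a temporary and only
 * commit it to bf->skb once it is known to be non-NULL. */
int encap_checked(struct toy_buf *bf, int fail)
{
    struct toy_skb *encap = toy_encap(bf->skb, fail);

    if (encap == NULL)
        return -1;              /* bf->skb still points at the original */
    bf->skb = encap;
    return 0;
}

/* Cleanup guarded like the last hunk of the patch: skb->len is only
 * read when skb is non-NULL. Returns the length that was released. */
size_t toy_release(struct toy_buf *bf)
{
    size_t released = 0;

    if (bf->skb != NULL) {
        released = bf->skb->len;
        free(bf->skb);
        bf->skb = NULL;
    }
    return released;
}
```

With `encap_overwrite`, a failure leaves the buffer holder pointing at NULL, so any later cleanup that reads `skb->len` without a guard would dereference NULL. With `encap_checked`, the holder still owns the original buffer and cleanup can free it normally.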

Attachments

patch.diff (2.4 kB) - added by Przemek Bruski on 03/13/11 14:18:29.

Change History

03/13/11 14:18:29 changed by Przemek Bruski

  • attachment patch.diff added.

03/13/11 14:20:16 changed by Przemyslaw Bruski

Signed-off-by: Przemyslaw Bruski <pbruskispam at op.pl>

(follow-up: ↓ 3 ) 03/14/11 16:16:18 changed by virnik@gmail.com

Which distribution is panicking this way? I am using a week-old trunk of madwifi on my wireless router, running Ubuntu Server 10.10, and my system freezes (kernel panic) randomly from time to time. I can say for sure that it freezes after between 1 and 24 hours of uptime.

That is why I am asking. I will try to apply your patch and then post back.

(in reply to: ↑ 2 ; follow-up: ↓ 4 ) 03/14/11 20:31:12 changed by anonymous

Replying to virnik@gmail.com:

Which distribution is panicking this way? I am using a week-old trunk of madwifi on my wireless router, running Ubuntu Server 10.10, and my system freezes (kernel panic) randomly from time to time. I can say for sure that it freezes after between 1 and 24 hours of uptime. That is why I am asking. I will try to apply your patch and then post back.

It can happen on any distribution, but answering your question: I had problems on current and previous Debian. It started happening very often once I started using a Broadcom-based netbook.

(in reply to: ↑ 3 ; follow-up: ↓ 5 ) 03/14/11 22:20:00 changed by virnik@gmail.com

Replying to anonymous:

It can happen on any distribution, but answering your question: I had problems on current and previous Debian. It started happening very often once I started using a Broadcom-based netbook.

It is very strange that you can actually use it with Debian. I am using Ubuntu, which is still very close to Debian, and in my case the driver loads but does not work (connected, but no data gets through). I have tested your patch, and it does improve stability, but I can't apply it to madwifi-0.9.4-r4136-20110203.tar.gz, which works fine except for the kernel panics.

(in reply to: ↑ 4 ; follow-up: ↓ 6 ) 03/14/11 23:00:11 changed by Przemek Bruski

Replying to virnik@gmail.com:

Replying to anonymous:

It can happen on any distribution, but answering your question: I had problems on current and previous Debian. It started happening very often once I started using a Broadcom-based netbook.

It is very strange that you can actually use it with Debian. I am using Ubuntu, which is still very close to Debian, and in my case the driver loads but does not work (connected, but no data gets through). I have tested your patch, and it does improve stability, but I can't apply it to madwifi-0.9.4-r4136-20110203.tar.gz, which works fine except for the kernel panics.

Well, I've been using madwifi trunk for years on Debian in AP mode and on Ubuntu in managed mode, and it has always worked fine on both. Are you using Network Manager or a manual config? Maybe the version you have is patched and has different kernel requirements or device names?

(in reply to: ↑ 5 ; follow-up: ↓ 7 ) 03/15/11 10:43:51 changed by virnik@gmail.com

Replying to Przemek Bruski:

Well, I've been using madwifi trunk for years on Debian in AP mode and on Ubuntu in managed mode, and it has always worked fine on both. Are you using Network Manager or a manual config? Maybe the version you have is patched and has different kernel requirements or device names?

No, I am using standard Ubuntu Server 10.10, i386. The router has two CM9 Atheros AR5001X+ cards, one in AP mode and the second in STA mode. The AP mode card runs 802.11b; the STA mode card runs 802.11a.

I am running the 2.6.35-27-generic kernel, i386. The problem is that madwifi compiled from the madwifi-0.9.4-r4136-20110203.tar.gz package works, but it crashes from time to time (let's say about one crash per day). Apart from the crashes, it works excellently.

The current madwifi trunk downloaded via svn/git compiles easily, your patch applies cleanly, and the driver itself loads fine after compilation. There are no errors in dmesg or the kernel log. But connections stall, which means that I am associated, but no data gets through.

(in reply to: ↑ 6 ; follow-up: ↓ 8 ) 03/16/11 23:20:20 changed by Przemek Bruski

Replying to virnik@gmail.com:

But connections stall, which means that I am associated, but no data gets through.

But that happens on vanilla trunk as well, right?

(in reply to: ↑ 7 ) 03/17/11 10:14:29 changed by virnik@gmail.com

Replying to Przemek Bruski:

But that happens on vanilla trunk as well, right?

Yes, right. That is not your fault or the fault of your patch. I am asking because I am curious why it works for you while I have such problems. Or do you have an older trunk? Can you send it to me? Or can you please post the revision hash here, so I know which revision to download?

03/17/11 11:44:12 changed by virnik@gmail.com

Anyway, can you please modify your patch so that it applies to the current 0.9.4 stable version? I mean revision 4136.

I have done some searching, and in most cases the problem arises from IRQ sharing. For example, when one IRQ is shared between USB and wifi0, there is heavy traffic on the wifi0 interface (ath0 iface), and an external HDD is connected via USB at the same time, the system crashes. It crashes randomly from time to time when that IRQ is used, so for now I have had to disable USB support in the kernel.
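One way to check whether two devices share an interrupt line on Linux is to look for both handler names on the same row of /proc/interrupts. The sketch below is a rough illustration only: the function names are hypothetical, the match is a plain substring search, and the handler names a caller passes in (for example `wifi0` and a USB host controller name) depend on the system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if both handler names occur on the same line of text
 * formatted like a row of /proc/interrupts, 0 otherwise.
 * This is a plain substring match, so "wifi0" would also match "wifi01". */
int irq_line_shared(const char *line, const char *a, const char *b)
{
    assert(line != NULL && a != NULL && b != NULL);
    return strstr(line, a) != NULL && strstr(line, b) != NULL;
}

/* Scans a stream formatted like /proc/interrupts and returns the number
 * of lines on which both handlers appear. Lines longer than the buffer
 * (possible on machines with many CPUs) are read in pieces, so this
 * sketch may miss matches there. */
int count_shared_irqs(FILE *fp, const char *a, const char *b)
{
    char line[512];
    int shared = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (irq_line_shared(line, a, b))
            shared++;
    }
    return shared;
}
```

A caller would open /proc/interrupts with `fopen` and pass the stream to `count_shared_irqs`; a non-zero result means the two handlers sit on the same IRQ at least once.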

(follow-up: ↓ 11 ) 03/17/11 14:31:07 changed by virnik@gmail.com

So, back to our stability problem: I have downloaded madwifi-hal-0.10.5.6-r4126-20100324.tar.gz, which can't be compiled on new distros (in my case, Ubuntu 10.10) due to kernel header changes. Then I applied this patch:

--- ath/if_ath.c        2010-01-18 15:21:22.000000000 +0100
+++ ath/if_ath.c        2011-03-17 13:59:52.000000000 +0100
@@ -4257,7 +4257,7 @@
 {
        struct ieee80211com *ic = &sc->sc_ic;
        struct ieee80211vap *vap;
-       struct dev_mc_list *mc;
+       struct netdev_hw_addr *ha;
        u_int32_t val;
        u_int8_t pos;
 
@@ -4265,11 +4265,11 @@
        /* XXX locking */
        TAILQ_FOREACH(vap, &ic->ic_vaps, iv_next) {
                struct net_device *dev = vap->iv_dev;
-               for (mc = dev->mc_list; mc; mc = mc->next) {
+               netdev_for_each_mc_addr(ha, dev) {
                        /* calculate XOR of eight 6-bit values */
-                       val = LE_READ_4(mc->dmi_addr + 0);
+                       val = LE_READ_4(ha->addr + 0);
                        pos = (val >> 18) ^ (val >> 12) ^ (val >> 6) ^ val;
-                       val = LE_READ_4(mc->dmi_addr + 3);
+                       val = LE_READ_4(ha->addr + 3);
                        pos ^= (val >> 18) ^ (val >> 12) ^ (val >> 6) ^ val;
                        pos &= 0x3f;
                        mfilt[pos / 32] |= (1 << (pos % 32));

to make it compile again, and afterwards, your patch:

--- ath/if_ath.c        (revision 4136)
+++ ath/if_ath.c        (working copy)
@@ -3277,6 +3277,7 @@
        int (*ath_ff_flushdonetest)(struct ath_txq *txq, struct ath_buf *bf))
 {
        struct ath_buf *bf_ff = NULL;
+       struct sk_buff *bf_skb_encap = NULL;
        unsigned int pktlen;
        int framecnt;
 
@@ -3297,14 +3298,15 @@
                ATH_TXQ_UNLOCK_IRQ(txq);
 
                /* encap and xmit */
-               bf_ff->bf_skb = ieee80211_encap(ATH_BUF_NI(bf_ff), bf_ff->bf_skb, 
+               bf_skb_encap = ieee80211_encap(ATH_BUF_NI(bf_ff), bf_ff->bf_skb, 
                                &framecnt);
-               if (bf_ff->bf_skb == NULL) {
+               if (bf_skb_encap == NULL) {
                        DPRINTF(sc, ATH_DEBUG_XMIT | ATH_DEBUG_FF,
                                "Dropping; encapsulation failure\n");
                        sc->sc_stats.ast_tx_encap++;
                        goto bad;
                }
+               bf_ff->bf_skb = bf_skb_encap;
                pktlen = bf_ff->bf_skb->len;    /* NB: don't reference skb below */
                if (ath_tx_start(sc->sc_dev, ATH_BUF_NI(bf_ff), bf_ff, 
                                        bf_ff->bf_skb, 0) == 0)
@@ -3475,6 +3477,7 @@
        struct sk_buff *original_skb  = __skb; /* ALWAYS FREE THIS ONE!!! */
        struct ath_node *an;
        struct sk_buff *skb = NULL;
+       struct sk_buff *bf_skb_encap = NULL;
        /* We will use the requeue flag to denote when to stuff a skb back into
         * the OS queues.  This should NOT be done under low memory conditions,
         * such as skb allocation failure.  However, it should be done for the
@@ -3655,14 +3658,15 @@
                        ATH_TXQ_UNLOCK_IRQ_EARLY(txq);
 
                        /* Encap. and transmit */
-                       bf_ff->bf_skb = ieee80211_encap(ni, bf_ff->bf_skb, 
+                       bf_skb_encap = ieee80211_encap(ni, bf_ff->bf_skb, 
                                        &framecnt);
-                       if (bf_ff->bf_skb == NULL) {
+                       if (bf_skb_encap == NULL) {
                                DPRINTF(sc, ATH_DEBUG_XMIT,
                                        "Dropping; fast-frame flush encap. "
                                        "failure\n");
                                sc->sc_stats.ast_tx_encap++;
                        } else {
+                               bf_ff->bf_skb = bf_skb_encap;
                                pktlen = bf_ff->bf_skb->len;    /* NB: don't reference skb below */
                                if (!ath_tx_start(dev, ni, bf_ff, 
                                                        bf_ff->bf_skb, 0))
@@ -12471,12 +12475,14 @@
                return bf;
 
        if (bf->bf_skbaddr) {
-               bus_unmap_single(
-                       sc->sc_bdev,
-                       bf->bf_skbaddr, 
-                       (direction == BUS_DMA_FROMDEVICE ? 
-                               sc->sc_rxbufsize : bf->bf_skb->len),
-                       direction);
+               if (bf->bf_skb) {
+                       bus_unmap_single(
+                               sc->sc_bdev,
+                               bf->bf_skbaddr, 
+                               (direction == BUS_DMA_FROMDEVICE ? 
+                                sc->sc_rxbufsize : bf->bf_skb->len),
+                               direction);
+               }
                bf->bf_skbaddr = 0;
                bf->bf_desc->ds_link = 0;
                bf->bf_desc->ds_data = 0;

Now it seems to be rock solid. Thank you very much.

Anyway, can you please tell me whether I am right in saying that the madwifi-hal branch is newer than 0.9.4?

(in reply to: ↑ 10 ) 03/17/11 22:26:27 changed by Przemek Bruski

I think the hal* branches contain more features, but 0.9.4 was kept updated to compile against recent kernels, so yes and no.

(follow-up: ↓ 13 ) 03/18/11 00:16:48 changed by virnik@gmail.com

Thanks for your reply. I knew I was right... Let's say the HAL version should be more tuned, but nobody bothered to update it for new kernels. 0.9.4 is the version where developers compete over who has the longer penis, so it is a little bit unstable :-)

I am still curious why the latest trunk runs with your Broadcom chipset configuration and not with my pure Atheros CM9 - AR5001X+ combo.

(in reply to: ↑ 12 ; follow-up: ↓ 14 ) 03/18/11 01:24:11 changed by Przemek Bruski

Replying to virnik@gmail.com:

HAL branches were temporary merge branches for merging firmware blobs from Atheros. 0.9.4 is the stable branch. Trunk is where the latest and greatest version resides.

I am still curious why the latest trunk runs with your Broadcom chipset configuration and not with my pure Atheros CM9 - AR5001X+ combo.

I typically run my AP with two VAPs (Atheros) + one client (also Atheros). Broadcom-based client (and other clients) are rarely used, so it seems that the main difference is that you have two NICs in your AP.

(in reply to: ↑ 13 ) 03/18/11 08:30:55 changed by virnik@gmail.com

Replying to Przemek Bruski:

Replying to virnik@gmail.com: HAL branches were temporary merge branches for merging firmware blobs from Atheros. 0.9.4 is the stable branch. Trunk is where the latest and greatest version resides.

I am still curious why the latest trunk runs with your Broadcom chipset configuration and not with my pure Atheros CM9 - AR5001X+ combo.

I typically run my AP with two VAPs (Atheros) + one client (also Atheros). Broadcom-based client (and other clients) are rarely used, so it seems that the main difference is that you have two NICs in your AP.

OK, thanks. Now I understand. Sadly, trunk does not work for me: a STA running trunk can't associate with another AP (also Atheros, but running Mikrotik), and an AP running trunk transmits, but no one is able to connect. It seems unlikely that such a minor difference as the number of NICs in my system could do any harm, but so be it. 0.9.4 works fine, but crashes my AP very often. I can't even collect a kernel dump, because the system just freezes (no stack trace, no kernel oops, no kernel panic error message). On the other hand, MadWifi HAL version 0.10.5.6 can't be compiled against a new kernel without the patch I have provided (not 100% my work), but after applying your patch as well, it works flawlessly.