Please note: This project is no longer active. The website is kept online for historic purposes only.
If you're looking for a Linux driver for your Atheros WLAN device, you should continue here.

Ticket #907 (closed task: fixed)

Opened 13 years ago

Last modified 10 years ago

Node locking and management hash-up

Reported by: mentor Assigned to: mentor
Priority: critical Milestone: version 0.9.5
Component: madwifi: 802.11 stack Version:
Keywords: locking node Cc:
Patch is attached: 1 Pending:

Description

Node reference counting looks broken due to the confusion between node references and node memory allocation/deallocation. Also, we seem to be holding some faux references to maintain the nodes' presence when authenticated and/or associated.

I have added a spin lock to each individual node to protect its reference count. Each node table also has a separate lock. I have also rehashed the internal node allocation and external node join/leave functions. Thus the node AREF has been removed.
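As a rough illustration of the per-node lock described above, here is a minimal userspace sketch. It is not the patch itself: the `sketch_*` names are hypothetical, and a C11 `atomic_flag` busy-wait lock stands in for the kernel spin lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a kernel spin lock. */
struct sketch_lock {
	atomic_flag held;
};

static void sketch_lock_acquire(struct sketch_lock *l)
{
	/* Spin until the previous holder clears the flag. */
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;
}

static void sketch_lock_release(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical node: the lock guards only the reference count. */
struct sketch_node {
	struct sketch_lock lock;
	int refcnt;
};

static void sketch_node_init(struct sketch_node *ni)
{
	atomic_flag_clear(&ni->lock.held);
	ni->refcnt = 1;		/* the creator holds the first reference */
}

/* Take a reference; returns the new count. */
static int sketch_node_ref(struct sketch_node *ni)
{
	int n;

	sketch_lock_acquire(&ni->lock);
	n = ++ni->refcnt;
	sketch_lock_release(&ni->lock);
	return n;
}

/* Drop a reference; returns the new count (0 means the last one is gone). */
static int sketch_node_unref(struct sketch_node *ni)
{
	int n;

	sketch_lock_acquire(&ni->lock);
	assert(ni->refcnt > 0);
	n = --ni->refcnt;
	sketch_lock_release(&ni->lock);
	return n;
}
```

The node table would carry its own lock of the same kind; that part is left out of the sketch.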

This update significantly changes the interface of ieee80211_free_node() -> ieee80211_unref_node(), and I will probably propose some other general API changes when passing references around to clarify things, if this patch's approach is condoned.

This is not a clean patch as I've buggered up some comments, and made some minor code changes that should not be in the patch. No prizes for spotting where.

Attachments

madwifi-refcnt.diff (35.7 kB) - added by mentor on 09/21/06 04:13:26.
madwifi-refcnt.2.diff (52.4 kB) - added by mentor on 09/21/06 04:13:52.
madwifi-refcnt.4.diff (62.1 kB) - added by mentor on 09/26/06 01:41:51.
madwifi-refcnt.5.diff (62.3 kB) - added by mentor on 09/26/06 03:04:06.
madwifi-parent.6.diff (5.5 kB) - added by mentor on 01/21/07 23:09:22.
madwifi-refcnt.6.diff (63.5 kB) - added by mentor on 01/21/07 23:16:45.
madwifi-refcnt.7.diff (64.2 kB) - added by mentor on 01/22/07 12:49:24.
madwifi-refcnt.8.diff (64.6 kB) - added by mentor on 01/22/07 20:11:19.
taylor_crash.txt (11.8 kB) - added by mike.taylor@apprion.com on 01/23/07 00:57:48.
dmesg output after kernel bug + SIGSEGV using wlanconfig ath create
preempt-has-no-spinlocks-here.diff (2.8 kB) - added by Mark Glines <mark@glines.org> on 01/23/07 01:46:38.
disable spinlock checks on PREEMPT && !SMP
preempt-has-no-spinlocks-here2.diff (3.0 kB) - added by Mark Glines <mark@glines.org> on 01/23/07 19:16:13.
same diff, signed off properly this time
madwifi-0.9.2-jpo.diff (3.8 kB) - added by anonymous on 01/25/07 09:49:46.
Patch to change associated_sta file to madwifi-old format
refcnt (12.8 kB) - added by anonymous on 01/29/07 11:46:45.
leak trace on 0.9.2 compressed with bzip2 (site refuses .bz2 extension)
madwifi-refcnt.9.diff (78.0 kB) - added by mentor on 01/31/07 20:13:52.
madwifi-refcnt-push-r2088.diff (18.7 kB) - added by georg@boerde.de on 02/07/07 13:42:43.
cumulative diff to merge changes in HEAD up to r2088 into madwifi-ng-refcount
madwifi-refcnt-push-r2115.diff (5.2 kB) - added by mike.taylor@apprion.com on 02/14/07 21:12:10.
cumulative diff to merge changes in HEAD up to r2115 into madwifi-ng-refcount
madwifi-refcnt.10.diff (85.1 kB) - added by mike.taylor@apprion.com on 02/14/07 21:14:37.
This patch will upgrade a build from the trunk r2115 with all changes from the madwifi-ng-refcount branch, including the rollup of changes to 2115.
madwifi-refcnt.11.diff (91.2 kB) - added by mike.taylor@apprion.com on 02/14/07 21:40:20.
Same as 10, but adds ieee80211_debug.h which was missing in the previous patch.
madwifi-refcount-ng-debug-dependency-cycle.diff (3.0 kB) - added by mike.taylor@apprion.com on 04/04/07 23:56:44.
Signed patch describing/resolving cycle in ieee 80211 debugging code

Change History

09/21/06 04:13:26 changed by mentor

  • attachment madwifi-refcnt.diff added.

09/21/06 04:13:52 changed by mentor

  • attachment madwifi-refcnt.2.diff added.

09/26/06 01:41:51 changed by mentor

  • attachment madwifi-refcnt.4.diff added.

09/26/06 03:04:06 changed by mentor

  • attachment madwifi-refcnt.5.diff added.

09/26/06 03:04:51 changed by mentor

Last one should fix KASSERT failure in node_set_chan.

09/26/06 03:04:59 changed by mentor

  • status changed from new to assigned.

09/28/06 23:42:50 changed by dyqith

Some comments:

1. Looks like _ieee80211_free_node() isn't used anymore.

2. How are inactivity timeouts being changed here? (The assumption is that when the timeout hits, only one ref cnt is active for the node. Is this true?)

3. Should an assert be added to node_free() to make sure refcnts are 0?

that's all the comments I have for now, if you have anything in particular I should look over, please let me know.

I haven't "run" the code yet. --dyqith

10/08/06 18:45:41 changed by mentor

1. It's called from _ieee80211_unref_node if the reference count drops to zero.

2. I haven't changed the inactivity timeout; I don't understand what you are saying. Although one thing that may be an issue is that the inactivity timeout now operates on unassociated but authenticated nodes?

3. Added.
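Points 1 and 3 can be pictured with a small userspace sketch, assuming hypothetical `sketch_*` names and leaving the locking out for brevity: the free routine is reached only from the unref path, and it asserts that no references remain.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_node {
	int refcnt;
};

static int sketch_nodes_freed;	/* bookkeeping for this example only */

static struct sketch_node *sketch_node_alloc(void)
{
	struct sketch_node *ni = calloc(1, sizeof(*ni));

	if (ni != NULL)
		ni->refcnt = 1;	/* reference owned by the caller */
	return ni;
}

/* Only reached from sketch_node_unref(); a non-zero count here is a bug. */
static void sketch_node_free(struct sketch_node *ni)
{
	assert(ni->refcnt == 0);
	free(ni);
	sketch_nodes_freed++;
}

static void sketch_node_unref(struct sketch_node *ni)
{
	assert(ni->refcnt > 0);
	if (--ni->refcnt == 0)
		sketch_node_free(ni);
}
```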

10/23/06 04:54:43 changed by mrenzmann

Loosely related to the topic of this ticket: a user reported a missing unlock in the 802.11 stack.

01/16/07 17:00:27 changed by rozteck@interia.pl

I found that this patch (with some modifications) solves crashes caused by incorrect pointers in many places of the code. It seems to me that the patch should be updated to the latest revision and applied to the trunk.

01/18/07 21:02:54 changed by Mark Glines <mark@glines.org>

Hi,

Indeed, the patch seems to help a lot.

I have a couple of comments about the inline ieee80211_unref_node function introduced in madwifi-refcnt.5.diff:

ieee80211_unref_node(struct ieee80211_node **pni)
#endif
{
        struct ieee80211_node *ni = *pni;
#ifdef IEEE80211_DEBUG_REFCNT
        IEEE80211_DPRINTF(ni->ni_vap, IEEE80211_MSG_NODE,
                "%s (%s:%u) %p<%s> refcnt %d\n", __func__, func, line, ni,
                 ether_sprintf(ni->ni_macaddr), ieee80211_node_refcnt(ni) - 1);
#endif
        IEEE80211_NODE_LOCK_IRQ(ni);
        _ieee80211_unref_node(ni);
        IEEE80211_NODE_UNLOCK_IRQ(ni);

If "ni" has been freed, then IEEE80211_NODE_UNLOCK_IRQ will fail to read ni->ni_nodelock (a spinlock it tries to unlock), and will segfault. On non-SMP, spinlocks are noop'd out, so it seems to work there, but I think it will definitely crash on SMP.

        pni = NULL;                     /* guard against use */

I think you mean "*pni", not "pni"... the above line will just unset the local variable, not the caller's pointer.
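The difference between the two assignments can be shown in isolation. This is a minimal sketch with hypothetical `sketch_*` names; only the pointer handling is modelled, not the locking or freeing.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_node {
	int refcnt;
};

/* Form quoted above: only the callee's copy of the argument is cleared. */
static void sketch_unref_local_only(struct sketch_node **pni)
{
	assert(pni != NULL && *pni != NULL);
	(*pni)->refcnt--;
	pni = NULL;		/* has no effect outside this function */
	(void)pni;
}

/* Suggested form: the caller's pointer itself is cleared. */
static void sketch_unref_clears_caller(struct sketch_node **pni)
{
	assert(pni != NULL && *pni != NULL);
	(*pni)->refcnt--;
	*pni = NULL;		/* guard against use by the caller */
}
```

After the first call the caller still holds a usable (and possibly stale) pointer; after the second call any later dereference fails immediately on NULL.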

01/21/07 00:06:13 changed by mentor

Thanks.

I'm sure I've fiddled around with that latter part repeatedly.

As to the former, I've just noticed that the destruction/freeing of the node lock is completely fucked anyway. I will fix that, and then apply.
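One common way to avoid touching a lock that lives inside freed memory is to decide under the lock whether this was the last reference, release the lock, and only then free. The following is a userspace sketch of that ordering, not the fix that was committed; the `sketch_*` names are hypothetical and an `atomic_flag` stands in for the node lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct sketch_node {
	atomic_flag lock;	/* stand-in for the per-node lock */
	int refcnt;
};

static struct sketch_node *sketch_node_alloc(void)
{
	struct sketch_node *ni = malloc(sizeof(*ni));

	if (ni != NULL) {
		atomic_flag_clear(&ni->lock);
		ni->refcnt = 1;
	}
	return ni;
}

/* Returns 1 if the node was freed, 0 if other references remain. */
static int sketch_node_unref(struct sketch_node *ni)
{
	int last;

	while (atomic_flag_test_and_set(&ni->lock))
		;		/* spin until the lock is ours */
	assert(ni->refcnt > 0);
	last = (--ni->refcnt == 0);
	atomic_flag_clear(&ni->lock);	/* last access to the lock */

	if (last)
		free(ni);	/* nothing may read *ni after this point */
	return last;
}
```

The sketch relies on the usual refcounting rule that nobody can take a new reference once the count has reached zero.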

01/21/07 09:06:55 changed by rozteck@interia.pl

I'm wondering about that piece of code from ieee80211_sta_join function:

        ni = ieee80211_find_node(&ic->ic_sta, vap->iv_myaddr);
        if (ni == NULL) {
                ni = ieee80211_alloc_node_table(vap, vap->iv_myaddr);
                if (ni == NULL) {
                        /* XXX recovery? */
                        return;
                }
        }
        else
                ieee80211_unref_node(&ni);
	/*
	 * Expand scan state into node's format.
	 * XXX may not need all this stuff
	 */
	ni->ni_authmode = vap->iv_bss->ni_authmode;
      [...]

If the node ni for the station is found in the table, it will be freed (when refcount == 1) and then used a few lines below. To me it looks like that should segfault in this situation. Was this intentional? Maybe after ni is freed it should be allocated again? Or maybe it should be cleaned up rather than removed? This is complicated for me - could you explain it to me?

01/21/07 20:00:27 changed by mentor

Yeah, that does appear to be wrong; fixed. (Did I introduce that error? Was I smoking crack at the time?) Please point out anything else you can spot. Quite a lot of the things in this file seem to be insane to me.

01/21/07 23:09:22 changed by mentor

  • attachment madwifi-parent.6.diff added.

01/21/07 23:16:45 changed by mentor

  • attachment madwifi-refcnt.6.diff added.

01/21/07 23:17:58 changed by mentor

Right. I think madwifi-refcnt.6.diff is pretty much my final answer.

01/22/07 06:52:43 changed by Mark Glines <mark@glines.org>

Ok, great! Hey, after fixing the *pni thing (see previous comment), I noticed another issue...

After changing "pni = NULL" to "*pni = NULL", as per my previous comment, I noticed a crash in ieee80211_sta_join(). Please look at the following code:

         ni = ieee80211_find_node(&ic->ic_sta, se->se_macaddr);
         if (ni == NULL) {
                 ni = ieee80211_alloc_node_table(vap, se->se_macaddr);
                 if (ni == NULL) {
                         /* XXX msg */
                         return 0;
                 }
         } else
                 ieee80211_unref_node(&ni);

         /*
          * Expand scan state into node's format.
          * XXX may not need all this stuff
          */
         ni->ni_authmode = vap->iv_bss->ni_authmode;             /* inherit authmode from iv_bss */

So, uh, unref the node (which now sets the pointer to NULL as a side effect, hence my crash), and then access ni->ni_authmode?

...and at the bottom of the function...

         IEEE80211_DPRINTF(vap, IEEE80211_MSG_NODE,
                 "%s: %p<%s> refcnt %d\n", __func__, ni, ether_sprintf(ni->ni_macaddr),
                 ieee80211_node_refcnt(ni) + 1);

         return ieee80211_sta_join1(ieee80211_ref_node(ni));

So if I understand this right... it manipulates the node without holding a reference, flat-out lies about the reference count in the debug log, and then finally bumps the refcount as it passes control off to ieee80211_sta_join1(). Am I reading it right? This seems insane to me.

The semantics of this function seem very similar to what happens in ieee80211_create_ibss(). In that function, your patch gets rid of the broken ieee80211_free_node() entirely... here your patch replaces it with a (imho) broken ieee80211_unref_node. Was this on purpose? Is there any reason to free and reallocate the node here, or is it ok to just reuse it? I've patched it up temporarily to just use the existing node (and hold the extra reference throughout the life of the function, rather than only bumping it at the end), and that seemed to solve the crashes for me...
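The temporary workaround described above (reuse the node that the lookup returned and keep its reference for the whole function) might look roughly like this. It is a userspace sketch under assumptions: the one-slot table, the `sketch_*` names, and the rule that allocation returns one reference for the table plus one for the caller are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_node {
	int in_table;
	int refcnt;
	int authmode;
};

static struct sketch_node sketch_slot;	/* hypothetical one-entry node table */
static struct sketch_node *sketch_bss;	/* reference held by the "vap" */

/* Lookup returns the node with an extra reference, or NULL. */
static struct sketch_node *sketch_find_node(void)
{
	if (!sketch_slot.in_table)
		return NULL;
	sketch_slot.refcnt++;
	return &sketch_slot;
}

/* Allocation: one reference for the table, one for the caller. */
static struct sketch_node *sketch_alloc_node(void)
{
	sketch_slot.in_table = 1;
	sketch_slot.refcnt = 2;
	return &sketch_slot;
}

/* Consumes the caller's reference by storing it as the new bss. */
static int sketch_join1(struct sketch_node *ni)
{
	assert(ni != NULL);
	if (sketch_bss != NULL)
		sketch_bss->refcnt--;	/* drop the reference to the old bss */
	sketch_bss = ni;
	return 1;
}

static int sketch_sta_join(int authmode)
{
	struct sketch_node *ni = sketch_find_node();

	if (ni == NULL)
		ni = sketch_alloc_node();
	/* The reference is still held here, so the node cannot go away. */
	ni->authmode = authmode;
	return sketch_join1(ni);	/* hand the reference over */
}
```

With this shape there is no window in which the node is modified without a reference, and no extra bump is needed at the hand-off.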

01/22/07 09:19:07 changed by rozteck@interia.pl

I've got another question. In ieee80211_remove_wds_addr (and in similar functions) you're doing:

LIST_REMOVE(wds, wds_hash); 
FREE(wds, M_80211_WDS); 
_ieee80211_unref_node(wds->wds_ni);

As I understand it, you first free wds and then try to unref a part of it, wds->wds_ni. This should (and does) crash. Maybe it is better to change the order to:

LIST_REMOVE(wds, wds_hash); 
_ieee80211_unref_node(wds->wds_ni);
FREE(wds, M_80211_WDS); 

What do you think about that?
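The proposed ordering can be checked in a small userspace sketch. The `sketch_*` names are hypothetical, and plain `free()` stands in for the FREE macro.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_node {
	int refcnt;
};

struct sketch_wds {
	struct sketch_node *wds_ni;	/* reference owned by the WDS entry */
};

static void sketch_unref_node(struct sketch_node *ni)
{
	assert(ni->refcnt > 0);
	ni->refcnt--;
}

/* Drop the entry's node reference first, then free the entry itself. */
static void sketch_remove_wds(struct sketch_wds *wds)
{
	sketch_unref_node(wds->wds_ni);	/* wds is still valid memory here */
	free(wds);
}
```

Only the WDS entry is released; the node survives for as long as other references to it exist.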

Another question: why not add the statement ni = NULL; after the line FREE(ni, M_80211_NODE); in the node_free function?

01/22/07 09:21:54 changed by rozteck@interia.pl

And what if _ieee80211_unref_node(wds->wds_ni) does not remove the node wds->wds_ni in ieee80211_remove_wds_addr because the refcount > 1? Maybe in such situation we shouldn't free the node? I think we shouldn't assume that the node will be removed by _ieee80211_unref_node(wds->wds_ni) but I may be wrong...

01/22/07 12:47:40 changed by mentor

Yes, you're right, and as I've just changed some of the locking, I don't even think we need to mess around with calling the back handler.

Adding ni = NULL to free_node won't help because it is a local variable, and the function only ever gets called via ieee80211_node_unref, which captures the pointer anyway.

In this case we're not freeing the node, but the WDS address entry; this is owned by those functions, so it's valid. The node itself will get cleared up when all its references disappear.

01/22/07 12:49:24 changed by mentor

  • attachment madwifi-refcnt.7.diff added.

01/22/07 16:45:06 changed by rozteck@interia.pl

Everything works ok but... The *pni = NULL assignment in ieee80211_unref_node is wrong - it makes the wlanconfig athX list command return an empty station list when stations are connected. When switched to pni = NULL everything is ok. Btw. the patch seems to work perfectly for me.

01/22/07 20:11:19 changed by mentor

  • attachment madwifi-refcnt.8.diff added.

01/22/07 20:13:20 changed by mentor

It appears that is because I fucked up iterate_nodes. Could you try the above? It doesn't help to have no test machine.

01/22/07 20:52:20 changed by rozteck@interia.pl

Quick comment - in ieee80211_iterate_nodes in TAILQ_FOREACH loop you should put IEEE80211_NODE_TABLE_UNLOCK_IRQ_EARLY(nt) instead of IEEE80211_NODE_TABLE_UNLOCK_IRQ(nt).
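Why an "early" variant matters can be illustrated under one assumption that the ticket does not show: that the lock macro opens a block and the matching unlock macro closes it. In that case the normal unlock cannot be used inside a loop body, and a variant that only releases the lock is needed. The `SKETCH_*` macros below are hypothetical stand-ins, not the driver's definitions.

```c
#include <assert.h>

static int sketch_lock_depth;	/* stand-in for the table lock state */

/* Assumed shape: LOCK opens a block, UNLOCK releases and closes it. */
#define SKETCH_TABLE_LOCK() \
	do { assert(sketch_lock_depth == 0); sketch_lock_depth++
#define SKETCH_TABLE_UNLOCK() \
	sketch_lock_depth--; } while (0)
/* EARLY releases the lock but leaves the block open. */
#define SKETCH_TABLE_UNLOCK_EARLY() \
	sketch_lock_depth--

/* Returns the index of wanted in vals, or -1; never exits with the lock held. */
static int sketch_find_index(const int *vals, int n, int wanted)
{
	int i;

	SKETCH_TABLE_LOCK();
	for (i = 0; i < n; i++) {
		if (vals[i] == wanted) {
			/* Leaving from inside the loop: release only. */
			SKETCH_TABLE_UNLOCK_EARLY();
			return i;
		}
	}
	SKETCH_TABLE_UNLOCK();
	return -1;
}
```

Using `SKETCH_TABLE_UNLOCK()` inside the loop would leave the braces unbalanced and fail to compile, which is the kind of mistake the comment above points at.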

01/23/07 00:09:16 changed by mentor

*cries*

01/23/07 00:56:31 changed by mike.taylor@apprion.com

FYI - I switched to the trunk (r1994) so that I could try this patch. Tested that revision and things were about the same.

I installed the patch (revision 8 with the fix mentioned by rozteck). The result was SIGSEGV from wlanconfig during the vap creation. I went back to rev 7 and got the same results.

See taylor_crash.txt.

01/23/07 00:57:48 changed by mike.taylor@apprion.com

  • attachment taylor_crash.txt added.

dmesg output after kernel bug + SIGSEGV using wlanconfig ath create

01/23/07 01:46:38 changed by Mark Glines <mark@glines.org>

  • attachment preempt-has-no-spinlocks-here.diff added.

disable spinlock checks on PREEMPT && !SMP

01/23/07 06:21:16 changed by Mark Glines <mark@glines.org>

Hi Mike,

Please try the preempt-has-no-spinlocks-here.diff I just attached. I'm running linux-2.6.18-rt7 here on a uniprocessor box, with CONFIG_PREEMPT_DESKTOP=y. Spinlocks are noops in my config, so spin_is_locked() always returns 0. This results in crashes identical to the one you uploaded.

Mark

01/23/07 10:36:38 changed by rozteck@interia.pl

Yes, the preempt-has-no-spinlocks-here.diff is required to make it work. I've also done this change to make it work. Mentor please consider including the patch from Mark.

01/23/07 18:47:37 changed by mike.taylor@apprion.com

Thanks. I'm testing now. I'll let you know after a station reassociation, which is probably where my crashes were happening.

BTW, the null pointer check hack in madwifi.org/ticket/1070 seems to be related and links back to this ticket. With this patch, should 1070 be taken care of? If so, we may want to close that ticket and point it to this one. If not, we may want to include the patch from 1070 in this one, since the patches overlap and whichever one runs second will fail. I'm testing without it on the assumption that the corruption will be resolved by this patch.

Let me know if my assumption is wrong.

- M

01/23/07 19:16:13 changed by Mark Glines <mark@glines.org>

  • attachment preempt-has-no-spinlocks-here2.diff added.

same diff, signed off properly this time

01/23/07 20:27:17 changed by mike.taylor@apprion.com

I'm cautiously optimistic, but it looks good...

r1994 + madwifi-refcnt.8.diff + preempt-has-no-spinlocks-here.diff.

01/23/07 20:29:07 changed by Mark Glines <mark@glines.org>

Yeah, me too. I'm unable to break this build. :)

Mark

01/24/07 07:09:50 changed by rozteck@interia.pl

I've got some automatic tests, such as kicking the station 1000 times at 10s intervals and checking whether it reassociates, changing some parameters and seeing if it still works, etc. The build has survived them; builds without the patch did not.

01/24/07 07:18:03 changed by rozteck@interia.pl

This patch solves the problems reported in tickets: #1070, #1071, #1072, #1073, #1074, #1078. I think that we can close those tickets now.

01/24/07 12:32:54 changed by mrenzmann

These tickets will be closed as soon as the patch that fixes these issues has been committed.

01/24/07 15:07:28 changed by rozteck@interia.pl

I found one problem with that patch - It seems that it broke somehow the signal/noise level reported by iwconfig...

I have made a build without the patch and the reported values are:

Link Quality=53/94 Signal level=-43 dBm Noise level=-96 dBm

while on the build with the patch they are:

Link Quality=53/94 Signal level=-203 dBm Noise level=-256 dBm

Could any one confirm that? Maybe I just have broken something else when patching the sources.

01/24/07 15:48:14 changed by rozteck@interia.pl

I was wrong. That seems to be some other change I've made to the sources... Sorry

01/25/07 09:49:46 changed by anonymous

  • attachment madwifi-0.9.2-jpo.diff added.

Patch to change associated_sta file to madwifi-old format

01/25/07 09:56:13 changed by anonymous

I have just attached the patch madwifi-0.9.2-jpo.diff. This patch mostly restores the associated_sta format from madwifi-old. The associated_sta-file now again shows the refcounts of the ieee80211_nodes. With madwifi-refcnt.8.diff applied I still see the refcount of the first node going continuously up in ad-hoc mode, so there still seems to be a reference leak.

01/26/07 20:05:08 changed by mentor

After a long trawl through disjoint documents it appears spin_is_locked is only usefully defined for either CONFIG_SMP or CONFIG_DEBUG_SPINLOCK.
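A lock-held assertion can be made conditional on those two options so that it disappears where the answer would be meaningless. The sketch below models this in userspace: `SKETCH_CONFIG_SMP` and `SKETCH_CONFIG_DEBUG_SPINLOCK` are hypothetical stand-ins for the kernel options, and `sketch_spin_is_locked()` imitates a uniprocessor, non-debug build where the query always answers 0.

```c
#include <assert.h>

/* Model of a lock that keeps no state on a uniprocessor, non-debug build. */
struct sketch_spinlock {
	int unused;
};

static int sketch_spin_is_locked(struct sketch_spinlock *l)
{
	(void)l;
	return 0;	/* always "not locked" in this configuration */
}

#if defined(SKETCH_CONFIG_SMP) || defined(SKETCH_CONFIG_DEBUG_SPINLOCK)
#define SKETCH_ASSERT_LOCKED(l)	assert(sketch_spin_is_locked(l))
#else
#define SKETCH_ASSERT_LOCKED(l)	((void)(l))	/* check compiled out */
#endif

struct sketch_node {
	struct sketch_spinlock lock;
	int refcnt;
};

/* Caller is expected to hold the node lock; returns the new count. */
static int sketch_unref_locked(struct sketch_node *ni)
{
	SKETCH_ASSERT_LOCKED(&ni->lock);
	return --ni->refcnt;
}
```

Built without either stand-in option defined, the assertion expands to nothing and the call no longer trips on a lock that cannot report its state.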

Looking at the ADHOC case now.

01/26/07 21:55:24 changed by mentor

@anonymous: Could you try turning IEEE80211_DEBUG_REFCNT on in net80211/ieee80211_var.h and getting a log from that?

01/26/07 22:10:53 changed by Mark Glines <mark@glines.org>

Hi mentor,

I'm not @anonymous, but now that I'm looking for it, I do also see an occasional refcount leak. Normal traffic doesn't seem to cause it, but I am seeing node refcounts leak at a rate of about 1 every 5 to 10 minutes. I'm running as an AP, and I am seeing all the stations leak, not just the first one. They don't all increment at the same time, either.

I tried doing an "80211debug node", but there are no obvious clues in that as to where the leak is coming from. The only code which actually announces that it is taking a reference is in ath_tx_start(), and that part doesn't seem to leak.
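One way to make every reference change announce its call site, in the spirit of the func/line arguments visible in the debug variant quoted earlier in this ticket, is to wrap the ref and unref calls in macros. This is a userspace sketch with hypothetical `sketch_*` names, not the driver's debug code.

```c
#include <assert.h>
#include <stdio.h>

struct sketch_node {
	int refcnt;
};

static const char *sketch_last_func;	/* last call site, for inspection */
static int sketch_last_line;

static void sketch_trace(const char *what, struct sketch_node *ni,
			 const char *func, int line)
{
	sketch_last_func = func;
	sketch_last_line = line;
	fprintf(stderr, "%s %p refcnt %d at %s:%d\n",
		what, (void *)ni, ni->refcnt, func, line);
}

static void sketch_ref_debug(struct sketch_node *ni, const char *func, int line)
{
	ni->refcnt++;
	sketch_trace("ref  ", ni, func, line);
}

static void sketch_unref_debug(struct sketch_node *ni, const char *func, int line)
{
	assert(ni->refcnt > 0);
	ni->refcnt--;
	sketch_trace("unref", ni, func, line);
}

/* Call sites use these; the caller's function and line are filled in. */
#define sketch_ref(ni)   sketch_ref_debug((ni), __func__, __LINE__)
#define sketch_unref(ni) sketch_unref_debug((ni), __func__, __LINE__)

/* Example call site that takes a reference. */
static void sketch_tx_start(struct sketch_node *ni)
{
	sketch_ref(ni);
}
```

Pairing up the ref and unref lines per node address in such a log is then enough to see which call site is never balanced.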

When I change "#undef IEEE80211_DEBUG_REFCNT" to "#define IEEE80211_DEBUG_REFCNT" as you suggested, I get:

In file included from /home/paranoid/madwifi-svn-trunk-907-8/ath/../net80211/ieee80211_var.h:51,
                 from /home/paranoid/madwifi-svn-trunk-907-8/ath/if_ath.c:70:
/home/paranoid/madwifi-svn-trunk-907-8/ath/../net80211/ieee80211_node.h: In function `ieee80211_unref_node_debug':
/home/paranoid/madwifi-svn-trunk-907-8/ath/../net80211/ieee80211_node.h:312: warning: implicit declaration of function `IEEE80211_DPRINTF'
/home/paranoid/madwifi-svn-trunk-907-8/ath/../net80211/ieee80211_node.h:312: error: `IEEE80211_MSG_NODE' undeclared (first use in this function)
/home/paranoid/madwifi-svn-trunk-907-8/ath/../net80211/ieee80211_node.h:312: error: (Each undeclared identifier is reported only once
/home/paranoid/madwifi-svn-trunk-907-8/ath/../net80211/ieee80211_node.h:312: error: for each function it appears in.)

It appears the IEEE80211_DPRINTF stuff defined in ieee80211_var.h is below the line where it includes ieee80211_node.h, so they are not yet defined when ieee80211_node.h tries to use them.

I tried fiddling with the order of things a bit in ieee80211_var.h, and so far I am unable to come up with a working combination. I'll keep trying. I think it will probably work better if I split the actual debugging inlines into their own header file, and include that later. (Or just make them normal functions, and stick them in ieee80211_node.c.)

Mark

01/27/07 16:07:16 changed by rozteck@interia.pl

@mentor if you are interested my friend got a crash with your patch applied as follows:

[ 5251.640000] Unable to handle kernel paging request at virtual address 00001148
[ 5251.640000] pgd = c0004000
[ 5251.640000] [00001148] *pgd=00000000
[ 5251.640000] Internal error: Oops: 17 [#
[ 5251.640000] Modules linked in: bridge llc ipt_proximetry sch_sfq sch_htb ipt_REJECT tun iptable_filter iptable_nat ip_nat_ftp ip_nat bonding e100 pcnet32 ppp_async crc_ccitt ppp_generic slhc wlan_scan_ap wlan_scan_sta ath_pci ath_rate_sample ath_hal wlan ath_dfs hdlc syncppp lapb ixp4xx cryptodev ocf ixp400_eth ixp400
[ 5251.640000] CPU: 0
[ 5251.640000] PC is at ieee80211_find_rxnode+0x38/0x88 [wlan]
[ 5251.640000] LR is at ath_rx_tasklet+0x740/0xa60 [ath_pci]
[ 5251.640000] pc : [<bf1c2fc8>]    lr : [<bf53ee04>]    Tainted: P     
[ 5251.640000] sp : c02ddefc  ip : 00001148  fp : c02ddf14
[ 5251.640000] r10: c3b14280  r9 : ffc00690  r8 : 00000000
[ 5251.640000] r7 : 00000020  r6 : c0213780  r5 : 60000013  r4 : c02dc000
[ 5251.640000] r3 : 00000101  r2 : c3b14c04  r1 : 00001148  r0 : c3b14280
[ 5251.640000] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  Segment kernel
[ 5251.640000] Control: 39FF  Table: 0263C000  DAC: 00000017
[ 5251.640000] Process softirq-tasklet (pid: 8, stack limit = 0xc02dc250)
[ 5251.640000] Stack: (0xc02ddefc to 0xc02de000)
[ 5251.640000] dee0:                                                                c3b15690 
[ 5251.640000] df00: 00000000 c0213780 c02ddf70 c02ddf18 bf53ee04 bf1c2f9c c005add8 c27fc01c 
[ 5251.640000] df20: c02c68a0 00000015 c3b15680 00000094 c276e0e0 c3d50000 c170d5a0 c3b14000 
[ 5251.640000] df40: c02dc000 c02c69bc c3b15690 00000000 c0213780 00000020 00000000 00000000 
[ 5251.640000] df60: c021378c c02ddf88 c02ddf74 c003c70c bf53e6d0 c02dc000 c02135e0 c02ddf98 
[ 5251.640000] df80: c02ddf8c c003c758 c003c6a0 c02ddfc4 c02ddf9c c003c96c c003c72c 00000001 
[ 5251.640000] dfa0: c0213780 c02dc000 c02c9f1c c003c85c fffffffc 00000000 c02ddff4 c02ddfc8 
[ 5251.640000] dfc0: c004c644 c003c868 00000001 ffffffff ffffffff 00000000 00000000 00000000 
[ 5251.640000] dfe0: 00000000 00000000 00000000 c02ddff8 c0038f2c c004c560 cc33cc33 cc33cc33 
[ 5251.640000] Backtrace: 
[ 5251.640000] [<bf1c2f90>] (ieee80211_find_rxnode+0x0/0x88 [wlan]) from [<bf53ee04>] (ath_rx_tasklet+0x740/0xa60 [ath_pci])
[ 5251.640000]  r6 = C0213780  r5 = 00000000  r4 = C3B15690 
[ 5251.640000] [<bf53e6c4>] (ath_rx_tasklet+0x0/0xa60 [ath_pci]) from [<c003c70c>] (__tasklet_action+0x78/0x8c)
[ 5251.640000] [<c003c694>] (__tasklet_action+0x0/0x8c) from [<c003c758>] (tasklet_action+0x38/0x40)
[ 5251.640000]  r5 = C02135E0  r4 = C02DC000 
[ 5251.640000] [<c003c720>] (tasklet_action+0x0/0x40) from [<c003c96c>] (ksoftirqd+0x110/0x1bc)
[ 5251.640000] [<c003c85c>] (ksoftirqd+0x0/0x1bc) from [<c004c644>] (kthread+0xf0/0x120)
[ 5251.640000] [<c004c554>] (kthread+0x0/0x120) from [<c0038f2c>] (do_exit+0x0/0x970)
[ 5251.640000]  r8 = 00000000  r7 = 00000000  r6 = 00000000  r5 = 00000000
[ 5251.640000]  r4 = 00000000 
[ 5251.640000] Code: e3c4403f e5943004 e2833001 e5843004 (e5dc3000) 

so it seems that some bug is still there...

01/29/07 11:40:07 changed by pommnitz@yahoo.com

Hello mentor, I'm the anonymous poster. I initially reported the problem last September (see gmane.linux.drivers.madwifi.devel/3203). At that time I did a trace (attachment follows). Currently I encounter the problems mentioned by Mark Glines. Since the leak still seems to be the same, the 0.9.2 trace might still be useful.

The offending node has the MAC address 00:02:6f:23:87:22.

Regards

Joerg

01/29/07 11:46:45 changed by anonymous

  • attachment refcnt added.

leak trace on 0.9.2 compressed with bzip2 (site refuses .bz2 extension)

01/31/07 20:12:54 changed by mentor

Hum, found some leaks I think. No luck on the crash yet.

01/31/07 20:13:52 changed by mentor

  • attachment madwifi-refcnt.9.diff added.

02/01/07 11:19:04 changed by redbyte

Patch madwifi-refcnt.9.diff failed to apply cleanly against svn2059

Output:

patching file net80211/ieee80211_node.c
patching file net80211/ieee80211_node.h
patching file net80211/ieee80211_debug.h
patching file net80211/ieee80211_scan_sta.c
patching file net80211/ieee80211_wireless.c
patching file net80211/ieee80211_input.c
Hunk #8 FAILED at 1248.
Hunk #9 FAILED at 1291.
Hunk #10 succeeded at 1313 (offset 2 lines).
Hunk #11 FAILED at 1353.
Hunk #12 succeeded at 1365 (offset 3 lines).
Hunk #13 succeeded at 1391 (offset 3 lines).
Hunk #14 succeeded at 1401 (offset 3 lines).
Hunk #15 FAILED at 1480.
Hunk #16 succeeded at 1493 (offset 4 lines).
Hunk #17 succeeded at 1588 (offset 4 lines).
Hunk #18 succeeded at 2058 (offset 4 lines).
Hunk #19 succeeded at 2552 (offset 4 lines).
Hunk #20 FAILED at 2833.
Hunk #21 FAILED at 2946.
Hunk #22 succeeded at 2979 (offset 6 lines).
Hunk #23 succeeded at 3009 (offset 6 lines).
Hunk #24 succeeded at 3024 (offset 6 lines).
Hunk #25 succeeded at 3057 (offset 6 lines).
6 out of 25 hunks FAILED -- saving rejects to file net80211/ieee80211_input.c.rej
patching file net80211/ieee80211_output.c
patching file net80211/ieee80211_power.c
patching file net80211/ieee80211_var.h
patching file net80211/ieee80211_proto.c
patching file net80211/ieee80211_linux.h
patching file ath/if_ath.c

02/02/07 05:38:06 changed by mentor

02/02/07 10:08:02 changed by pommnitz@yahoo.com

Hello mentor, I'd like to give your branch a try, but I'm behind a proxy that won't allow the HTTP requests that WebDAV requires. Could you post a patch against the trunk or add the refcount branch to snapshots.madwifi.org?

Thanks in advance

Joerg

02/02/07 18:14:17 changed by mrenzmann

@Joerg: Seen this part of the Subversion FAQ already? Maybe it solves the proxy issue.

02/03/07 09:15:36 changed by pommnitz@yahoo.com

@mrenzmann: Yes, I know. HTTPS didn't work and our firewall doesn't allow bypassing the proxy. Bringing something on a USB stick from home is a big no-no as well, so I'm stuck for now.

02/05/07 06:43:11 changed by mrenzmann

Would tarball downloads from http://snapshots.madwifi.org work for you?

02/05/07 09:07:41 changed by pommnitz@yahoo.com

@mrenzmann: Yes, that's what I meant with "add the refcount branch to snapshots.madwifi.org".

Thanks and happy start to the new week

Joerg

02/05/07 17:09:09 changed by mrenzmann

Ah, sorry, missed that part before :)

Tarballs for the refcount branch will now be generated automatically, download them from http://snapshots.madwifi.org/madwifi-ng-refcount. I generated the most recent tarball manually a few minutes ago.

02/07/07 12:54:19 changed by pommnitz@yahoo.com

@mentor: I could not test the refcount-branch (Yes! I finally cornered one of our admins and got him to adapt the proxy configuration.) because the implementation of ieee80211_msg_is_reported is missing.

Regards Joerg

02/07/07 13:42:43 changed by georg@boerde.de

  • attachment madwifi-refcnt-push-r2088.diff added.

cumulative diff to merge changes in HEAD up to r2088 into madwifi-ng-refcount

02/07/07 14:15:47 changed by pommnitz@yahoo.com

@georg@boerde.de: I looked over your patches, but they do not seem to fix the missing ieee80211_msg_is_reported, do they?

Greetings from Berlin

Jörg

02/07/07 22:11:01 changed by mentor

Try r2093?

02/08/07 11:35:37 changed by pommnitz@yahoo.com

@mentor: Well, ath_pci needs ieee80211_msg_is_reported as well, so you have to export it. With this change in place things look rather well. I just rebooted, so results are preliminary, but the refcount seems to be stable now in ad-hoc mode.

@all: would any committer take a look at my patch madwifi-0.9.2-jpo.diff? It restores the madwifi-old associated_sta format. We have a monitoring application that depends on these data.

Regards Joerg

02/08/07 15:49:32 changed by mentor

Dear God... it worked?!?

02/08/07 16:32:42 changed by georg@boerde.de

Hello again :)

Having ieee80211_msg_is_reported as a module function instead of a macro carries a performance penalty. Why not use the old macro which was already present in HEAD (and just named differently)?

#define ieee80211_msg_is_reported(_vap, _m) (_vap->iv_debug & (_m))

And you should add the diff from r2062 to the branch too, to make it complete.

02/09/07 15:28:08 changed by mentor

Because the function ieee80211_unref_node is defined before struct ieee80211vap. Thus _vap->iv_debug doesn't work because it has no idea what the members of _vap are. I should imagine I will end up moving all the headers around so this works, but for now...
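The ordering problem can be reproduced in a single file. While only a forward declaration of the vap structure is visible, an inline function cannot read its members, but it can still call an out-of-line function that takes a pointer to it. This is a sketch with hypothetical `sketch_*` names; the three commented sections stand in for the node header, the later header, and a .c file.

```c
#include <assert.h>

/* "Node header": the vap type is only forward-declared at this point. */
struct sketch_vap;
int sketch_msg_is_reported(struct sketch_vap *vap, unsigned int mask);

struct sketch_node {
	struct sketch_vap *ni_vap;
	int refcnt;
	int traced;	/* counts debug messages that would be printed */
};

static inline void sketch_unref_node(struct sketch_node *ni)
{
	/*
	 * ni->ni_vap->iv_debug cannot be used here: the structure is
	 * still incomplete. A function call needs only the pointer.
	 */
	if (sketch_msg_is_reported(ni->ni_vap, 0x1))
		ni->traced++;
	assert(ni->refcnt > 0);
	ni->refcnt--;
}

/* "Later header": the full definition finally appears. */
struct sketch_vap {
	unsigned int iv_debug;
};

/* ".c file": the members are visible, so the test can be implemented. */
int sketch_msg_is_reported(struct sketch_vap *vap, unsigned int mask)
{
	return (vap->iv_debug & mask) != 0;
}
```

The cost is one function call per check, which is the performance penalty raised in the previous comment; reordering the headers would allow the macro form again.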

02/13/07 10:46:48 changed by pommnitz@yahoo.com

@all: some more feedback. So far things seem to be fine. The driver did not crash on me and the node refcount seems to be stable at reasonable levels. Unfortunately this is with two nodes only, but I'm pretty sure that I saw the problem in such a situation with 0.9.2. Sometime next week I'll be able to perform tests with more nodes. I'll report on the results.

Regards

Joerg

02/14/07 09:36:04 changed by steffen@saxnet.de

We have the following setup running over 60h:

client --(wlan)-->NODE1(Bridge: ath0(master)+ath1(client))--(Wlan)-->NODE2-ath0(master)

"Client" was running iperf to NODE2. We had a throughput of about 3 MByte.

Client to NODE1 was running at 2.4 GHz and NODE1(ath1) to NODE2 was running at 5.x GHz.

We used refcount 2099 for the test.

02/14/07 11:43:45 changed by mrenzmann

@steffen: NODE1 has two WLAN cards inside in the described scenario, not just one with two VAPs, right?

02/14/07 11:45:48 changed by steffen@saxnet.de

two cards, right

02/14/07 19:15:13 changed by mentor

My list of things to do before merging this is now:

  • Track down any lockups on driver removal (some have been reported)
  • Clean up headers, such that DEBUG_REFCNT works nicely
  • See if I can reproduce the Oops as per rozteck's friend

02/14/07 21:12:10 changed by mike.taylor@apprion.com

  • attachment madwifi-refcnt-push-r2115.diff added.

cumulative diff to merge changes in HEAD up to r2115 into madwifi-ng-refcount

02/14/07 21:14:37 changed by mike.taylor@apprion.com

  • attachment madwifi-refcnt.10.diff added.

This patch will upgrade a build from the trunk r2115 with all changes from the madwifi-ng-refcount branch, including the rollup of changes to 2115.

02/14/07 21:40:20 changed by mike.taylor@apprion.com

  • attachment madwifi-refcnt.11.diff added.

Same as 10, but adds ieee80211_debug.h which was missing in the previous patch.

02/16/07 11:21:14 changed by pommnitz@yahoo.com

Running refcount branch r2096, I have observed machine lockups that might indicate a memory leak. The lockups seem to happen after extended periods of low traffic (e.g. leaving two idle machines running olsrd in adhoc mode overnight). The machine running the above-mentioned Madwifi build responds to console switches and the emergency hotkeys (e.g. CTRL-ALT-SYSRQ xx), but the applications seem to be dead. It's difficult to get a trace because no persistent filesystem is mounted in RW mode (only tmpfs, with a bounded size of 32MB out of 512MB physical RAM). Networking seems to be dead as well, because the other machine no longer shows it as a neighbour.

I don't point a finger yet at the refcount patches, but I'd like to hear from others.

02/16/07 11:23:16 changed by pommnitz@yahoo.com

Oh, and the slab statistics indicate that the 4096 byte cache grows continuously. This might be harmless, but maybe it's not...

02/16/07 17:19:49 changed by pommnitz@yahoo.com

The machine locked up again, and CTRL-ALT-SYSRQ-M indicates that all memory has gone into the slab cache.

02/16/07 20:56:55 changed by mentor

Would you try with some debugging output enabled? Maybe using netconsole too?

Also, try a build with IEEE80211_DEBUG_REFCNT defined too?

Is it isolated to adhoc mode?

02/19/07 16:10:11 changed by pommnitz@yahoo.com

It really seems that the refcount patch is to blame. I have two machines running right now; the only difference is refcount vs. trunk in the madwifi driver. refcount is leaking memory, trunk is not. I'll add a serial console and enable IEEE80211_DEBUG_REFCNT next.

Regards Joerg

02/27/07 13:04:54 changed by rozteck@interia.pl

Today I tried madwifi-ng-refcount-r2123 and got a crash on driver reinit. The crash is as follows:

[   52.090000] Internal error: Oops - bad mode: 0 [#1]
[   52.090000] Modules linked in: sch_sfq sch_htb ipt_REJECT bridge llc tun iptable_filter iptable_nat ip_nat bonding e100 pcnet32 ppp_async crc_ccitt ppp_0
[   52.090000] CPU: 0
[   52.090000] PC is at 0xffff0014
[   52.090000] LR is at zz02dbf875+0x11c/0x26c [ath_hal]
[   52.090000] pc : [<ffff0014>]    lr : [<bf1ef554>]    Tainted: P
[   52.090000] sp : c3b17e68  ip : c3b17eb0  fp : c3b17ec8
[   52.090000] r10: c3b17f10  r9 : c3b17f78  r8 : 400a2978
[   52.090000] r7 : 00000000  r6 : c3784a68  r5 : c3738000  r4 : 00000000
[   52.090000] r3 : 00006208  r2 : 00000001  r1 : 0000a208  r0 : c4840000
[   52.090000] Flags: nZCv  IRQs off  FIQs on  Mode IRQ_32  Segment user
[   52.090000] Control: 39FF  Table: 034B0000  DAC: 00000015
[   52.090000] Process echo (pid: 2474, stack limit = 0xc3b16250)
[   52.090000] Stack: (0xc3b17e68 to 0xc3b18000)
[   52.090000] 7e60:                   c4840000 0000a208 00000001 00006208 00000000 c3738000
[   52.090000] 7e80: c3784a68 00000000 400a2978 c3b17f78 c3b17f10 c3b17ec8 c3b17eb0 c3b17e68
[   52.090000] 7ea0: bf1ef554 ffff0014 60000092 ffffffff c3b17f78 c3732280 c3738000 c3b17f00
[   52.090000] 7ec0: c3b17ecc bf21f0d8 bf1ef444 00000000 c3b17f78 00000000 c3784a68 c3b16000
[   52.090000] 7ee0: c372d900 ffffffff 00000001 00000002 c31835c0 c3b17f3c c3b17f04 c003d740
[   52.090000] 7f00: bf21ee84 c3b17f10 c3b17f78 400a2978 00000002 c31835c0 400a2978 c003d7ec
[   52.090000] 7f20: c3b16000 c3b17f78 c3b16000 400a08d0 c3b17f50 c3b17f40 c003d81c c003d668
[   52.090000] 7f40: c3b17f78 c3b17f74 c3b17f54 c00759ac c003d7f8 c31835e0 c31835c0 c3b17f78
[   52.090000] 7f60: 00000000 00000000 c3b17fa4 c3b17f78 c0075ad8 c00758f0 00000000 00000002
[   52.090000] 7f80: 00000000 00000000 be92feec 00000000 00000004 c001efc4 00000000 c3b17fa8
[   52.090000] 7fa0: c001ee20 c0075a98 00000000 be92feec 00000001 400a2978 00000002 00000002
[   52.090000] 7fc0: 00000000 be92feec 00000000 00000001 00000000 00000000 400a08d0 be92fd54
[   52.090000] 7fe0: be92fd58 be92fd34 4006998c 40090324 20000010 00000001 00000000 00000000
[   52.090000] Backtrace:
[   52.090000] [<bf1ef438>] (zz02dbf875+0x0/0x26c [ath_hal]) from [<bf21f0d8>] (ath_sysctl_halparam+0x260/0x530 [ath_pci])
[   52.090000]  r5 = C3738000  r4 = C3732280
[   52.090000] [<bf21ee78>] (ath_sysctl_halparam+0x0/0x530 [ath_pci]) from [<c003d740>] (do_rw_proc+0xe4/0x12c)
[   52.090000] [<c003d65c>] (do_rw_proc+0x0/0x12c) from [<c003d81c>] (proc_writesys+0x30/0x34)
[   52.090000] [<c003d7ec>] (proc_writesys+0x0/0x34) from [<c00759ac>] (vfs_write+0xc8/0x134)
[   52.090000] [<c00758e4>] (vfs_write+0x0/0x134) from [<c0075ad8>] (sys_write+0x4c/0x74)
[   52.090000]  r8 = 00000000  r7 = 00000000  r6 = C3B17F78  r5 = C31835C0
[   52.090000]  r4 = C31835E0
[   52.090000] [<c0075a8c>] (sys_write+0x0/0x74) from [<c001ee20>] (ret_fast_syscall+0x0/0x2c)
[   52.090000]  r8 = C001EFC4  r7 = 00000004  r6 = 00000000  r5 = BE92FEEC
[   52.090000]  r4 = 00000000
[   52.090000] Code: ea0000dd e59ff410 ea0000bb ea00009a (ea0000fa)
[   52.090000]  <0>Kernel panic - not syncing: Fatal exception
[   52.380000] [<c0022ba0>] (dump_stack+0x0/0x14) from [<c0035dbc>] (panic+0x54/0x12c)
[   52.390000] [<c0035d68>] (panic+0x0/0x12c) from [<c0022e68>] (die+0x26c/0x2b8)
[   52.400000]  r3 = 00000001  r2 = C3B16000  r1 = C0207B78  r0 = C0194B98
[   52.410000] [<c0022bfc>] (die+0x0/0x2b8) from [<c002320c>] (bad_mode+0x4c/0x60)
[   52.410000] [<c00231c0>] (bad_mode+0x0/0x60) from [<bf1ef554>] (zz02dbf875+0x11c/0x26c [ath_hal])
[   52.420000]  r4 = C4840000

02/27/07 13:08:42 changed by steffen@saxnet.de

@rozteck@interia.pl

Can you please describe your test environment? For example: how many devices were in the bridge, which modes were in use, how long had it been running, and was there traffic on the bridge? Any further information like that is important.

02/27/07 13:14:05 changed by rozteck@interia.pl

The device is an Intel IXP425 platform with 2 Atheros cards. ath0 is added to wds0 and ath1 is attached to a bridge with eth0. ath1 is working in AP mode. There's no traffic at all on any of the interfaces. Another crash:

[  173.980000] Bad mode in data abort handler detected: mode IRQ_32
[  173.980000] Internal error: Oops - bad mode: 0 [#1]
[  173.980000] Modules linked in: sch_sfq sch_htb ipt_REJECT bridge llc tun iptable_filter iptable_nat ip_nat bonding e100 pcnet32 ppp_async crc_ccitt ppp_0
[  173.980000] CPU: 0
[  173.980000] PC is at 0xffff0014
[  173.980000] LR is at zz002db51c+0x44/0x3c8 [ath_hal]
[  173.980000] pc : [<ffff0014>]    lr : [<bf1f3f08>]    Tainted: P
[  173.980000] sp : c35dbd20  ip : c35dbd68  fp : c35dbd88
[  173.980000] r10: c3750000  r9 : 00000006  r8 : 00000000
[  173.980000] r7 : 00000000  r6 : c3750000  r5 : c3750000  r4 : c3752688
[  173.980000] r3 : 00005930  r2 : c3750000  r1 : 00009930  r0 : c4840000
[  173.980000] Flags: nzCv  IRQs off  FIQs on  Mode IRQ_32  Segment user
[  173.980000] Control: 39FF  Table: 03574000  DAC: 00000015
[  173.980000] Process ifconfig (pid: 3660, stack limit = 0xc35da250)
[  173.980000] Stack: (0xc35dbd20 to 0xc35dc000)
[  173.980000] bd20: c4840000 00009930 c3750000 00005930 c3752688 c3750000 c3750000 00000000
[  173.980000] bd40: 00000000 00000006 c3750000 c35dbd88 c35dbd68 c35dbd20 bf1f3f08 ffff0014
[  173.980000] bd60: 20000092 ffffffff 00000000 c3750000 c372f1a8 00000000 00000000 c35dbde0
[  173.980000] bd80: c35dbd8c bf1f0240 bf1f3ed0 c00711d8 00000000 00000000 00000000 00000001
[  173.980000] bda0: 00000000 c37504b4 00000000 00000000 00000000 c011dbf0 00000000 c372e280
[  173.980000] bdc0: c372f1a8 c3750000 00000000 c372ef4c c372e000 c35dbe18 c35dbde4 bf21c9bc
[  173.980000] bde0: bf1f001c c35dbdec c001edfc c0045a88 c372e000 00000000 c3b52000 c372e000
[  173.980000] be00: 00000000 00000000 c3b52000 c35dbe30 c35dbe1c c0123580 bf21c860 c3b52280
[  173.980000] be20: c372e280 c35dbe54 c35dbe34 bf1bd84c c0123528 c3b52000 00000000 00001002
[  173.980000] be40: 00000000 c35dbed0 c35dbe64 c35dbe58 bf1bd90c bf1bd7e4 c35dbe7c c35dbe68
[  173.980000] be60: c0123580 bf1bd904 c3b52000 00001043 c35dbe9c c35dbe80 c0125050 c0123528
[  173.980000] be80: ffffff9d 00000000 00000000 beccddac c35dbf08 c35dbea0 c01693b8 c0124ffc
[  173.980000] bea0: c35da000 00000000 00000000 00008914 10430000 00000029 00000028 0000000c
[  173.980000] bec0: 61746831 00000000 00000000 00000000 10430000 00000029 00000028 0000000c
[  173.980000] bee0: c34203a0 00008914 beccddac beccddac 00000000 c35da000 00000000 c35dbf18
[  173.980000] bf00: c35dbf0c c016a6e4 c01690e4 c35dbf38 c35dbf1c c0118ff8 c016a640 c34203a0
[  173.980000] bf20: ffffffe7 00008914 beccddac c35dbf58 c35dbf3c c0089644 c0118e1c c34203a0
[  173.980000] bf40: fffffff7 beccddac 00000003 c35dbf84 c35dbf5c c008997c c0089614 beccded8
[  173.980000] bf60: 00000000 c34203a0 fffffff7 00008914 00000036 c001efc4 c35dbfa4 c35dbf88
[  173.980000] bf80: c00899dc c00896a8 00000000 beccddac 0004e1d8 0005b6bc 00000000 c35dbfa8
[  173.980000] bfa0: c001ee20 c00899a8 beccddac 0004e1d8 00000003 00008914 beccddac beccddac
[  173.980000] bfc0: beccddac 0004e1d8 0005b6bc 00000003 00000004 beccdedc 00000000 beccdc9c
[  173.980000] bfe0: beccdca0 beccdc7c 4008b3cc 4008b33c 20000010 00000003 400a2374 400a2474
[  173.980000] Backtrace:
[  173.980000] [<bf1f3ec4>] (zz002db51c+0x0/0x3c8 [ath_hal]) from [<bf1f0240>] (zz0002dbd2+0x230/0xf90 [ath_hal])
[  173.980000]  r8 = 00000000  r7 = 00000000  r6 = C372F1A8  r5 = C3750000
[  173.980000]  r4 = 00000000
[  173.980000] [<bf1f0010>] (zz0002dbd2+0x0/0xf90 [ath_hal]) from [<bf21c9bc>] (ath_init+0x168/0x2d0 [ath_pci])
[  173.980000] [<bf21c854>] (ath_init+0x0/0x2d0 [ath_pci]) from [<c0123580>] (dev_open+0x64/0xc8)
[  173.980000] [<c012351c>] (dev_open+0x0/0xc8) from [<bf1bd84c>] (ieee80211_init+0x74/0x120 [wlan])
[  173.980000]  r5 = C372E280  r4 = C3B52280
[  173.980000] [<bf1bd7d8>] (ieee80211_init+0x0/0x120 [wlan]) from [<bf1bd90c>] (ieee80211_open+0x14/0x18 [wlan])
[  173.980000]  r8 = C35DBED0  r7 = 00000000  r6 = 00001002  r5 = 00000000
[  173.980000]  r4 = C3B52000
[  173.980000] [<bf1bd8f8>] (ieee80211_open+0x0/0x18 [wlan]) from [<c0123580>] (dev_open+0x64/0xc8)
[  173.980000] [<c012351c>] (dev_open+0x0/0xc8) from [<c0125050>] (dev_change_flags+0x60/0x128)
[  173.980000]  r5 = 00001043  r4 = C3B52000
[  173.980000] [<c0124ff0>] (dev_change_flags+0x0/0x128) from [<c01693b8>] (devinet_ioctl+0x2e0/0x684)
[  173.980000]  r7 = BECCDDAC  r6 = 00000000  r5 = 00000000  r4 = FFFFFF9D
[  173.980000] [<c01690d8>] (devinet_ioctl+0x0/0x684) from [<c016a6e4>] (inet_ioctl+0xb0/0xe4)
[  173.980000] [<c016a634>] (inet_ioctl+0x0/0xe4) from [<c0118ff8>] (sock_ioctl+0x1e8/0x240)
[  173.980000] [<c0118e10>] (sock_ioctl+0x0/0x240) from [<c0089644>] (do_ioctl+0x3c/0x94)
[  173.980000]  r7 = BECCDDAC  r6 = 00008914  r5 = FFFFFFE7  r4 = C34203A0
[  173.980000] [<c0089608>] (do_ioctl+0x0/0x94) from [<c008997c>] (vfs_ioctl+0x2e0/0x300)
[  173.980000]  r7 = 00000003  r6 = BECCDDAC  r5 = FFFFFFF7  r4 = C34203A0
[  173.980000] [<c008969c>] (vfs_ioctl+0x0/0x300) from [<c00899dc>] (sys_ioctl+0x40/0x5c)
[  173.980000]  r8 = C001EFC4  r7 = 00000036  r6 = 00008914  r5 = FFFFFFF7
[  173.980000]  r4 = C34203A0
[  173.980000] [<c008999c>] (sys_ioctl+0x0/0x5c) from [<c001ee20>] (ret_fast_syscall+0x0/0x2c)
[  173.980000]  r6 = 0005B6BC  r5 = 0004E1D8  r4 = BECCDDAC
[  173.980000] Code: ea0000dd e59ff410 ea0000bb ea00009a (ea0000fa)
[  173.980000]  <0>Kernel panic - not syncing: Fatal exception
[  174.560000] [<c0022ba0>] (dump_stack+0x0/0x14) from [<c0035dbc>] (panic+0x54/0x12c)
[  174.570000] [<c0035d68>] (panic+0x0/0x12c) from [<c0022e68>] (die+0x26c/0x2b8)
[  174.570000]  r3 = 00000001  r2 = C35DA000  r1 = C0207B78  r0 = C0194B98
[  174.580000] [<c0022bfc>] (die+0x0/0x2b8) from [<c002320c>] (bad_mode+0x4c/0x60)
[  174.590000] [<c00231c0>] (bad_mode+0x0/0x60) from [<bf1f3f08>] (zz002db51c+0x44/0x3c8 [ath_hal])
[  174.600000]  r4 = C4840000

02/27/07 13:28:17 changed by rozteck@interia.pl

I have removed ath1 from the bridge and that seems to solve the problem. But I got another one: I'm working with WDS, and it seems that the refcount patch breaks this functionality. I was using r1860 with the refcount patch applied and WDS was not working. I switched to clean r1860 and it works. I tried madwifi-ng-refcount-r2123 and WDS is not working there either. Does anyone have WDS working properly on madwifi with the refcount patch?

03/06/07 09:16:44 changed by dyqith

Hi, Just wanted to check in for any updates on this ticket/branch.

Any specific items that I should look at so we can merge this to trunk ?

I tried it on a couple of devices I have and it works great (for now). I'm running mine with 2 Atheros cards and bridging + WDS with no problems so far. Will check on it in a few hours.

03/06/07 11:03:58 changed by mrenzmann

Merge should not be started before the 0.9.3 release is out. I intend to call the freeze later today and plan to get the release out by the end of next week (March 15 or 16).

03/06/07 15:15:18 changed by mentor

I must admit that both those oops from rozteck look highly HAL related.

As to task list before merge:

  • Cleanup DEBUG support
  • Stop memory leakage
  • Check WDS work
  • Check reports of lockup on module unload
  • Wait for 0.9.3 to release

03/06/07 22:03:33 changed by dyqith

Minor update: running refcnt branch on 11 nodes (all with 2 radios, bridged with wds -- each radio is an AP, with one radio with an extra STA vap)

No problems so far. No memory leaks as far as I can tell.

Will test some other wds configs and see if those have any problems.

03/07/07 09:03:10 changed by steffen@saxnet.de

We still have no oops or any "magic foo"; all nodes with the refcnt branch are working well with a lot of different setups.

The only strange thing is that we have 100-150kbyte more throughput with svn1491. We have tried this with more than 3 different devices, antennas, cards... Every time we had three nodes, one with two cards (AP & STA) and the other two nodes as master and client. Then iperf was run through them.

03/08/07 21:06:31 changed by mentor

I've fixed (up to compile testing) DEBUG stuff in r2189, and merged up to trunk HEAD in r2190.

I'm now fairly happy that WDS works.

I really would like more information from Joerg and Ge0rg on their respective issues.

03/08/07 22:24:21 changed by rozteck@interia.pl

I have WDS working, but only while data is being transferred over the link. When there's no communication for about a minute, the node is removed on one device and is never recreated properly, which means WDS doesn't work anymore. Does anyone else see this issue, or is this another thing I broke with my patches?

03/08/07 22:27:05 changed by mentor

Ah. That may be inactivity timeout related. Would you get a debug log of that, and I'll have a look...?
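If an inactivity timeout is indeed involved, the general mechanism would look roughly like the sketch below. This is a user-space illustration only, not code from the branch: struct peer, peer_rx() and peer_tick() are hypothetical names, and the reload value is an assumption. The point is that a peer which receives nothing for a full countdown is dropped, and something else then has to recreate it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a node entry with an inactivity countdown. */
struct peer {
    int  inact;         /* ticks left before the peer is considered idle */
    int  inact_reload;  /* value restored whenever a frame is received   */
    bool present;       /* false once the peer has been timed out        */
};

/* Called for every frame received from the peer: restart the countdown. */
static void peer_rx(struct peer *p)
{
    if (p->present)
        p->inact = p->inact_reload;
}

/* Called from a periodic timer: count down and drop the peer at zero. */
static void peer_tick(struct peer *p)
{
    if (!p->present)
        return;
    assert(p->inact > 0);
    if (--p->inact == 0)
        p->present = false;   /* the entry is gone until recreated */
}
```

With a countdown of this kind, an idle WDS link loses its peer entry after inact_reload ticks unless traffic keeps calling peer_rx(); whether the branch behaves this way is exactly what the requested debug log would show.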

03/27/07 00:24:35 changed by ddrake@brontes3d.com

I'm having issues in ticket #1200 where madwifi is contending freed locks and causing system hangs. It was suggested that I try this branch.

I checked out madwifi-ng-refcount r2190 and it built OK, however wlan.ko would not load due to unknown symbol ieee80211_find_node. I fixed that by #undef IEEE80211_DEBUG_REFCNT in ieee80211_debug.h. I also had to remove -Werror from build flags to get it to build at that point.

At this point the modules load fine. I establish a connection with wpa_supplicant and ping the outside world, it is working OK.

I then stop wpa_supplicant, and the system crashes right away. Again it seems to be contending a freed lock.

NMI watchdog detected LOCKUP on CPU 2
RIP: .text.lock.spinlock+0x22
Call trace:
ieee80211_remove_wds_addr+0x1e/0xa0
ieee80211_node_leave
ieee80211_node_table_reset
ieee80211_reset_bss
__ieee80211_newstate
ieee80211_newstate
ath_newstate
sock_sendmsg
ieee80211_new_state
ieee80211_ioctl_setmlme
[...]

03/27/07 00:44:07 changed by ddrake@brontes3d.com

I enabled lock debugging, frame unwinding, and serial console. Lock debugging didn't find anything, but here's the full trace:

NMI Watchdog detected LOCKUP on CPU 1                                                                                                                 
CPU 1                                                                                                                                                       
Modules linked in: wlan_wep wlan_scan_sta ath_rate_sample ath_pci wlan nvidia ath_hal                                                                       
Pid: 4755, comm: wpa_supplicant Tainted: P      2.6.18-brontes-r7 #1                                                                                        
RIP: 0010:[<ffffffff8020d1cc>]  [<ffffffff8020d1cc>] __delay+0xc/0x20                                                                                       
RSP: 0018:ffff81007db69a88  EFLAGS: 00000006                                                                                                                
RAX: 000000002d96d41e RBX: ffff81003e318fc8 RCX: 000000002d96d415                                                                                           
RDX: 0000000000000026 RSI: ffffffff80550834 RDI: 0000000000000001                                                                                           
RBP: ffff81007db69a88 R08: 0000000000000001 R09: 0000000000000001                                                                                           
R10: ffffffff885a9ac7 R11: 0000000000000000 R12: 0000000000000000                                                                                           
R13: 0000000012c0b4b4 R14: 0000000000000001 R15: ffff81003e318fc8                                                                                           
FS:  00002b701aacb6d0(0000) GS:ffff81004004f560(0000) knlGS:0000000000000000                                                                                
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b                                                                                                           
CR2: 00002b701a8bd450 CR3: 000000007c52f000 CR4: 00000000000006e0                                                                                           
Process wpa_supplicant (pid: 4755, threadinfo ffff81007db68000, task ffff8100403780c0)                                                                      
Stack:  ffff81007db69ab8 ffffffff80207a06 ffff81003e318fc8 0000000000000082                                                                                 
 ffff81003e1ed1ad ffff81003e318fc8 ffff81007db69ad8 ffffffff8026dbf9                                                                                        
 0000000000000009 ffff81003e318fb8 ffff81007db69b18 ffffffff885a9ac7                                                                                        
Call Trace:                                                                                                                                                 
 [<ffffffff80207a06>] _raw_spin_lock+0xc6/0x160                                                                                                             
 [<ffffffff8026dbf9>] _spin_lock_irqsave+0x39/0x50                                                                                                          
 [<ffffffff885a9ac7>] :wlan:ieee80211_remove_wds_addr+0x27/0xc0                                                                                             
 [<ffffffff885aa223>] :wlan:ieee80211_node_leave+0x33/0x330                                                                                                 
 [<ffffffff885aa8cc>] :wlan:ieee80211_node_table_reset+0x8c/0xc0                                                                                            
 [<ffffffff885aa92d>] :wlan:ieee80211_reset_bss+0x2d/0xf0                                                                                                   
 [<ffffffff885b0a8a>] :wlan:__ieee80211_newstate+0x16a/0x6d0                                                                                                
 [<ffffffff885b120b>] :wlan:ieee80211_newstate+0x21b/0x240                                                                                                  
 [<ffffffff885dadc7>] :ath_pci:ath_newstate+0x6a7/0x6e0                                                                                                     
 [<ffffffff885af676>] :wlan:ieee80211_new_state+0x46/0x70                                                                                                   
 [<ffffffff885b5c6c>] :wlan:ieee80211_ioctl_setmlme+0x11c/0x260                                                                                             
 [<ffffffff804bc073>] wireless_process_ioctl+0x2e3/0x3e0                                                                                                    
 [<ffffffff804b2086>] dev_ioctl+0x356/0x3b0                                                                                                                 
 [<ffffffff804a82c0>] sock_ioctl+0x240/0x270                                                                                                                
 [<ffffffff80246a51>] do_ioctl+0x31/0xa0                                                                                                                    
 [<ffffffff8023411b>] vfs_ioctl+0x2ab/0x2d0                                                                                                                 
 [<ffffffff80252c9a>] sys_ioctl+0x4a/0x80                                                                                                                   
 [<ffffffff80266476>] system_call+0x7e/0x83                                                                                                                 
 [<00002b701a954fc7>]                                                                                                                                       
                                                                                                                                                            
Code: 29 c8 48 39 f8 72 f5 c9 c3 90 90 90 90 90 90 90 90 90 90 90                                                                                           
console shuts up ...                                                                                                                                        
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!                                                                                            

This is reproducible every time wpa_supplicant is stopped.

03/30/07 21:49:10 changed by mike.taylor@apprion.com

I was able to reproduce this with a failed association between a Broadcom radio with different WEP settings than required:

Unable to handle kernel NULL pointer dereference at virtual address 000000d4
pgd = c0004000
[000000d4] *pgd=00000000
Internal error: Oops: 17 #1
Modules linked in: wlan_acl wlan_wep wlan_scan_ap ixp400_eth m41txx hw_random_ixp465 ixp400 ath_rate_sample ath_pci wlan ath_hal ip6t_multiport ip6t_physdev ip6t_state ip6_conntrack ip6t_mac ip6t_LOG ip6t_REJECT ip6table_filter ip6table_mangle ip6_tables ipv6 ipt_multiport ipt_REJECT ipt_REDIRECT ipt_MASQUERADE ipt_mac ipt_physdev ipt_limit ipt_iprange ipt_LOG iptable_mangle iptable_nat ip_conntrack iptable_filter ip_tables ebt_redirect ebt_mark_m ebt_pkttype ebt_snat ebt_dnat ebt_log ebt_vlan ebtable_broute ebtable_filter ebtable_nat ebtables bridge netlink_dev 8021q
CPU: 0
PC is at ieee80211_remove_wds_addr+0x2c/0x120 [wlan]
LR is at ieee80211_node_leave+0x70/0x4ac [wlan]
pc : [<bf1167e4>] lr : [<bf117b34>] Tainted: PF
sp : c0225be0 ip : c0225c10 fp : c0225c0c
r10: c78d316d r9 : c78fe046 r8 : c7ede220
r7 : 60000013 r6 : c72c0220 r5 : 00000000 r4 : c78d316d
r3 : 00000011 r2 : 60000093 r1 : c78d316d r0 : 00000044
Flags: nZCv IRQs off FIQs on Mode SVC_32 Segment kernel
Control: 39FF Table: 04358000 DAC: 00000017
Process swapper (pid: 0, stack limit = 0xc02241a0)
Stack: (0xc0225be0 to 0xc0226000)
5be0: c7ede220 00000000 000000b0 c7f496c8 c78d3000 00000000 c72c0220 00000001
5c00: c0225c54 c0225c10 bf117b34 bf1167c4 c78fe046 c78d3000 c0225c54 c0225c28
5c20: bf10f52c bf115774 00000000 c72c0220 c78d3000 c78fe038 00000001 0000000d
5c40: c78fe046 c78d3000 c0225d64 c0225c58 bf111ca0 bf117ad0 c0225cb4 c0225c68
5c60: bf17bf04 bf17b390 c4744360 c7ede024 00000000 c7ede000 00000000 00000040
5c80: c0194bc8 00000000 00000000 c78d3000 00000000 00000001 00000001 00000080
5ca0: 00000000 c018742c 00000001 00000080 00000000 bf17bc7c c0225d3c c3818980
5cc0: 00000001 00000050 c78fe020 c7ede220 0000002a 000000b0 00000001 c0195088
5ce0: c0225d24 0000000e 00000040 00000008 0000001c 00000009 00000000 000000fa
5d00: c0187638 00000001 00000000 00000000 00000007 c78802e0 00000000 000000c2
5d20: c7184aa0 00000000 c0225d44 c0225d38 c0065ac0 c0066330 00000000 000000b0
5d40: 0000002a c78fe030 00000000 c78d3000 c78d3000 c72c0220 c0225da4 c0225d68
5d60: bf144f78 bf10fde0 00006fa1 bf17cfe0 c7ede220 00006fa1 c005aa50 c78d3000
5d80: c72c0220 c78fe030 00000000 c78d3000 c7860080 c78fe020 c0225e44 c0225da8
5da0: bf1152c8 bf144f38 00006fa1 00000012 c02809b8 00040000 c0225df4 c0225dc8
5dc0: c005afc0 c00592bc c3fa6800 00000001 000000b0 00000000 00040000 c72c0000
5de0: c7ede220 00006fa1 0000002a c0225df8 c005940c c005ae94 c413f0a0 c4139800
5e00: 0000004c 0000004d c4746d20 c4156c48 00057000 00000c20 00000002 c78d3000
5e20: 00000000 c72c0220 00006fa1 0000002a c7860080 ffffffff c0225e7c c0225e48
5e40: bf115614 bf113c88 c0225e7c c0225e58 bf1176b4 bf116c6c 00000000 00000000
5e60: c0286dc4 c0286df0 0000000a c7ede220 c0225edc c0225e80 bf14c33c bf115598
5e80: bf1481d0 bf0e6778 00000001 11ff6fb0 c0225eac 0000002a ffc00090 c4268000
5ea0: c7953890 c7ede000 00000001 11ff6fb0 c4268000 c7edf604 00000000 c0286dc4
5ec0: c0286df0 0000000a ffffffff c0286da0 c0225efc c0225ee0 c0075b9c bf14ba80
5ee0: ffffffff c0224000 00000100 00000001 c0225f24 c0225f00 c0075670 c0075b18
5f00: c0224000 00000000 10000000 c005b838 c0225f70 c0280580 c0225f3c c0225f28
5f20: c007574c c0075628 0000001c c0280c10 c0225f6c c0225f40 c005afbc c0075720
5f40: 00040000 c005b880 c0225fa4 0000001f 10000000 c005b838 60000013 00026118
5f60: c0225fcc c0225f70 c005940c c005ae94 c414b540 c42b6b00 00000000 00000000
5f80: c005b834 c0224000 c02899fc c029bf7c 000261a8 69054202 00026118 c0225fcc
5fa0: c0225fb8 c0225fb8 c005b8cc c005b838 60000013 ffffffff 00000000 c02800bc
5fc0: c0225ffc c0225fd0 c0008954 c005b894 c0008488 00000000 00000000 c0281518
5fe0: 00000000 000039fd c0281500 c0227010 00000000 c0226000 0000809c c00087f0
Backtrace:
[<bf1167b8>] (ieee80211_remove_wds_addr+0x0/0x120 [wlan]) from [<bf117b34>] (ieee80211_node_leave+0x70/0x4ac [wlan])
r7 = 00000001 r6 = C72C0220 r5 = 00000000 r4 = C78D3000
[<bf117ac4>] (ieee80211_node_leave+0x0/0x4ac [wlan]) from [<bf111ca0>] (ieee80211_recv_mgmt+0x1ecc/0x3ea8 [wlan])
[<bf10fdd4>] (ieee80211_recv_mgmt+0x0/0x3ea8 [wlan]) from [<bf144f78>] (ath_recv_mgmt+0x4c/0x264 [ath_pci])
[<bf144f2c>] (ath_recv_mgmt+0x0/0x264 [ath_pci]) from [<bf1152c8>] (ieee80211_input+0x164c/0x1910 [wlan])
[<bf113c7c>] (ieee80211_input+0x0/0x1910 [wlan]) from [<bf115614>] (ieee80211_input_all+0x88/0x174 [wlan])
[<bf11558c>] (ieee80211_input_all+0x0/0x174 [wlan]) from [<bf14c33c>] (ath_rx_tasklet+0x8c8/0xa40 [ath_pci])
[<bf14ba74>] (ath_rx_tasklet+0x0/0xa40 [ath_pci]) from [<c0075b9c>] (tasklet_action+0x90/0xe4)
[<c0075b0c>] (tasklet_action+0x0/0xe4) from [<c0075670>] (_do_softirq+0x54/0xf8)
r6 = 00000001 r5 = 00000100 r4 = C0224000
[<c007561c>] (_do_softirq+0x0/0xf8) from [<c007574c>] (do_softirq+0x38/0x58)
[<c0075714>] (do_softirq+0x0/0x58) from [<c005afbc>] (asm_do_IRQ+0x134/0x158)
r5 = C0280C10 r4 = 0000001C
[<c005ae88>] (asm_do_IRQ+0x0/0x158) from [<c005940c>] (irq_svc+0x2c/0xa0)
[<c005b888>] (cpu_idle+0x0/0x60) from [<c0008954>] (start_kernel+0x170/0x1b4)
r5 = C02800BC r4 = 00000000
[<c00087e4>] (start_kernel+0x0/0x1b4) from [<0000809c>] (0x809c)
Code: e10f7000 e3872080 e121f002 e0800103 (e5905090)
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!

03/30/07 21:49:56 changed by mike.taylor@apprion.com

Sorry about the formatting. My bad.

04/04/07 23:22:36 changed by mike.taylor@apprion.com

I ran into a circular dependency with the placement of IEEE80211_DEBUG_REFCNT definition.

ieee80211_node.h defines a static inline function, ieee80211_unref_node_debug. This function requires ieee80211_debug.h. ieee80211_debug.h requires struct ieee80211vap, which in turn requires ieee80211_node.h.

The current code in the branch only works because IEEE80211_DEBUG_REFCNT is not defined when ieee80211_node.h is included before ieee80211_var.h and there are no uses of unref node at that point.

Any time ieee80211_node.h was included before (or without) ieee80211_var.h you would get mixed debug and non-debug unref code. After merging with the trunk, and a few other patches, at least one situation arose where this circular dependency bit me in if_ath.c.

Whenever this situation occurs, the result is a broken build with undefined references to ieee80211_find_node and ieee80211_find_txnode in wlan.ko (possibly more).

The simplest solution is to move ieee80211_debug.h up to the top of ieee80211_var.h (before ieee80211_node.h) and then move ieee80211_unref_node_debug from ieee80211_node.h to ieee80211_node.c.

I'll post a signed patch attached in a minute.
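For readers trying to picture the cycle, here is a minimal single-file sketch of the general technique (forward-declare the struct, declare the debug function in the header, and define it out of line where both types are complete). It is not the attached patch and not the branch's code: struct node, struct vap, node_unref_debug() and the node_unref() macro are hypothetical names, and the three "files" are only marked by comments.

```c
#include <assert.h>
#include <stdio.h>

/* --- "node.h" (hypothetical): needs only a forward declaration --- */
struct vap;                       /* incomplete type is enough for a pointer */

struct node {
    struct vap *vap;
    int refcnt;
};

/* Declared here, defined in "node.c": the body needs the full struct vap,
 * so it can no longer be a static inline in this header. */
void node_unref_debug(struct node **pni, const char *func, int line);

#define node_unref(pni) node_unref_debug((pni), __func__, __LINE__)

/* --- "var.h" (hypothetical): full definition, may refer to struct node --- */
struct vap {
    struct node *bss;
    int debug_refcnt;             /* nonzero: log every unref */
};

/* --- "node.c" (hypothetical): both complete types are visible here --- */
void node_unref_debug(struct node **pni, const char *func, int line)
{
    struct node *ni = *pni;

    assert(ni != NULL && ni->refcnt > 0);
    if (ni->vap->debug_refcnt)
        printf("%s:%d unref node %p, refcnt %d -> %d\n",
               func, line, (void *)ni, ni->refcnt, ni->refcnt - 1);
    ni->refcnt--;
    *pni = NULL;                  /* the caller's pointer must not be reused */
}
```

The trade-off is that the debug variant becomes a real function call instead of an inline, which is usually acceptable for a debug-only path.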

04/04/07 23:23:59 changed by mike.taylor@apprion.com

Update: Actually this should be happening in the current code base. We should be seeing inline definitions of both functions.

04/04/07 23:56:44 changed by mike.taylor@apprion.com

  • attachment madwifi-refcount-ng-debug-dependency-cycle.diff added.

Signed patch describing/resolving cycle in ieee 80211 debugging code

04/05/07 00:08:38 changed by mike.taylor@apprion.com

madwifi-refcount-ng-debug-dependency-cycle.diff patches three files. I'm not sure why the trac system is not showing net80211/ieee80211_node.h changes.

04/05/07 00:48:57 changed by mentor

Note r2251, r2254, and r2255.

I know about the problems in the headers; I'm just not entirely sure about the solution to it. My header-fu is lacking on the subject. However, I don't think your solution is the correct one.

04/05/07 01:08:05 changed by mike.taylor@apprion.com

I don't disagree. Thanks for the heads up on the merge. I will throw my hack away.

04/08/07 18:55:51 changed by mentor

Do the latest revisions help? I have a card to test with now, but I'm hitting other crashes.

04/09/07 17:47:30 changed by ddrake@brontes3d.com

r2256 of the branch solves the issues I was experiencing earlier (compilation problems, immediate failure on wpa_supplicant shutdown). I'm starting some stress testing now in hope that this branch solves #1200...

04/09/07 22:27:56 changed by newuser

Hi, I am relatively new to the madwifi group. I just switched from madwifi-old to ng, and noticed the refcount problem. I was trying the latest tar from refcount branch on a routerboard 230. And the routerboard crashes as soon as the module is loaded, with almost no debugging information. I was wondering if someone else on this list experienced the same problem - and any solutions would be greatly appreciated.

04/09/07 22:29:41 changed by newuser

Oh, by the way, the card is an Atheros mini PCI card from Netgate.

04/09/07 22:47:10 changed by newuser

Hi again. To get a better idea, I first tried just changing free -> unref on the 0.9.3 source, and that part worked. But changing from node allocation to node_table allocation and from node locks to node table locks causes it to crash. Unfortunately I am not able to get more specific information :(. Once again, thanks in advance for any help in this regard.
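As a rough picture of what a free -> unref conversion means, here is a minimal user-space sketch of a reference count protected by a per-node lock, which is the approach the ticket description outlines. It is an illustration under assumptions, not the branch's code: node_alloc(), node_ref() and node_unref() are hypothetical names, a C11 atomic_flag stands in for the kernel spin lock, and nodes_freed exists only so the behaviour can be checked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    atomic_flag lock;   /* stand-in for the per-node spin lock */
    int refcnt;         /* protected by lock */
};

static int nodes_freed; /* bookkeeping for the example only */

static void node_lock(struct node *ni)
{
    while (atomic_flag_test_and_set(&ni->lock))
        ;               /* spin until the flag was previously clear */
}

static void node_unlock(struct node *ni)
{
    atomic_flag_clear(&ni->lock);
}

/* Allocate a node; the caller owns the initial reference. */
static struct node *node_alloc(void)
{
    struct node *ni = malloc(sizeof(*ni));

    if (ni == NULL)
        return NULL;
    atomic_flag_clear(&ni->lock);
    ni->refcnt = 1;
    return ni;
}

/* Take an additional reference on a node the caller already holds. */
static struct node *node_ref(struct node *ni)
{
    node_lock(ni);
    assert(ni->refcnt > 0);
    ni->refcnt++;
    node_unlock(ni);
    return ni;
}

/* Drop one reference and clear the caller's pointer; only the last
 * unref actually releases the memory. */
static void node_unref(struct node **pni)
{
    struct node *ni = *pni;
    int last;

    *pni = NULL;
    node_lock(ni);
    assert(ni->refcnt > 0);
    last = (--ni->refcnt == 0);
    node_unlock(ni);
    if (last) {
        free(ni);
        nodes_freed++;
    }
}
```

Clearing the caller's pointer in node_unref() makes accidental reuse after the drop fail early instead of touching freed memory, including a freed lock.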

04/10/07 11:13:38 changed by mrenzmann

Replying to newuser:

And the routerboard crashes as soon as the module is loaded, with almost no debugging information.

Maybe DevDocs/KernelOops helps to obtain a dump of the oops message in case there is any.

05/07/07 16:51:13 changed by ddrake@brontes3d.com

I've now been running my test script against the refcount branch extensively, with no crashes. This branch appears to fix ticket #1200. Thanks!

05/21/07 20:45:42 changed by mentor

  • status changed from assigned to closed.
  • resolution set to fixed.
  • milestone set to version 0.9.4.

Well, I've merged this branch to trunk, so any problems should be opened in new tickets.

r2357

02/11/08 06:12:56 changed by mrenzmann

  • milestone changed from version 0.9.4 to version 0.9.5.