Please note: This project is no longer active. The website is kept online for historic purposes only.
If you´re looking for a Linux driver for your Atheros WLAN device, you should continue here .

Ticket #900 (new defect)

Opened 13 years ago

Last modified 12 years ago

HAL_INT_MIB is never cleared in ath_intr()

Reported by: valins@soften.ktu.lt Assigned to:
Priority: major Milestone:
Component: madwifi: HAL Version: trunk
Keywords: Cc: valins@soften.ktu.lt
Patch is attached: 0 Pending:

Description

Sometimes ath_intr() failes to clear HAL_INT_MIB. This then results in endless irq floods which is about 50K irqs/s. ath_intr() leaves and is invoked immediately again. This leads to a soft-lockup, if kernel compiled with soft-lockup detection - such lockup is detected (no progress within 10s interval made, linux 2.6.18-rc{2,5,6,7}, Xscale IXDP425, 233Mhz, Big Endian).

Doing sysrq-P shows the same backtrace, sometimes driver spends time under ath_hal.ko, sometimes in ath_hal_mibevent(). I can provide backtraces obtained this way. The only solution to cure this problem I found - was invoke ath_reset() via tasklet (as in HAL_INT_FATAL case). To iliustrate the problem I've come up with the following debug/solution patch (see attached to the ticket).

The output is the following now:

[ 5566.752000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 5566.760000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.768000] ath_intr(): MIB retry count: 3, status: 0x1041, mibstatus: 0x1001000
[ 5566.776000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.780000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 5566.788000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.796000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 5566.804000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.812000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 5566.816000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.824000] count: 5, saved: 1001009, masked: 1001
[ 5566.832000] wifi0: hardware error; reseting
[ 6357.772000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 6357.780000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.788000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 6357.796000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.800000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 6357.808000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.816000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 6357.824000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.832000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 6357.836000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.844000] count: 5, saved: 1001000, masked: 1000
[ 6357.848000] wifi0: hardware error; reseting

Now device survives stress tests, without it - it locks up and reports :

[  667.448000] BUG: soft lockup detected on CPU#0!
[  667.448000] [<c001f380>] (dump_stack+0x0/0x14) from [<c0055838>] (softlockup_tick+0x94/0xc4)
[  667.448000] [<c00557a4>] (softlockup_tick+0x0/0xc4) from [<c00416c0>] (run_local_timers+0x18/0x1c)
[  667.448000]  r7 = C360BD64  r6 = FFBF0FFF  r5 = 00000000  r4 = C0393AC0
[  667.448000] [<c00416a8>] (run_local_timers+0x0/0x1c) from [<c00416f8>] (update_process_times+0x34/0x70)
[  667.448000] [<c00416c4>] (update_process_times+0x0/0x70) from [<c001e95c>] (timer_tick+0xc8/0xf8)
[  667.448000]  r5 = 000411A8  r4 = C360BD64 
[  667.448000] [<c001e894>] (timer_tick+0x0/0xf8) from [<c002596c>] (ixp4xx_timer_interrupt+0x60/0xa4)
[  667.448000]  r6 = FFBF0FFF  r5 = 000411A8  r4 = C0201ABC 
[  667.448000] [<c002590c>] (ixp4xx_timer_interrupt+0x0/0xa4) from [<c0055c08>] (handle_IRQ_event+0x4c/0x8c)
[  667.448000]  r8 = 00000000  r7 = 00000005  r6 = C360BD64  r5 = 00000000
[  667.448000]  r4 = C01C0B80 
[  667.448000] [<c0055bbc>] (handle_IRQ_event+0x0/0x8c) from [<c00574c0>] (handle_level_irq+0x88/0x110)
[  667.448000]  r8 = 00000000  r7 = C02D0000  r6 = C360BD64  r5 = 00000005
[  667.448000]  r4 = C01BE140 
[  667.448000] [<c0057438>] (handle_level_irq+0x0/0x110) from [<c001b710>] (asm_do_IRQ+0x40/0x50)
[  667.448000]  r6 = 00000020  r5 = 0000001F  r4 = C01BE140 
[  667.448000] [<c001b6d0>] (asm_do_IRQ+0x0/0x50) from [<c001a9e4>] (__irq_svc+0x24/0x60)
[  667.448000]  r4 = FFFFFFFF 
[  667.448000] [<bf0b22f8>] (ath_hal_reg_write+0x0/0x84 [ath_hal]) from [<bf0cc790>] (zz05b8c65a+0x34/0x12c [ath_hal]
)
[  667.448000] [<bf0cc75c>] (zz05b8c65a+0x0/0x12c [ath_hal]) from [<bf13f104>] (ath_intr+0x1d4/0xcd4 [ath_pci])
[  667.448000]  r7 = C02D0000  r6 = C34A2280  r5 = 00000000  r4 = C34A2000
[  667.448000] [<bf13ef30>] (ath_intr+0x0/0xcd4 [ath_pci]) from [<c0055c08>] (handle_IRQ_event+0x4c/0x8c)
[  667.448000] [<c0055bbc>] (handle_IRQ_event+0x0/0x8c) from [<c00574c0>] (handle_level_irq+0x88/0x110)
[  667.448000]  r8 = 00BFFF38  r7 = C02D2CA8  r6 = C360BE94  r5 = 0000001C
[  667.448000]  r4 = C01BE700 
[  667.448000] [<c0057438>] (handle_level_irq+0x0/0x110) from [<c001b710>] (asm_do_IRQ+0x40/0x50)
[  667.448000]  r6 = 10000000  r5 = 0000001F  r4 = C01BE700 
[  667.448000] [<c001b6d0>] (asm_do_IRQ+0x0/0x50) from [<c001a9e4>] (__irq_svc+0x24/0x60)
[  667.448000]  r4 = FFFFFFFF 
[  667.448000] [<bf0ccac4>] (zz002daf00+0x0/0x2c8 [ath_hal]) from [<bf13ea38>] (ath_rx_tasklet+0x414/0x90c [ath_pci])
[  667.448000]  r8 = C3EE1000  r7 = 00000008  r6 = C3957EF0  r5 = FFC006C0
[  667.448000]  r4 = C34A2280 
[  667.448000] [<bf13e624>] (ath_rx_tasklet+0x0/0x90c [ath_pci]) from [<c003c664>] (tasklet_action+0x74/0xe8)
[  667.448000] [<c003c5f0>] (tasklet_action+0x0/0xe8) from [<c003c528>] (__do_softirq+0x70/0xe0)
[  667.448000]  r6 = 00000009  r5 = C0206E70  r4 = 00000001 
[  667.448000] [<c003c4b8>] (__do_softirq+0x0/0xe0) from [<c003c8e4>] (irq_exit+0x48/0x50)
[  667.448000] [<c003c89c>] (irq_exit+0x0/0x50) from [<c001b714>] (asm_do_IRQ+0x44/0x50)
[  667.448000] [<c001b6d0>] (asm_do_IRQ+0x0/0x50) from [<c001aba0>] (__irq_usr+0x40/0x80)

or

 [  538.104000] BUG: soft lockup detected on CPU#0!
[  538.104000] [<c001f380>] (dump_stack+0x0/0x14) from [<c0055838>] (softlockup_tick+0x94/0xc4)
[  538.104000] [<c00557a4>] (softlockup_tick+0x0/0xc4) from [<c00416c0>] (run_local_timers+0x18/0x1c)
[  538.104000]  r7 = C37BF6F4  r6 = FFBF0FFF  r5 = 00000000  r4 = C030D580
[  538.104000] [<c00416a8>] (run_local_timers+0x0/0x1c) from [<c00416f8>] (update_process_times+0x34/0x70)
[  538.104000] [<c00416c4>] (update_process_times+0x0/0x70) from [<c001e95c>] (timer_tick+0xc8/0xf8)
[  538.104000]  r5 = 000411A8  r4 = C37BF6F4 
[  538.104000] [<c001e894>] (timer_tick+0x0/0xf8) from [<c002596c>] (ixp4xx_timer_interrupt+0x60/0xa4)
[  538.104000]  r6 = FFBF0FFF  r5 = 000411A8  r4 = C0201ABC 
[  538.104000] [<c002590c>] (ixp4xx_timer_interrupt+0x0/0xa4) from [<c0055c08>] (handle_IRQ_event+0x4c/0x8c)
[  538.104000]  r8 = 00000000  r7 = 00000005  r6 = C37BF6F4  r5 = 00000000
[  538.104000]  r4 = C01C0B80 
[  538.104000] [<c0055bbc>] (handle_IRQ_event+0x0/0x8c) from [<c00574c0>] (handle_level_irq+0x88/0x110)
[  538.104000]  r8 = 0000E26C  r7 = 0000001C  r6 = C37BF6F4  r5 = 00000005
[  538.104000]  r4 = C01BE140 
[  538.104000] [<c0057438>] (handle_level_irq+0x0/0x110) from [<c001b710>] (asm_do_IRQ+0x40/0x50)
[  538.104000]  r6 = 00000020  r5 = 0000001F  r4 = C01BE140 
[  538.104000] [<c001b6d0>] (asm_do_IRQ+0x0/0x50) from [<c001a9e4>] (__irq_svc+0x24/0x60)
[  538.104000]  r4 = FFFFFFFF 
[  538.104000] [<c0055bbc>] (handle_IRQ_event+0x0/0x8c) from [<c00574c0>] (handle_level_irq+0x88/0x110)
[  538.104000]  r8 = 0000E26C  r7 = 00010484  r6 = C37BF790  r5 = 0000001C
[  538.104000]  r4 = C01BE700 
[  538.104000] [<c0057438>] (handle_level_irq+0x0/0x110) from [<c001b710>] (asm_do_IRQ+0x40/0x50)
[  538.104000]  r6 = 10000020  r5 = 0000001F  r4 = C01BE700 
[  538.104000] [<c001b6d0>] (asm_do_IRQ+0x0/0x50) from [<c001a9e4>] (__irq_svc+0x24/0x60)
[  538.104000]  r4 = FFFFFFFF 
[  538.104000] [<c012faac>] (skb_copy_bits+0x0/0x21c) from [<c01318b0>] (skb_copy+0xbc/0x11c)
[  538.104000] [<c01317f4>] (skb_copy+0x0/0x11c) from [<bf0f6034>] (ieee80211_encap+0x434/0x14e4 [wlan])
[  538.104000]  r8 = C385B280  r7 = C307CB74  r6 = 00000001  r5 = 0000004C
[  538.104000]  r4 = 00000000 
[  538.104000] [<bf0f5c00>] (ieee80211_encap+0x0/0x14e4 [wlan]) from [<bf1401f8>] (ath_hardstart+0x5f4/0x1204 [ath_pc
i])
[  538.104000] [<bf13fc04>] (ath_hardstart+0x0/0x1204 [ath_pci]) from [<c0135dc8>] (dev_hard_start_xmit+0xd8/0x270)
[  538.104000] [<c0135cf0>] (dev_hard_start_xmit+0x0/0x270) from [<c0146034>] (__qdisc_run+0x7c/0x198)
[  538.104000]  r8 = C309685E  r7 = C3C2B0A0  r6 = C307CB74  r5 = 00000000
[  538.104000]  r4 = C34EE000 
[  538.104000] [<c0145fb8>] (__qdisc_run+0x0/0x198) from [<c0138238>] (dev_queue_xmit+0x148/0x240)
[  538.104000]  r7 = C307CBA4  r6 = C307CB74  r5 = 00000000  r4 = C34EE000
[  538.104000] [<c01380f0>] (dev_queue_xmit+0x0/0x240) from [<bf0f5458>] (ieee80211_parent_queue_xmit+0x54/0x5c [wlan
])
[  538.104000]  r6 = C307CB74  r5 = C385B280  r4 = C307CB74 
[  538.104000] [<bf0f5404>] (ieee80211_parent_queue_xmit+0x0/0x5c [wlan]) from [<bf0f55f0>] (ieee80211_hardstart+0x19
0/0x3cc [wlan])
[  538.104000]  r4 = C3EC7000 
[  538.104000] [<bf0f5460>] (ieee80211_hardstart+0x0/0x3cc [wlan]) from [<c0135dc8>] (dev_hard_start_xmit+0xd8/0x270)
[  538.104000]  r8 = 00000000  r7 = C385B000  r6 = C307CB74  r5 = C35C0E60
[  538.104000]  r4 = C385B000 
[  538.104000] [<c0135cf0>] (dev_hard_start_xmit+0x0/0x270) from [<c01382a8>] (dev_queue_xmit+0x1b8/0x240)
[  538.104000]  r8 = 00000000  r7 = 00000000  r6 = C307CB74  r5 = C35C0E60
[  538.104000]  r4 = C385B000 
[  538.104000] [<c01380f0>] (dev_queue_xmit+0x0/0x240) from [<c0154e1c>] (ip_output+0x104/0x210)
[  538.104000]  r6 = C362C7A0  r5 = C35C0E60  r4 = C307CB74 
[  538.104000] [<c0154d18>] (ip_output+0x0/0x210) from [<c01556ac>] (ip_queue_xmit+0x1cc/0x43c)
[  538.104000]  r7 = 00000000  r6 = C3E7B140  r5 = C307CB74  r4 = C309686C
[  538.104000] [<c01554e0>] (ip_queue_xmit+0x0/0x43c) from [<c0165dd8>] (tcp_transmit_skb+0x390/0x770)
[  538.104000] [<c0165a48>] (tcp_transmit_skb+0x0/0x770) from [<c0167a6c>] (__tcp_push_pending_frames+0x1f0/0x914)
[  538.104000] [<c016787c>] (__tcp_push_pending_frames+0x0/0x914) from [<c0164e48>] (tcp_rcv_established+0x440/0x788)
[  538.104000] [<c0164a08>] (tcp_rcv_established+0x0/0x788) from [<c016b5d4>] (tcp_v4_do_rcv+0xc8/0x318)
[  538.104000] [<c016b50c>] (tcp_v4_do_rcv+0x0/0x318) from [<c016dd90>] (tcp_v4_rcv+0x834/0x92c)
[  538.104000] [<c016d55c>] (tcp_v4_rcv+0x0/0x92c) from [<c01502d4>] (ip_local_deliver+0xc4/0x188)
[  538.104000] [<c0150210>] (ip_local_deliver+0x0/0x188) from [<c0150740>] (ip_rcv+0x298/0x468)
[  538.104000]  r8 = C385B000  r7 = C395F4B0  r6 = C020EE6C  r5 = C395F480
[  538.104000]  r4 = C362B044 
[  538.104000] [<c01504a8>] (ip_rcv+0x0/0x468) from [<c0135a44>] (netif_receive_skb+0x318/0x3e0)
[  538.104000]  r8 = C385B000  r7 = 00000800  r6 = C020DDEC  r5 = C01CFF50
[  538.104000]  r4 = C020DDD4 
[  538.104000] [<c013572c>] (netif_receive_skb+0x0/0x3e0) from [<c013765c>] (process_backlog+0xa4/0x184)
[  538.104000]  r8 = C020DB38  r7 = C020DB54  r6 = 00000000  r5 = C020DB3C
[  538.104000]  r4 = C385B000 
[  538.104000] [<c01375b8>] (process_backlog+0x0/0x184) from [<c0137944>] (net_rx_action+0xb4/0x1b0)
[  538.104000] [<c0137890>] (net_rx_action+0x0/0x1b0) from [<c003c528>] (__do_softirq+0x70/0xe0)
[  538.104000] [<c003c4b8>] (__do_softirq+0x0/0xe0) from [<c003c5e8>] (do_softirq+0x50/0x58)
[  538.104000] [<c003c598>] (do_softirq+0x0/0x58) from [<c003c814>] (local_bh_enable+0x80/0xac)
[  538.104000]  r4 = C37BE000 
[  538.104000] [<c003c794>] (local_bh_enable+0x0/0xac) from [<c012c738>] (release_sock+0x8c/0x90)
[  538.104000]  r4 = 00000000 
[  538.104000] [<c012c6ac>] (release_sock+0x0/0x90) from [<c015cd60>] (tcp_recvmsg+0x2f0/0x84c)
[  538.104000]  r6 = 00000000  r5 = 0D04C2CF  r4 = C3E7B194 
[  538.104000] [<c015ca70>] (tcp_recvmsg+0x0/0x84c) from [<c012c0dc>] (sock_common_recvmsg+0x48/0x5c)
[  538.104000] [<c012c094>] (sock_common_recvmsg+0x0/0x5c) from [<c0129924>] (do_sock_read+0xd0/0xd4)
[  538.104000]  r6 = C36A24A0  r5 = 00008000  r4 = 00000040 
[  538.104000] [<c0129854>] (do_sock_read+0x0/0xd4) from [<c012a17c>] (sock_aio_read+0x74/0x7c)
[  538.104000]  r7 = BEF1BAC0  r6 = 00008000  r5 = C3B16480  r4 = C37BFEC0
[  538.104000] [<c012a10c>] (sock_aio_read+0x4/0x7c) from [<c00748b4>] (do_sync_read+0xc4/0x110)
[  538.104000]  r4 = C37BFEC0 
[  538.104000] [<c00747f0>] (do_sync_read+0x0/0x110) from [<c00753f4>] (vfs_read+0x13c/0x144)
[  538.104000]  r8 = C37BE000  r7 = C37BFF80  r6 = 00008000  r5 = BEF1BAC0
[  538.104000]  r4 = C3B16480 
[  538.104000] [<c00752b8>] (vfs_read+0x0/0x144) from [<c0075874>] (sys_read+0x4c/0x84)
[  538.104000]  r8 = C001AF24  r7 = 00000003  r6 = 00000000  r5 = 00000000
[  538.104000]  r4 = C3B16480 
[  538.104000] [<c0075828>] (sys_read+0x0/0x84) from [<c001ad80>] (ret_fast_syscall+0x0/0x2c)
[  538.104000]  r6 = 00000001  r5 = 00000060  r4 = 00000004 

Attachments

ath-mib-irq-status-debug.diff (60.0 kB) - added by anonymous on 09/20/06 15:51:56.
ath-mib-irq-status-debug-2.diff (6.9 kB) - added by valins@soften.ktu.lt on 09/20/06 16:20:42.
Second try, please see attached patch.
ath-mib-irq-status-mini.diff (1.9 kB) - added by valins@soften.ktu.lt on 09/20/06 17:21:49.
Reset radio if ath_hal_mibevent() fails to reset HAL_INT_MIB flag. Simpler way to solve the soft-lock up problem.
traffic.pdf (2.0 MB) - added by mrenzmann on 09/25/06 10:24:51.
Traffic measurement results
ath-mib-irq-status-mini.2.diff (2.5 kB) - added by Žilvinas Valinskas <valins@soften.ktu.lt> on 09/25/06 17:01:00.
Signed-off-by: Žilvinas Valinskas <valins@soften.ktu.lt>

Change History

09/20/06 15:51:56 changed by anonymous

  • attachment ath-mib-irq-status-debug.diff added.

09/20/06 15:57:57 changed by mrenzmann

  • milestone deleted.

Can you please rediff your patch, avoiding the tons of whitespace-changes? Thanks.

09/20/06 16:07:04 changed by anonymous

Sorry about ath-mib-irq-status-debug.diff, it seems very noisy and big change, when in reality it is all due to trailing whitespaces. Regenerating patch again ...

It seems as if ath_hal_mibevent() reads MIB (counters?) and this should reset HAL_INT_MIB in interrupt status register, but appearently this doesn't. Perhaps, there are X such registers but there fewer registers read - as a result HAL_INT_MIB will not be cleared ? For now ath_reset() seems the only solution.

09/20/06 16:20:42 changed by valins@soften.ktu.lt

  • attachment ath-mib-irq-status-debug-2.diff added.

Second try, please see attached patch.

09/20/06 17:02:50 changed by anonymous

Some results, madwifi-ng under stress testing :

[  551.708000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001040
[  551.716000] ath_intr(): mibstatus: 0x1001040, HAL_INT_MIB is not cleared.
[  551.724000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[  551.732000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[  551.736000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[  551.744000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[  551.752000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[  551.760000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[  551.768000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[  551.772000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[  551.780000] count: 5, saved: 1001000, masked: 1000
[  551.788000] wifi0: hardware error; reseting
[ 1588.276000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 1588.284000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 1588.288000] ath_intr(): MIB retry count: 3, status: 0x1041, mibstatus: 0x1001000
[ 1588.296000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 1588.304000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 1588.312000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 1588.316000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 1588.324000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 1588.332000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 1588.340000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 1588.348000] count: 5, saved: 1001009, masked: 1001
[ 1588.352000] wifi0: hardware error; reseting
[ 2375.448000] ath_bmiss_tasklet
[ 2376.100000] ath_bmiss_tasklet
[ 2376.808000] ath_bmiss_tasklet
[ 2377.516000] ath_bmiss_tasklet
[ 2378.228000] ath_bmiss_tasklet
[ 2378.936000] ath_bmiss_tasklet
[ 2379.648000] ath_bmiss_tasklet
[ 2380.356000] ath_bmiss_tasklet
[ 2837.400000] ath_uapsd_processtriggers: Node (00:0b:6b:4d:53:0f) no longer U-APSD triggerable (65535) [ 3363.948000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3363.956000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3363.960000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 3363.968000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3363.976000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 3363.984000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3363.992000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3363.996000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3364.004000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3364.012000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3364.020000] count: 5, saved: 1001000, masked: 1000
[ 3364.024000] wifi0: hardware error; reseting
[ 3866.120000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3866.128000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3866.136000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3866.144000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3866.152000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3866.156000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3866.164000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 3866.172000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3866.180000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 3866.188000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 3866.192000] count: 5, saved: 1001000, masked: 1000
[ 3866.200000] wifi0: hardware error; reseting
[ 4782.132000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 4782.140000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 4782.144000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 4782.152000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 4782.160000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 4782.168000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 4782.176000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 4782.180000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 4782.188000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 4782.196000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 4782.204000] count: 5, saved: 1001000, masked: 1000
[ 4782.208000] wifi0: hardware error; reseting
[ 5566.752000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 5566.760000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.768000] ath_intr(): MIB retry count: 3, status: 0x1041, mibstatus: 0x1001000
[ 5566.776000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.780000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 5566.788000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.796000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 5566.804000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.812000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 5566.816000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 5566.824000] count: 5, saved: 1001009, masked: 1001
[ 5566.832000] wifi0: hardware error; reseting
[ 6357.772000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 6357.780000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.788000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 6357.796000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.800000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 6357.808000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.816000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 6357.824000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.832000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 6357.836000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 6357.844000] count: 5, saved: 1001000, masked: 1000
[ 6357.848000] wifi0: hardware error; reseting
[ 7146.224000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001040
[ 7146.232000] ath_intr(): mibstatus: 0x1001040, HAL_INT_MIB is not cleared.
[ 7146.240000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 7146.248000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 7146.256000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 7146.264000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 7146.268000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 7146.276000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 7146.284000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 7146.292000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 7146.296000] count: 5, saved: 1001000, masked: 1000
[ 7146.304000] wifi0: hardware error; reseting
[ 8417.564000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001040
[ 8417.572000] ath_intr(): mibstatus: 0x1001040, HAL_INT_MIB is not cleared.
[ 8417.580000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 8417.588000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 8417.596000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 8417.600000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 8417.608000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 8417.616000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 8417.624000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 8417.632000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 8417.636000] count: 5, saved: 1001000, masked: 1000
[ 8417.644000] wifi0: hardware error; reseting
[ 9205.008000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001009
[ 9205.016000] ath_intr(): mibstatus: 0x1001009, HAL_INT_MIB is not cleared.
[ 9205.024000] ath_intr(): MIB retry count: 3, status: 0x1041, mibstatus: 0x1001000
[ 9205.032000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9205.040000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 9205.044000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9205.052000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 9205.060000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9205.068000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 9205.076000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9205.080000] count: 5, saved: 1001000, masked: 1000
[ 9205.088000] wifi0: hardware error; reseting
[ 9990.852000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001040
[ 9990.860000] ath_intr(): mibstatus: 0x1001040, HAL_INT_MIB is not cleared.
[ 9990.868000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 9990.876000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9990.884000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 9990.888000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9990.896000] ath_intr(): MIB retry count: 3, status: 0x1000, mibstatus: 0x1001000
[ 9990.904000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9990.912000] ath_intr(): MIB retry count: 3, status: 0x1001, mibstatus: 0x1001000
[ 9990.920000] ath_intr(): mibstatus: 0x1001000, HAL_INT_MIB is not cleared.
[ 9990.924000] count: 5, saved: 1001009, masked: 1001
[ 9990.932000] wifi0: hardware error; reseting

Notice that interval between wifi0: hardware errro; resetting ... is about ~1000 seconds. Hmm, a coincidence.

09/20/06 17:21:49 changed by valins@soften.ktu.lt

  • attachment ath-mib-irq-status-mini.diff added.

Reset radio if ath_hal_mibevent() fails to reset HAL_INT_MIB flag. Simpler way to solve the soft-lock up problem.

09/22/06 09:47:40 changed by valins@soften.ktu.lt

Any progress with patches ? Anyone tried - I have a device that is running under stress test for two days already, no sign of lockup.

09/22/06 19:39:23 changed by mentor

Can you sign those patches off? DevDocs/SigningPatches

09/23/06 02:07:20 changed by mentor

Could you try with latest SVN as I notice your test platform is a Big Endian IXP425 on which some trouble was being had with PCI Memory Space Register access.

This looks like a very odd bug, as it is the HAL that is having the problem here, but htis bug hasn't been reported to my knowledge on other platforms (e.g. FreeBSD).

09/25/06 10:11:05 changed by valins@soften.ktu.lt

Patch was generated and tried on r1724. During tests I saw periodic h/w restarts (without them device would simply deadlock). For what is it worth, I can tell that I've noticed that this problem shows up when pushing h/w to the limits. The higher speeds data are reached the problem is triggered sooner.

Testing environment:

Two identical IXDP425 boards, with Wistron CM9 (atheros) radios.

[IXDP425]<--(A:30db)---- wire---->(A:10db)---[IXDP425]

Where (A:XX), is attenuator. RSSI at one end is always about 64, the other end 51. Tests results will be attached (huge, 2.9MB PDF file). There you can see throughput, RSSI changes during tests & etc).

On each device I have 'nepim' running (see: www.nongnu.org/nepim/). For scenarios are done:
            
         IXDP425 (AP)     IXDP425 (STA)     test mode        avg. throughput
1.        nepim (s)          nepim (c)         duplex            ~39Mbps
2.        nepim (c)          nepim (s)         duplex            ~39Mbps
3.        nepim (s)          nepic (c)         simplex           ~31Mbps
4.        nepim (c)          nepim (s)         simplex           ~30Mbps

This is an attempt to measure performance, stability in isolated environment. I think this is exactly the reason why such bug was never reported by other people around the Internet. Though I can tell you for sure that the same bug can be triggered in the same scenario over the air, only that it takes much much longed then on wire. Hope this explain a bit how I am reproducing 100% this problem.

09/25/06 10:16:11 changed by valins@soften.ktu.lt

It appears there are max 1MB attachment limit, results can be obtained from barclay.balt.net/~zilvinas/traffic.pdf.

09/25/06 10:24:51 changed by mrenzmann

  • attachment traffic.pdf added.

Traffic measurement results

09/25/06 10:25:34 changed by mrenzmann

Size limitation for attachments raised to 2500000 bytes, mentioned PDF attached to this ticket.

09/25/06 10:38:24 changed by valins@soften.ktu.lt

Thank you mrenzmann!

As for tests against the latest madwifi-ng r1734, kernel is crashing. See #914.

09/25/06 10:39:19 changed by valins@soften.ktu.lt

Sorry , I mean r1732 instead of r1734.

09/25/06 17:01:00 changed by Žilvinas Valinskas <valins@soften.ktu.lt>

  • attachment ath-mib-irq-status-mini.2.diff added.

Signed-off-by: Žilvinas Valinskas <valins@soften.ktu.lt>

10/11/06 16:58:14 changed by Žilvinas Valinskas <valins@soften.ktu.lt>

Hello,

what about fate of this particular patch ?

10/19/06 09:57:19 changed by Žilvinas Valinskas <valins@soften.ktu.lt>

I've just tried on x86 platform (WRAP.2c board). Linux kernel 2.6.18, madwifi-ng driver + ath-mib-irq-status-mini.2.diff patch. I am seeing the same messages :

[ 9461.544000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9461.564000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9461.588000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9461.612000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9461.628000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9461.652000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9461.672000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9463.748000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9463.772000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9463.792000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9463.828000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9463.852000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000
[ 9463.892000] ath_intr(): MIB retry count: 1, status: 0x1000, mibstatus: 0x1000000

So it seems that the same problem affects all architectures (I've tried on IXP425/BE first). As a result it is a question of time and combination of some factors will result in soft lockup.

07/31/07 19:21:44 changed by mentor

@valins: This patch looks horribly hacky to me.