Please note: This project is no longer active. The website is kept online for historic purposes only.
If you´re looking for a Linux driver for your Atheros WLAN device, you should continue here .

Ticket #1325 (closed defect: fixed)

Opened 12 years ago

Last modified 11 years ago

crash in ath_hal with HAL 0.9.30.13 on SMP machine

Reported by: strasak@bubakov.net Assigned to:
Priority: major Milestone: version 0.9.5
Component: madwifi: driver Version: trunk
Keywords: Cc:
Patch is attached: 0 Pending:

Description

During extensive testing of madwifi-hal-0.9.30.13 branch - svn 2323 - i have found following problem:

when i put a load on system - AMD Athlon64 dual core, SMP enabled, 32bit, equiped with 4x atheros minipci cards, 3x intel gbit PCIe ethernet, 1x onboard nvidia gbit ethernet and 3x prism2 pci cards - it sometimes lock up and only nmi watchdog force it to reboot - log from serial console follows ...

BUG: NMI Watchdog detected LOCKUP on CPU0, eip f8b74f1d, registers:
CPU:    0
EIP:    0060:[<f8b74f1d>]    Tainted: P       VLI
EFLAGS: 00000202   (2.6.21.1-ng-ring3-s2s-sqsh-l7-swan-imq-esfq-smp-kvm-amd-dnd #2)
EIP is at zz05b8c65a+0xa9/0xe4 [ath_hal]
eax: 00000000   ebx: f662b208   ecx: 00bfff38   edx: 00000000
esi: f6628000   edi: f6628000   ebp: f6628000   esp: c0471d78
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process swapper (pid: 0, ti=c0470000 task=c043bd00 task.ti=c0470000)
Stack: 00000000 f6998480 00000000 f8aa3169 <7>wifi4: STA 00:60:b3:64:c2:7a TX rate lowered to 10
       00000000 00001000 f6547640 00000000 00000013 00000000 c01453d7 c0466f80 
       00000000 00000013 c0466fac c01466bc c0471f38 00000013 c0466f80 0000006c 
Call Trace:
 [<f8aa3169>] ath_intr+0x1a9/0x330 [ath_pci]
 [<c01453d7>] handle_IRQ_event+0x27/0x60
 [<c01466bc>] handle_fasteoi_irq+0x5c/0xb0
 [<c0106829>] do_IRQ+0x39/0x80
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<f8b2104b>] hfa384x_from_bap+0x4b/0x80 [hostap_pci]
 [<f8b23cd2>] prism2_rx+0x212/0x330 [hostap_pci]
 [<f8b24cae>] hostap_bap_tasklet+0xbe/0xd0 [hostap_pci]
 [<c0124b7c>] tasklet_action+0x6c/0xd0
 [<c012481d>] __do_softirq+0xcd/0xf0
 [<c0124891>] do_softirq+0x51/0x60
 [<c012493d>] irq_exit+0x6d/0x70
 [<c010682e>] do_IRQ+0x3e/0x80
 [<c0112be0>] ack_ioapic_quirk_irq+0x40/0x90
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<c02e92b7>] acpi_pm_read+0x7/0x10
 [<c01283e7>] getnstimeofday+0x37/0x110
 [<c0135c7f>] ktime_get_ts+0x1f/0x60
 [<c0135c0e>] ktime_get+0xe/0x30
 [<c013ae6e>] tick_nohz_restart_sched_tick+0x2e/0x140
 [<c0102120>] cpu_idle+0x80/0x90
 [<c0476d4c>] start_kernel+0x1dc/0x250
 [<c0476640>] unknown_bootoption+0x0/0x1a0
 =======================
Code: 61 8b 9f 24 2f 00 00 2b 53 48 89 d0 2b 43 40 01 87 70 2b 00 00 89 53 40 89 ca 2b 53 4c 89 d0 2b 43 44 01 87 74 2b 00 00 89 53 44 <8b> 43 40 3b 43 18 76 0f 83 ec 0c 56 e8 5a fb ff ff 83 c4 10 8d 
Linux version 2.6.21.1-ng-ring3-s2s-sqsh-l7-swan-imq-esfq-smp-kvm-amd-dnd 

I will retest it on latest svn revision of new HAL branch and report again.
BTW, it could be related to for example #1239

Change History

05/21/07 20:54:07 changed by mentor

  • version changed from madwifi-hal-0.9.30.13 branch to trunk.

Retargetting

05/25/07 22:13:08 changed by strasak@bubakov.net

it is there on latest trunk too. What is more interesting, i haven't been able to reproduce it for two days, when machine has been under high load, then i turned it off, and when turned it on lately, i forgot to run testing script. And so i know now, that when machine is idle, this bug is triggerable more easily - it is matter of 10-30 minutes till it shows up.

BUG: NMI Watchdog detected LOCKUP on CPU1, eip f8b74a9e, registers:
CPU:    1
EIP:    0060:[<f8b74a9e>]    Tainted: P       VLI
EFLAGS: 00000206   (2.6.21.1-ng-ring3-s2s-sqsh-l7-swan-imq-esfq-smp-kvm-amd-dnd #2)
EIP is at zz005b60b2+0x60e/0x800 [ath_hal]
eax: 00000003   ebx: f624b03c   ecx: 00bfff38   edx: 00000000
esi: f6248000   edi: f6248bcc   ebp: f6248000   esp: f7f83e20
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process swapper (pid: 0, ti=f7f82000 task=f7fd5530 task.ti=f7f82000)
Stack: f624b03c f6248000 f6248000 f8b74f2e f6248000 f624ab84 f6248000 00000000 
       00000000 f6262480 00000000 f8aa3159 f6248000 f6263bf8 f7f83e84 013b5cb2 
       00000000 00001000 f6187240 00000000 00000010 00000000 c01453d7 c0466e00 
Call Trace:
 [<f8b74f2e>] zz05b8c65a+0xba/0xe4 [ath_hal]
 [<f8aa3159>] ath_intr+0x1a9/0x330 [ath_pci]
 [<c01453d7>] handle_IRQ_event+0x27/0x60
 [<c01466bc>] handle_fasteoi_irq+0x5c/0xb0
 [<c0106829>] do_IRQ+0x39/0x80
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<c01247ab>] __do_softirq+0x5b/0xf0
 [<c0124891>] do_softirq+0x51/0x60
 [<c012493d>] irq_exit+0x6d/0x70
 [<c010fe0a>] smp_apic_timer_interrupt+0x2a/0x40
 [<c0104c58>] apic_timer_interrupt+0x28/0x30
 [<c037c618>] schedule+0x398/0x670
 [<c0102125>] cpu_idle+0x85/0x90
 =======================
Code: 10 c7 43 40 00 00 00 00 c7 43 44 00 00 00 00 83 c4 04 5b 5e c3 57 56 53 8b 74 24 10 8b be 34 11 00 00 8b 86 18 2f 00 00 83 e0 03 <83> f8 03 0f 85 2d 01 00 00 8b 9e 24 2f 00 00 80 7b 0d 03 77 19 

05/25/07 22:18:28 changed by mtaylor

Please recompile with the HAL_DEBUG flag enabled, and magic-sysrq available in your kernel. Try to use the sysrq trigger to get a back-trace of the threads with this combination.

05/26/07 22:18:06 changed by strasak@bubakov.net

mtaylor, i have did as you've said, but in time when box is locked, sysrq doesn't work at all, machine cannot be rebooted or whatever with it, only what is able to reboot machine is NMI watchdog it seems. Here is trace with HAL_DEBUG enabled :

BUG: NMI Watchdog detected LOCKUP on CPU1, eip f8b7cd31, registers:
CPU:    1
EIP:    0060:[<f8b7cd31>]    Tainted: P       VLI
EFLAGS: 00000246   (2.6.21.3-ng-ring3-s2s-sqsh-l7-swan-imq-esfq-smp-kvm-amd-dnd #3)
EIP is at zz005b60b2+0x61d/0x800 [ath_hal]
eax: 00000003   ebx: f5993098   ecx: 00bfff38   edx: 00000000
esi: f5990000   edi: f5990c08   ebp: f5990000   esp: f7f83d90
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process swapper (pid: 0, ti=f7f82000 task=f7fd5530 task.ti=f7f82000)
Stack: f5993098 f5990000 f5990000 f8b7d1b2 f5990000 f5992b84 f5990000 00000000 
       00000000 f58c8480 00000000 f8b668e0 f5990000 f58c9bf8 f8aa2da5 00000000 
       00000000 f7f83e00 0177a890 00000000 00001000 f59e8620 00000000 00000012 
Call Trace:
 [<f8b7d1b2>] zz05b8c65a+0xba/0xe4 [ath_hal]
 [<f8b668e0>] ath_hal_mibevent+0x10/0x20 [ath_hal]
 [<f8aa2da5>] ath_intr+0x185/0x2e0 [ath_pci]
 [<c01454d7>] handle_IRQ_event+0x27/0x60
 [<c01467bc>] handle_fasteoi_irq+0x5c/0xb0
 [<c0106829>] do_IRQ+0x39/0x80
 [<c01467dc>] handle_fasteoi_irq+0x7c/0xb0
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<f8b75edc>] zz0da484a1+0x20/0x174 [ath_hal]
 [<f8b66430>] ath_hal_getisr+0x10/0x20 [ath_hal]
 [<f8aa2cd2>] ath_intr+0xb2/0x2e0 [ath_pci]
 [<c01454d7>] handle_IRQ_event+0x27/0x60
 [<c01467bc>] handle_fasteoi_irq+0x5c/0xb0
 [<c0106829>] do_IRQ+0x39/0x80
 [<c010682e>] do_IRQ+0x3e/0x80
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<c02e9437>] acpi_pm_read+0x7/0x10
 [<c01283e7>] getnstimeofday+0x37/0x110
 [<c0135c7f>] ktime_get_ts+0x1f/0x60
 [<c0135c0e>] ktime_get+0xe/0x30
 [<c013af3e>] tick_nohz_restart_sched_tick+0x2e/0x140
 [<c0102120>] cpu_idle+0x80/0x90
 =======================
Code: 83 c4 04 5b 5e c3 57 56 53 8b 74 24 10 8b be 34 11 00 00 8b 86 18 2f 00 00 83 e0 03 83 f8 03 0f 85 2d 01 00 00 8b 9e 24 2f 00 00 <80> 7b 0d 03 77 19 83 ec 04 0f b6 43 0d 40 50 6a 01 56 e8 cc f9 

05/27/07 01:20:31 changed by mentor

I've just fixed HAL_DEBUG a bit more, so it may be worth retrying...

05/27/07 13:46:10 changed by strasak@bubakov.net

with 2380 it hasn't crashed yet ... but, i forgot that last two traces - not the first one - are from madwifi patched with nbd's napi support patch. So, it probably could be responsible for higher frequency of bug appearances, because with svn 2323 and not patched madwifi it appeared once per day or two days, with napi patch it appeared once per a couple of hours and when machine has been idle once or twice per hour or even more frequently. Now, without napi patch and latest svn, it hasn't appeared, yet - i am testing it with random intervals of load and idling, also, script reboots it occasionaly.

BTW, with napi patch i have got it on similar machine even on non-smp kernel, but hasn't got it ever on the same similar machine for days on 2.6.16 kernel with very very old svn revision - r1686 - nor on another kernels with other older - 0.9.18.0 HAL.

So probably new HAL is responsible or some locking problem introduced recently, and napi patch exposes it more? I will report more as soon as it crash :)

05/27/07 15:08:04 changed by strasak@bubakov.net

Here we go again:

BUG: NMI Watchdog detected LOCKUP on CPU0, eip f8b761b1, registers:
CPU:    0
EIP:    0060:[<f8b761b1>]    Tainted: P       VLI
EFLAGS: 00000202   (2.6.21.3-ng-ring3-s2s-sqsh-l7-swan-imq-esfq-smp-kvm-amd-dnd #3)
EIP is at zz02dc1763+0x65/0x108 [ath_hal]
eax: f6638000   ebx: f6638000   ecx: c0001071   edx: 000010b0
esi: f6638000   edi: 00000000   ebp: 00000000   esp: f7fbbd68
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process klogd (pid: 995, ti=f7fba000 task=f7de20b0 task.ti=f7fba000)
Stack: 00000000 f7ac6480 00000000 f6638000 f8b66500 f6638000 c0001071 f8aa2c9e 
       00000001 00000010 c0467e00 f65f0480 00000000 00001000 f72717c0 00000000 
       00000010 00000000 c01454d7 c0467e00 00000000 00000010 c0467e2c c01467bc 
Call Trace:
 [<f8b66500>] ath_hal_intrset+0x10/0x20 [ath_hal]
 [<f8aa2c9e>] ath_intr+0x7e/0x2e0 [ath_pci]
 [<c01454d7>] handle_IRQ_event+0x27/0x60
 [<c01467bc>] handle_fasteoi_irq+0x5c/0xb0
 [<c0106829>] do_IRQ+0x39/0x80
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<f8b2104b>] hfa384x_from_bap+0x4b/0x80 [hostap_pci]
 [<f8b23cd2>] prism2_rx+0x212/0x330 [hostap_pci]
 [<f8b24cae>] hostap_bap_tasklet+0xbe/0xd0 [hostap_pci]
 [<c0124b7c>] tasklet_action+0x6c/0xd0
 [<c012481d>] __do_softirq+0xcd/0xf0
 [<c0124891>] do_softirq+0x51/0x60
 [<c012493d>] irq_exit+0x6d/0x70
 [<c010682e>] do_IRQ+0x3e/0x80
 [<f8b23eed>] hostap_rx_skb+0xfd/0x140 [hostap_pci]
 [<c0104b9b>] common_interrupt+0x23/0x28
 [<c02e9437>] acpi_pm_read+0x7/0x10
 [<c01284f6>] do_gettimeofday+0x36/0x130
 [<c0123b92>] sys_time+0x12/0x40
 [<c01041c8>] syscall_call+0x7/0xb
 =======================
Code: 83 bb cc 2b 00 00 00 74 03 83 ca 40 83 bb d0 2b 00 00 00 74 03 80 ce 01 83 bb d4 2b 00 00 00 74 03 80 ca 80 83 bb d8 2b 00 00 00 <74> 05 80 ce 04 89 f6 f6 c1 01 74 03 83 ca 07 f7 c1 00 00 e0 02 

05/29/07 23:55:46 changed by strasak@bubakov.net

news - testing for about a day madwifi release - 0.9.3.1 + the same SMP kernel as above + the exactly same HW and no crash yet, well, will have to let it run for another day or two to be sure, but it seems ATM that problem is caused /or exposed/- at least on this HW - by latest HAL, 0.9.18.0 seems to be stable, well, we will see

05/30/07 14:13:38 changed by strasak@bubakov.net

After another about 14 hours of running under load, idling, rebooting itself etc. box seems to be rock solid. So, probability of fact, that HAL 0.9.30.13 is prone to problems on SMP, is increasing. I will let it run for another day or two, then i will try to use current trunk + this older - 0.9.18.0 - HAL on the same machine, to be sure it is really in HAL - or in HAL <-> this machine setup.
Anyone, if you have SMP systems handy, please test latest trunk there too, i will too test it on another SMP box with other chipset soon, to izolate the bug more.

06/11/07 10:54:56 changed by strasak@bubakov.net

i think i could say that 0.9.3.1 is stable on my SMP system - tested two of them, one described above, another one with 8 Atheros minipci cards. So problem could be in HAL, or more likely new HAL - maybe due to some optimalizations or timing changes inside? - exposed some lurking bug - in locking? - in driver code. I will report more after testing trunk again on these machines with kernel with lockdep and other debugging options enable and of course with HAL debugging.

Another interesting point is that on the same - same configuration, not exactly the same machine but exactly the same components and settings and kernel and blah - SMP machine non-SMP kernel crashed twice in 7 days too in similar way, so theory that new HAL exposed some bug - and SMP exposed it even further - sounds even more realistic. Will report more as soon as will have something to report.

06/21/07 23:05:48 changed by mtaylor

  • status changed from new to closed.
  • resolution set to fixed.

We have added locking around HAL calls in trunk. This issue should be fixed in trunk, even with the latest HAL.

06/24/07 17:30:00 changed by strasak@bubakov.net

reopening ticket just to provide feedback, so please feel free to close it again. After doing some tests, system seems to be stable on 2505 on smp kernel, no crash for about two days. Long term real-life testing will show more, but ATM bug seems to be fixed, thx guys.

06/27/07 12:17:37 changed by mrenzmann

  • milestone set to version 0.9.4.

02/11/08 06:18:21 changed by mrenzmann

  • milestone changed from version 0.9.4 to version 0.9.5.