Please note: This project is no longer active. The website is kept online for historic purposes only.

Ticket #1366 (new defect)

Opened 15 years ago

Last modified 14 years ago

TCP checksum errors on heavy traffic

Reported by: anonymous
Assigned to:
Priority: major
Milestone:
Component: madwifi: other
Version: v0.9.3
Keywords:
Cc:
Patch is attached: 0
Pending:

Description

When the AR5212 is operating in AP mode, TCP traffic is unreliable. After transferring several megabytes, TCP experiences checksum errors from which it cannot recover. It retries; sometimes a retry succeeds (resulting only in a glitch), but sometimes all retries are corrupted too, they are dropped, and the connection eventually times out. The corrupted packets always travel from the station to the AP.

The problem is far worse when the AP is receiving data than when it is sending. Using ssh, I am sometimes able to receive as much as 80 MB, but I can never receive a 1000 MB file. In the other direction, I can usually send a 17 GB file, but in some cases it still hangs. (I suspect this is because smaller packets are less likely to be corrupted.) The problem does not affect other TCP connections running at the same time; each one stalls individually.

I am also unable to connect to Samba running on the AP, and DHCP fails in about 50% of cases.

If I use a wire, there are no problems whatsoever. This means that the problem actually lies below the IP layer.

The client is an Intel Pro/Wireless 2200 in a Compaq nx7010 notebook running WinXP. (Unfortunately, I am unable to test against any other client.) I am using the latest Intel drivers.

Reducing the speed does not help; it merely takes longer. Enabling WEP/WPA/WPA2 does not change anything either. I also experimented with burst mode, channels, turbo, XR, and power, with no effect.

The AP is running Fedora Core 6 with madwifi-0.9.3 from Livna (I tried atrpms too). The kernel is 2.6.20-1.2948. The WLAN card is a TP-Link TL-WN651G. I have already made sure that my signal isn't too strong. I think that Wireshark detects the bad checksums reliably, because they correspond exactly to the glitches.

I am unfamiliar with the debug output of ath_pci. However, I noticed that there are no more traces of ath_tx_start and ath_tx_txqaddbuf once the connection stalls. For example, when everything is working, this is the typical output:

ath_tx_start: skb0 c6f31ec0 [data c6f39e94 len 76] skbaddr 6f39e94
FRDS 00:19:e0:83:0d:dd->00:0e:35:78:10:f2(00:19:e0:83:0d:dd) data QoS
[TID 0] 54M

88 02 2c 00 00 0e 35 78  10 f2 00 19 e0 83 0d dd
...

ath_tx_start: Q1: (ds)c6eb2220 (lk)00000000 (d)06f39e94 (c0)4124004e
(c1)0600804c 03328000 00006d8c
ath_tx_txqaddbuf: TXDP[1] = 6eb2220 (c6eb2220)
ath_tx_start: skb0 c6f3a3c0 [data c2b6a8b4 len 112] skbaddr 2b6a8b4
FRDS 00:19:e0:83:0d:dd->00:0e:35:78:10:f2(00:19:e0:83:0d:dd) data QoS
[TID 0] 54M

88 02 2c 00 00 0e 35 78  10 f2 00 19 e0 83 0d dd
...

ath_tx_start: Q1: (ds)c6eb2280 (lk)00000000 (d)02b6a8b4 (c0)41240072
(c1)06008070 03328000 00006d8c
ath_tx_txqaddbuf: link[1] (c6eb2220)=6eb2280 (c6eb2280)
TODS 00:0e:35:78:10:f2->00:19:e0:83:0d:dd(00:19:e0:83:0d:dd) data QoS
[TID 0] 5M +40

But when it stops working it looks like this:

TODS 00:0e:35:78:10:f2->00:19:e0:83:0d:dd(00:19:e0:83:0d:dd) data 54M +58

48 01 2c 00 00 19 e0 83  0d dd 00 0e 35 78 10 f2
00 19 e0 83 0d dd 90 e0

TODS 00:0e:35:78:10:f2->00:19:e0:83:0d:dd(00:19:e0:83:0d:dd) data 54M +53

48 11 2c 00 00 19 e0 83  0d dd 00 0e 35 78 10 f2
00 19 e0 83 0d dd a0 e0

TODS 00:0e:35:78:10:f2->00:19:e0:83:0d:dd(00:19:e0:83:0d:dd) data 54M +55
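The first two bytes of each hex dump above are the 802.11 Frame Control field, so the dumps can be decoded to see which kinds of frames remain once the connection stalls. The following is a minimal Python sketch, assuming the dumps start at the MAC header as they appear to. The function name decode_frame_control and the abbreviated subtype table are illustrative only and are not part of madwifi.

```python
# Decode the 2-byte 802.11 Frame Control field at the start of a hex dump.
# Assumption: the dump begins at the MAC header (as in the ath_pci output above).

TYPES = {0: "management", 1: "control", 2: "data"}

# Abbreviated table: only a few data subtypes are named here.
DATA_SUBTYPES = {
    0: "Data",
    4: "Null (no data)",
    8: "QoS Data",
    12: "QoS Null (no data)",
}

# Flag names for bits 0..7 of the second Frame Control byte.
FLAGS = ["ToDS", "FromDS", "MoreFrag", "Retry",
         "PwrMgt", "MoreData", "Protected", "Order"]


def decode_frame_control(hex_dump):
    """Return version, type, subtype and flags for the first two bytes."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    if len(raw) < 2:
        raise ValueError("need at least two bytes")
    fc0, fc1 = raw[0], raw[1]
    ftype = (fc0 >> 2) & 0x3
    subtype = (fc0 >> 4) & 0xF
    if ftype == 2:
        subtype_name = DATA_SUBTYPES.get(subtype, "data subtype %d" % subtype)
    else:
        subtype_name = "subtype %d" % subtype
    return {
        "version": fc0 & 0x3,
        "type": TYPES.get(ftype, "reserved"),
        "subtype": subtype_name,
        "flags": [name for bit, name in enumerate(FLAGS) if fc1 & (1 << bit)],
    }


for dump in ("88 02 2c 00", "48 01 2c 00", "48 11 2c 00"):
    print(dump, decode_frame_control(dump))
```

Under these assumptions, 88 02 decodes to a QoS Data frame with FromDS set, while 48 01 and 48 11 decode to Null (no data) frames with ToDS set, the second one also carrying the power management bit. That is only a reading of three dumps, not a diagnosis.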

It would be most helpful if somebody could try to reproduce this and post their results. I can post more information if needed.

This may be related to #1188 (the first comment is mine).

Change History

07/02/07 07:59:12 changed by anonymous

After observing this for some time, I'd like to add an additional note.

The "freezing" is most noticeable when connecting to Samba. Over the wireless connection, internet browsing, Skype, etc. work for several hours, but Samba fails immediately every time: I cannot copy a single file over SMB. (Triple-checked: it is not a problem with Samba or the firewall.) As mentioned, hangs are per-connection; when one connection hangs, others are not affected. So my "best guess" is that some specific data can trigger a checksum bug. Am I right that checksumming is offloaded to hardware?

Version 0.9.3.1 exhibits the same problem as 0.9.3.

It really puzzles me - am I the only one seeing this? I'd be glad to assist with additional testing if necessary.

07/26/07 06:10:41 changed by dyqith

A few things:

Have you tried the latest svn version?

Can you post the output produced when the madwifi driver is loaded? (Either dmesg or the console output from modprobe should show it.)

Also, the output of /proc/interrupts for ath_pci/madwifi?

Can you also do a scan of the environment? (iwlist ath0 scan)

The output of "iwconfig" and "ifconfig" would be good too.

Maybe also the script/commands used to set up the AP?

So the question is: is the problem at the sender (station) or the receiver (AP)?
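The outputs requested above could be gathered in one pass. The following is a rough Python sketch, assuming the wireless interface is named ath0 as in the scan command above. The helper names run and collect are hypothetical, and a command that is missing or fails is recorded rather than aborting the run.

```python
import subprocess

# Commands taken from the requests above; "ath0" is assumed to be the
# madwifi interface name.
COMMANDS = [
    ["dmesg"],
    ["cat", "/proc/interrupts"],
    ["iwlist", "ath0", "scan"],
    ["iwconfig"],
    ["ifconfig"],
]


def run(cmd):
    """Run one command and return its combined stdout and stderr."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "<failed: %s>\n" % exc


def collect(commands=COMMANDS, runner=run):
    """Return one text report with each command line followed by its output."""
    parts = []
    for cmd in commands:
        parts.append("$ " + " ".join(cmd) + "\n" + runner(cmd))
    return "\n".join(parts)
```

Calling print(collect()) as root on the AP would print a single report suitable for attaching to the ticket. The scripts used to set up the AP still have to be added by hand.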

07/26/07 14:35:08 changed by mentor

  • priority changed from blocker to major.

07/28/07 13:49:28 changed by anonymous

Will do that and post the results. However, I would like to emphasize again what I consider to be characteristic: while one TCP connection stalls, other parallel connections between the same two machines are unaffected. As I understand it, this means that there is no general failure of operation, just wrong TCP checksumming triggered by an unknown cause, perhaps the data itself, some unexpected multithreading effect, or the like. Therefore I will focus my research on what the TCP checksums really should be, and whether they are being transmitted and received.
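To check what a TCP checksum "really should be" for a captured segment, the standard Internet checksum can be recomputed by hand. The following is a minimal Python sketch for IPv4, assuming the raw TCP segment (header plus payload) has been exported from a capture. The function names are illustrative, and this is the generic algorithm, not code from madwifi or the kernel.

```python
import socket
import struct


def internet_checksum(data):
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum_ipv4(src_ip, dst_ip, segment):
    """Checksum over the IPv4 pseudo-header plus the TCP segment.

    With the checksum field zeroed, this returns the value that belongs
    in the field. With the received checksum left in place, a result of
    0 means the segment verifies.
    """
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment)))
    return internet_checksum(pseudo + segment)
```

Running tcp_checksum_ipv4 on a segment exactly as captured, and getting a nonzero result, would confirm that the segment was already corrupted when it reached the IP stack, independently of what the capture tool reports.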

This problem should be easily reproducible. I would encourage somebody to try transferring a random 20 GB file in both directions with hardware different from mine. Excluding hardware causes, and perhaps the kernel itself, would help a lot in debugging.
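For anyone willing to try, a random test file can be produced and verified along the following lines. This is a sketch in Python, with hypothetical helper names write_random_file and sha256_of_file. It writes the file in 1 MiB blocks so the whole file never has to fit in memory, and returns a SHA-256 digest that can be compared on the receiving side.

```python
import hashlib
import os

CHUNK = 1 << 20  # 1 MiB per read/write


def write_random_file(path, size_bytes):
    """Write size_bytes of random data to path and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            f.write(block)
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def sha256_of_file(path):
    """Return the SHA-256 hex digest of an existing file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()
```

For example, write_random_file("test.bin", 20 * 1000**3) on the sender and sha256_of_file("test.bin") on the receiver after the transfer. If the transfer completes, matching digests confirm that the data arrived intact. A stalled transfer reproduces the problem described in this ticket.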