Monday, March 31, 2014

Meraki Sparky boards, and constant resetting

There's a Mesh internet project at Sudo Room and they've been doing some great work getting a platform up and running. However, like a lot of volunteer projects, they're working with whatever time and equipment they've been donated.

A few months ago they were donated a few hundred Meraki Sparky boards. They're an Atheros AR2317 SoC based device with an integrated 2GHz 802.11bg radio, 10/100 ethernet and.. well, a hardware watchdog that resets the board after five minutes.

Now, annoyingly, this reset occurs inside of Redboot too - which precludes them from being (fully) flashed before the unit reboots. Once the unit was flashed with OpenWRT, the unit still reboots every five minutes.

So, I started down the path of trying to debug this.

What did I know?

Firstly, the AR2317 watchdog doesn't have a way of resetting things itself - instead, all it can do is post an interrupt. The AR7161 and later SoCs do indeed have a way to do a full hardware reset if the watchdog is tickled.

Secondly, redboot has a few tricksy ways to manipulate the hardware:

  • 'x' can examine registers. Since we need them in KSEG1 (unmapped, uncached) then the reset registers (0x11000xxx becomes 0xb1000xxx.) Since its hardware access, we should do them as DWORDS and not bytes.
  • 'mfill' can be used to write to registers.
Thirdly, there's an Atheros specific command - bdshow - which is surprisingly informative:

RedBoot> bdshow
name:     Meraki Outdoor 1.0
magic:    35333131
cksum:    2a1b
rev:      10
major:    1
minor:    0
pciid:    0013
wlan0:    yes 00:18:0a:50:7b:ae
wlan1:    no  00:00:00:00:00:00
enet0:    yes 00:18:0a:50:7b:ae
enet1:    no  00:00:00:00:00:00
uart0:    yes
sysled:   no, gpio 0
factory:  no, gpio 0
serclk:   internal
cpufreq:  calculated 184000000 Hz
sysfreq:  calculated 92000000 Hz
memcap:   disabled
watchdg:  disabled (WARNING: for debugging only!)

serialNo: Q2AJYS5XMYZ8
Watchdog Gpio pin: 6
secret number: e2f019a200ee517e30ded15cdbd27ba72f9e30c8


.. hm. Watchdog GPIO pin 6? What's that?

Next, I tried manually manipulating the watchdog registers but nothing actually happened.

Then I wondered - what about manipulating the GPIO registers? Maybe there's a hardware reset circuit hooked up to GPIO 6 that needs to be toggled to keep the board from resetting.

Board: ap61
RAM: 0x80000000-0x82000000, [0x8003ddd0-0x80fe1000] available
FLASH: 0xa8000000 - 0xa87e0000, 128 blocks of 0x00010000 bytes each.
== Executing boot script in 2.000 seconds - enter ^C to abort
^C
RedBoot> # set direction of gpio6 to out
RedBoot> mfill -b 0xb1000098 -l 4 -p 0x00000043
RedBoot> x -b 0xb1000098
B1000098: 00 00 00 43 00 00 00 00  00 00 00 00 00 00 00 03  |...C............|
B10000A8: FF EF F7 B9 7D DF 5F FF  00 00 00 00 00 00 00 00  |....}._.........|

RedBoot> # pat gpio6 - set it high, then low.
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002

.. then I manually did this every minute or so.

RedBoot>
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002

.. so, the solution here seems to be to "set gpio6 to be output", then "pat it every 60 seconds."

I hope this helps people bring OpenWRT up on this board finally. There seems to be a few of them out there!

Thursday, March 20, 2014

Adding chipset powersave support to FreeBSD's Atheros driver

I've started adding some basic powersave support to the FreeBSD Atheros ath(4) driver. The NICs support putting parts of the device to sleep to conserve power but.. well, it's tricky.

In order to make things consistent, I either need to not do things when the NIC is asleep (for example, doing calibration when the NIC isn't running), but I also need to ensure that I force the NIC awake when the NIC may be asleep. During normal running, the NIC may have put itself into temporary sleep whilst waiting for some packets from the AP to signal that it needs to wake up. So I will also need to force the NIC awake before programming it.

So, before I start down the path of handling the whole dynamic power management stuff, I figured I'd tackle the initial bits - handling powering on the NIC at startup and powering it off when it's not in use. This includes powering it down during device detach and suspend, as well as when all of the VAPs are down.

This is turning out to be slightly more complicated than I'd like it to be.

The first really stupid thing I found was that during the interface down process, the VAP state change from RUN -> INIT would reset the BSS, which included re-programming the slot time. So, I have to wake up the hardware when programming that. It can then go back to sleep when I'm done with it.

Now there's some issues in the suspend path with the NIC being marked as asleep when it is being reset, which is confusing - the NIC should be woken up when ath_reset() is called. So, I'll have to debug these.

The really annoying bit is that if I read a register whilst the silicon is asleep, the reads return 0xDEADBEEF. So if I am storing the register contents anywhere, I'll end up storing and programming a potentially totally invalid value.

There's also some real problems with race conditions. I can put the power state changes behind a lock, but imagine something like this:

* ATH_LOCK; force awake; do something; ATH_UNLOCK .. ATH LOCK; do some more; put back to sleep; ATH_UNLOCK

Now, if a second thread puts the NIC back to sleep in between those two lock sections, the second "do some more" work may occur once the NIC was put to sleep by said second thread. So I have to correctly track if the NIC is being forced awake by refcounting how many times its being forced awake, then when the refcount hits zero and we can put it to sleep, put it back to sleep.

Once this is all done, I can start down the path of supporting proper network sleep - where the NIC stays asleep and wakes up to listen for beacons and received frames from the AP. I then choose to force the NIC awake and do more work. I have to make absolute sure that I don't queue things like transmitted frames or add more frames to the receive queue if it may fall asleep. There's also some mechanisms to have a transmit frame put the NIC to sleep - there's a bit that says "when this frame is transmitted, transition the NIC back to sleep." I have to go and figure out how that works and implement that.

But for now, let's keep it simple and debug just putting the NIC to sleep when it's not in use.

Monday, March 10, 2014

Porting over the AR8327 support

It's been a while since I posted. I'll post about why that is at some point but for now I figure it's time I wrote up the latest little side project - the Atheros AR8327 switch support.

The AR8327 switch is like the previous generation Atheros switches except for a couple of very specific and annoying differences - the register layouts and locations have changed. So it's not just a case of pretending it's an AR8316 except for the hardware setup - there's some significant surgery to do. And no, I did try just ignoring all of that - the switch doesn't come up and pass packets.

So, the first thing was to survey the damage.

The Linux driver (ar8216.c) has a bunch of abstractions that the FreeBSD driver doesn't have, so that's a good starting point. The VLAN operations and VLAN port configuration stuff is all methods in the Linux driver, so that was a good starting point. I stubbed most of the VLAN stuff out (because I really didn't want it to get in the way) - this turned out to be more annoying than I wanted.

Next was the hardware setup path. There's more configurable stuff with the AR8327 - there's two physical ports that I can configure the PHY/MAC parameters on for either external or internal connectivity. I just took the code from Linux (which yes, I have permission to relicence under BSD, thanks to the driver authors!) and I made it use the defaults from OpenWRT for the DB120. The ports didn't properly come up.

I then realised that I was reading total garbage from the PHY register space, so I went looking at the datasheet and ar8216 driver for some inspiration. Sure enough, the AR8327 has the PHY MDIO bus registers in different locations. So after patching the arswitch PHY routines with this knowledge, the PHYs were probed and attached fine. Great. But it still didn't detect port status changes.

So, back to the ar8216 driver. It turns out that there were a few things that weren't methodized - and these were the bits that read the PHY status from the switch. Both drivers didn't just poll the PHYs directly - they read the switch registers which had a summary of the port status. So, I taught the driver about this and voila! Port status changes worked.

But, no traffic.

Well, there's a few reasons for this. It's a switch, so I don't have to setup anything terribly difficult. The trick here is to enable port learning and make sure they're all in the same VLAN group. Now, here's where I screwed up and I found a bug that needed working around.

The port setup code did enable learning and put things into a vlan group.

Firstly, I found this odd behaviour that I got traffic only when I switched the ethernet cable to another port. Then learning worked fine. I then found that the ar8216 driver actually triggers a forwarding table flush upon port status change, so I added that. This fixed that behaviour.

But then it was flooding traffic to all ports. This is kinda stupid. What did I screw up? I put each port in a separate vlangroup, rather than put them in the same vlangroup. Then, I programmed the "which ports can you see?" to include all the other ports. What this meant was:
  • The forwarding table (ie, what addresses were learnt) were linked to the vlangroup the port is in;
  • .. and when the switch did a lookup for a given MAC on another port, it wouldn't find it, as the address in the forwarding table showed it was for another vlangroup;
  • .. so it would do what switches do when faced with not knowing about the MAC (well, and how I had configured it) - it flooded traffic.
The solution was thankfully easy - I just had to change the vlangroup (well, "port vlan" here) to be '1', instead of the port id. Once this was done, all the ports came up perfectly and things worked great.

So, this now works great on the Atheros DB120 reference board. It's not working on other boards - there's likely some timing issues that need to be resolved. But we're making progress!

Finally, I spent a bunch of time porting over the port configuration and LED configuration stuff from OpenWRT so I didn't have the driver just hard-coded to the DB120 board. I'll update the configuration and code when I get my hands on other boards that use the AR8327 but for now this is all I have.

Enjoy!