I have been fighting with this problem for a year. I was so convinced that my board fix for ACRIS would stop the boards from randomly ceasing to process data. I thought I could fix everything by adding pulldown resistors to force the communication chips into read mode when the microcontroller wasn’t telling them what to do.
Turns out, it’s a software problem. It’s always a software problem. Seriously. I encounter so many software bugs compared to electrical bugs, it’s not even funny. Well maybe a little bit, but whatever. Even if it isn’t a software problem, usually you can fix your problem in software.
I noticed that when I unplugged the communication links and plugged the lines back in, I could force the lights to freeze. So, there was something going on with the RS-485 chip, I thought.
So, I immediately read the datasheets of the SN75176B (the TI chip I use) and the MAX485 (the RS-485 chip made by Maxim, which is over 3 times as expensive). “A-ha!” I told myself, “clearly the issue is that the SN75176B produces unknown voltages when the comm link is broken whereas the MAX485 guarantees a logic ‘1’ output.” Therefore, the AVR must be going a little crazy, stopping communication. I was so convinced that I even started writing a blog post on how much cheap stuff sucks before even testing my theory.
Okay, turns out that wasn’t the issue. I found that out by making a little test setup:
On this board are two AVR microcontrollers with corresponding RS-485 chips. The left microcontroller uses the MAX485 and the right uses the SN75176B. Much to my surprise, the SN75176B was actually better at not dying when I started disconnecting and reconnecting the communications cables.
So what was going wrong? I investigated by having the debug LEDs show the various error registers. Turns out, sometimes when you disconnect and reconnect the cable, you get framing errors (the FE0 bit in UCSR0A is set). But wait, there’s more! When the communication freezes, the whole AVR slows down to a crawl. A 0.1 ms operation takes a good 10 seconds or so… very odd. My guess is that sometimes a framing error occurs and the receive interrupt gets stuck in a loop waiting for the reception to complete, hogging the CPU. As a result, the chip seems to be running slowly. By temporarily stopping the interrupts, resetting the UART, and re-enabling the interrupts, I can save the communications bus.
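The recovery described above might look something like the sketch below. This is not the actual ACRIS firmware: the function names `uart_rx_error` and `uart_recover` are made up for illustration, the register and bit names assume an ATmega with a USART0 (e.g. the ATmega328P), and "resetting the UART" is interpreted here as switching the receiver off and on again, which on these parts flushes the receive buffer and invalidates the error flags. The `#else` branch holds host-side stand-ins so the code compiles off-target.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#include <avr/interrupt.h>
#else
/* Host-side stand-ins (hypothetical) so the sketch compiles off-target.
   Bit positions follow the ATmega328P register layout. */
static volatile uint8_t UCSR0A, UCSR0B;
#define FE0    4
#define DOR0   3
#define RXCIE0 7
#define RXEN0  4
#define cli()  ((void)0)
#define sei()  ((void)0)
#endif

/* True if the status byte reports a framing error or a data overrun. */
static int uart_rx_error(uint8_t status)
{
    return (status & ((1 << FE0) | (1 << DOR0))) != 0;
}

/* Restart the receiver with interrupts held off.
   Assumption: clearing RXEN0 is enough of a "reset" for this fault. */
static void uart_recover(void)
{
    uint8_t saved;

    cli();                      /* stop interrupts while the UART is touched */
    saved = UCSR0B;
    /* Turn off the receiver and its interrupt, then restore the old setup. */
    UCSR0B = (uint8_t)(saved & ~((1 << RXEN0) | (1 << RXCIE0)));
    UCSR0B = saved;
    sei();                      /* re-enable interrupts */
}
```

From the main loop this would be used as `if (uart_rx_error(UCSR0A)) uart_recover();`. Since `uart_recover` ends with an unconditional `sei()`, it is meant to be called from main-loop code rather than from inside an interrupt handler.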
Oh, one more thing. For the rework I posted above, I used 10K resistors for the pulldowns. That was stupid; I should have tested things first. For the MAX485, 10K pulldowns are more than enough. But take a look at what happens when I first start up the SN75176B.
(You don’t want to know how long it took me to get that damn picture.)
10K pulldowns are apparently too weak. After using a 1K pulldown instead, things looked much better:
I should have tested that out before I reworked my lights, but unfortunately, they’ll have to stay like that because I don’t feel like taking them down and fixing them again.
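As for why 10K might be too weak on one chip and fine on the other, here is one plausible reading that I have not verified: a bipolar part like the SN75176B sources a small current out of a logic input that is being held low, and that current flowing through the pulldown raises the pin voltage (Ohm’s law). The sketch below just does that arithmetic. The helper names are made up, and the 400 µA and 0.8 V figures in the usage note are illustrative placeholders, not numbers taken from either datasheet.

```c
#include <stdint.h>

/* Voltage in millivolts across a pulldown of r_ohms carrying i_ua microamps. */
static uint32_t pulldown_mv(uint32_t i_ua, uint32_t r_ohms)
{
    /* uA * ohms = microvolts; divide by 1000 to get millivolts. */
    return (uint32_t)(((uint64_t)i_ua * r_ohms) / 1000u);
}

/* True if the pulldown keeps the pin at or below the logic-low limit. */
static int reads_low(uint32_t i_ua, uint32_t r_ohms, uint32_t vil_max_mv)
{
    return pulldown_mv(i_ua, r_ohms) <= vil_max_mv;
}
```

With an assumed 400 µA input current and an assumed 0.8 V logic-low limit, `pulldown_mv(400, 10000)` is 4000 mV, so the pin would not read low, while `pulldown_mv(400, 1000)` is 400 mV, which would. A CMOS input with a much smaller input current would be happy with 10K under the same arithmetic.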
I’ve been running them for almost 9 hours now at the highest brightness with very fast communication and they haven’t failed once. So I think the problem is fixed.