We have been chasing down a mysterious problem in a embedded application we are developing. It is a sprinkler timer to water our yard. The microcontroller is an ATmega AVR 328p, the same microcontroller that some of the Arduino platforms use.
The application has an RFM12B wireless transceiver from Hope RF, and we are using this RFM12 library: http://www.das-labor.org/wiki/RFM12_library/en.
The issue is that the application works fine for days, sending and receiving data through the RFM12B wireless transceiver, but all of the sudden, it would stop working. Resetting the device restores operation.
The device is outside, in our yard, right next to a garden bib. It is not easy to troubleshoot this problem because all connectivity that we have to the device is via the RFM12B transceiver, so if it the wireless subsystem of the application is hosed, then there really is nothing we can do to troubleshoot. In addition, the wireless command set in in the firmware is currently very limited so we do not have good debugging tools even if the wireless subsystem is up. The firmware does send debugging information through a serial port on the microcontroller, though, so we have to use that in absence of other debugging tools.
The first step we took to debug the issue was to provide remote access to the microcontroller’s serial port. For this, we used a Raspberry Pi with an USB-to-serial adapter and with a WiFi USB adapter. We then left minicom (a serial communication program for Linux) running and came back every once in while, via remote SSH session, to check. Today we found an occurrence of the issue and investigated. This is what we found:
First, we noticed that the wireless subsystem was not working when we saw the following messages on the device’s console:
[2014-08-06 17:26:53] rfm12_tx() error 3
[2014-08-06 17:26:58] rfm12_tx() error 3
[2014-08-06 17:27:03] rfm12_tx() error 3
[2014-08-06 17:27:08] rfm12_tx() error 3
We then ran a diagnostic command on the console:
# show rfm12
[2014-08-06 17:29:11] My RFM12 node ID: 111.
[2014-08-06 17:29:11] RFM12 status register: 0x023f.
[2014-08-06 17:29:11] RFM12 state = 0
[2014-08-06 17:29:11] RFM12 tx state = 162
[2014-08-06 17:29:11] Number of bytes to transmit or receive = 219
[2014-08-06 17:29:11] Current byte count = 1
[2014-08-06 17:29:11] Total tx bytes = 40622
[2014-08-06 17:29:11] Total errors = 29983
A few things caught our attention:
- The RFM12 node ID should be 12 but the output says that it is 111.
- There is no such thing as RFM12 tx state 162 in the RFM12 library; it goes up to 3 or 4.
- The command prompt should be “frontlawn# “, but it is just “# “.
Obviously, some bad memory corruption has taken place here. Let’s see what we find…
The other piece of information we found in the minicom history buffer was this:
[2014-08-06 14:36:29] frontlawn# RFM12 rx packet: flags AD., hdr = 0x6c, len = 13.
[2014-08-06 14:37:46] 00000000 3445 7365 7420 7661 6c20 6f20 33 4Eset val o 3
[2014-08-06 14:37:46]
[2014-08-06 14:38:27] RFM12 rx packet: flags AD., hdr = 0x6c, len = 14.
[2014-08-06 14:38:27] 00000000 3446 7365 7420 7661 6c20 636c 2033 4Fset val cl 3
[2014-08-06 14:38:27]
[2014-08-06 15:07:10] RFM12 rx packet: flags ..., hdr = 0x0f, len = 254.
[2014-08-06 15:07:11] 00000000 3445 7365 7420 7661 6c20 6f20 336e 2033 4Eset val o 3n 3
[2014-08-06 15:07:11] 00000010 ffff ffff 3337 fd7f fc0f fd1c 201e 07ff ....37...... ...
[2014-08-06 15:07:11] 00000020 07fb 03df 407c 741f 2dd4 040c f721 a302 ....@|t.-....!..
[2014-08-06 15:07:11] 00000030 3529 0b05 080e 0200 0000 0000 0000 0000 5)..............
[2014-08-06 15:07:11] 00000040 0000 0000 0000 0000 0000 0000 0000 0000 ................
[2014-08-06 15:07:11] 00000050 0000 0000 0000 0000 0000 0100 0000 008a ................
[2014-08-06 15:07:11] 00000060 e955 ae3c b9e0 c7f0 6f10 179c bd65 625f .U.<....o....eb_
[2014-08-06 15:07:11] 00000070 3aea 998e 925c 1f43 152b 6b49 5e0b cca2 :....\.C.+kI^...
[2014-08-06 15:07:11] 00000080 cdd5 313e ae9e 1f75 9a3d 1eea d964 bfee ..1>...u.=...d..
[2014-08-06 15:07:11] 00000090 bffe 96dc 5df9 6b2c fdb7 8dd5 daab b8aa ....].k,........
[2014-08-06 15:07:11] 000000a0 8fc5 6e42 8df9 9e7b 53bf 3cf6 fd19 7737 ..nB...{S.<...w7
[2014-08-06 15:07:11] 000000b0 2767 f67c 975b 8f5a a7ea 6e63 bd39 2258 'g.|.[.Z..nc.9"X
[2014-08-06 15:07:11] 000000c0 756e a1bf 80fd 4b56 f0e3 e7fb bb28 ef93 un....KV.....(..
[2014-08-06 15:07:11] 000000d0 e353 2308 50fe 49f6 7b4f 2300 b087 f1fa .S#.P.I.{O#.....
[2014-08-06 15:07:12] 000000e0 9581 ff47 aba7 75a2 c0eb 91fa 6b7a 80e3 ...G..u.. cialis 20mg apotheke...kz..
[2014-08-06 15:07:12] 000000f0 52e0 93c0 da66 bdfb 279b 8a07 af12 R....f..'.....
[2014-08-06 15:07:12]
[2014-08-06 15:07:12] Invalid valve number.
[2014-08-06 15:07:12] rfm12_tx() error 3
[2014-08-06 15:07:17] rfm12_tx() error 3
At 14:36:29 the controller received the command “set val o 3” (short for “set valve open 3”, which instructs the controller to open valve number 3).
At 14:38:27 the controller received the command “set val cl 3” (short for “set valve close 3”, which instructs the controller to close valve number 3).
We sent these commands and everything was good until this point.
However, at 15:07:10, we receive a giant packet of 254 bytes. That is the first sign of trouble — we only have 2048 bytes of RAM in an ATmega 328p microcontroller, so we have never configured such a large receive buffer in our application. We check the library source code and see that the size of the receive buffer is configurable and we have configured it to be 40 bytes.
Is it possible that the RFM12 library is buggy and is accepting such a large packet and storing it in a 40-byte buffer? No, we checked and it has the correct validations. This points to an issue in our code…
This is our code:
if (rfm12_rx_status() == STATUS_COMPLETE) {
pkt_hdr.__hdr_val = rfm12_rx_type();
pkt_len = rfm12_rx_len();
pkt_data = rfm12_rx_buffer();
[...]
if (pkt_len == 0)
goto pkt_processed;
/*
* Action to take on received packet depends on the payload
* type, which is the first byte of the payload.
*/
if (pkt_data[0] == RFM12_PAYLOAD_CLICMD) {
memcpy(cli_cmdbuf, pkt_data + 2, pkt_len - 2);
cli_cmdbuf[pkt_len - 2] = '\0';
cli_processcommand(cli_cmdbuf);
[...]
The first byte of the packet (pkt_data[0]) is 0x34, according to the packet dump above. 0x34 happens to be RFM12_PAYLOAD_CLICMD, so the condition of the “if ()” statement is true and we take that code path. The first instruction in that code path is a memcpy() that will copy pkt_len – 2 bytes, i.e. 254 – 2 = 252 bytes into the array cli_cmdbuf[].
cli_cmdbuf[] is a global variable declared as:
/* A command cannot be longer that this many characters. */
#define CLI_MAX_CMDLEN 80
char cli_cmdbuf[CLI_MAX_CMDLEN];
Ouch! Yes, our buffer has been overflowed with this memcpy(). By 252 – 80 = 172 bytes! This explains the memory corruption and the failure of the wireless subsystem of the application.
A couple of observations about this whole issue:
- The received packet that caused the entire problem should not have been received. We have no idea who sent it, or why it is so large, but the fact of the matter is that we received it. Yes, it obviously is bogus, but we cannot allow a stray packet to bring us down like that. The lesson re-learned here is that our application needs to be able to handle exceptional conditions such as this.
- Of course, we were not aware of such exceptional condition before. Otherwise we would have put a check in place to prevent the disastrous memcpy(). The main problem here is that we were acting under the assumption that the low level RFM12 driver would not pass to us a packet that has a received, advertised length that is larger than the configured size of the receive buffer — the low level driver checks for the available space in the receive buffer and does not store more data if there is no more space left. However, it notifies the application that there is a packet ready to be handled, even if that packet is incomplete because it did not fit in the receive buffer. We guess this behavior is debatable because the application might want to consume that packet even if it is not complete (for example, do some error handling, print a debug message, etc.).
- In the world of embedded applications, the importance of debugging using a serial port cannot be overstated — we would have had a very hard time to find this issue if we had not seen the packet dump of that 254-byte packet on the serial console.
In terms of addressing the issue — we are going to try to address the problem by checking in our application that the size of the received advertised length of the packet is less than the size of the receive buffer. If the size is larger then we will ignore the packet. We will see how this plays out…