Preamble: I’m writing this while in the air, somewhere between Ireland and Iceland, and it will be posted after I’m on the ground in New York later today. Hopefully…
I’m going to present multiple sessions at the Freescale Technology Forum in Austin, TX (June 22nd to 25th), which is a great thing! Naturally, for me living in Europe, this means a flight across the Atlantic Ocean. Some of my work peers and (extended) family members think that air travel must be a lot of fun. Well, maybe 20 years ago? I would rather say that travel today is ‘special’, with all the needed security measures, long waiting lines and all the hassles. I have travelled many times, and many times it was OK, or I was just lucky. But business travel usually is exhausting, and definitely *not* fun. Or would you buy and eat typical airplane food in a normal restaurant? I would not.
Before the journey, I need to prepare a lot upfront (presentations, conference papers, travel arrangements, …). While travelling I usually still need to work on details of my presentation material (yes, I know, but usually I have no chance to prepare everything upfront). And while I am at the conference, I still need to do my normal work (emails, preparation for the days after the journey, …).
This brings me to the main topic of this post: rebooting. In my class, I teach students that reliable and especially safety-related systems should not need a manual reboot. And if there is a problem, the system shall restart immediately and automatically, e.g. with a watchdog (or COP: Computer Operating Properly) timer.
💡 There is an excellent article about watchdogs, written by Jack Ganssle here: http://www.ganssle.com/watchdogs.htm
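To illustrate the watchdog idea, here is a minimal sketch as a plain C simulation, not code for a real microcontroller. All names (`Watchdog`, `wdt_init`, `wdt_kick`, `wdt_tick`, `simulate`) are hypothetical and made up for this example; a real COP is a hardware peripheral with device-specific registers. The assumption is simple: the application has to 'kick' the watchdog regularly, and if it stops doing so for too long, the watchdog forces a restart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog (COP) timer.
 * Illustration only: not a real device API. */
typedef struct {
    uint32_t timeout_ticks; /* ticks allowed between two kicks */
    uint32_t counter;       /* ticks left until a forced restart */
    uint32_t reset_count;   /* number of forced restarts so far */
} Watchdog;

static void wdt_init(Watchdog *w, uint32_t timeout_ticks) {
    w->timeout_ticks = timeout_ticks;
    w->counter = timeout_ticks;
    w->reset_count = 0;
}

/* Called by the application as long as it is running properly. */
static void wdt_kick(Watchdog *w) {
    w->counter = w->timeout_ticks;
}

/* Called once per timer tick. Returns true if the watchdog expired,
 * which in this model means the system gets restarted. */
static bool wdt_tick(Watchdog *w) {
    if (w->counter > 0) {
        w->counter--;
    }
    if (w->counter == 0) {
        w->reset_count++;
        w->counter = w->timeout_ticks; /* watchdog starts fresh after restart */
        return true;
    }
    return false;
}

/* Simulate total_ticks timer ticks. The application kicks the watchdog
 * every tick until it hangs at tick hang_at; a restart clears the hang.
 * Returns the number of restarts forced by the watchdog. */
static uint32_t simulate(uint32_t timeout_ticks, uint32_t total_ticks,
                         uint32_t hang_at) {
    Watchdog w;
    bool hung = false;

    wdt_init(&w, timeout_ticks);
    for (uint32_t t = 0; t < total_ticks; t++) {
        if (t == hang_at) {
            hung = true;  /* application stops responding */
        }
        if (!hung) {
            wdt_kick(&w); /* healthy application services the watchdog */
        }
        if (wdt_tick(&w)) {
            hung = false; /* restart brings the application back */
        }
    }
    return w.reset_count;
}
```

With this model, `simulate(3, 10, 4)` describes an application that hangs at tick 4 with a timeout of 3 ticks: the watchdog forces exactly one restart, and nobody has to power-cycle anything by hand. If the application never hangs within the simulated time, as in `simulate(3, 10, 100)`, no restart happens at all.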
So my travel did not start very well today :-(. Getting up early at 5am is normal for me. Had everything packed (“did I miss anything?”), weather was cloudy, but fine at breakfast time. But when I left my house to walk to the train station, I realized that the weather had changed and it was raining hard enough that I had to return to the house, grab an umbrella and start again. Nothing serious, but for sure not a smooth start.
Then while sitting in the train, travelling to the airport train station, I opened up my laptop and started it up to get some work done. That worked OK, except that Firefox complained that an update had not been successfully installed. That would not have been a problem, except that Firefox did not respond at all anymore. Opening the task manager to kill it … with the result that the whole Windows system was getting slow and unresponsive. I managed to get the system to shut down and restart. So the system shut down, restarted, only to automatically shut down again after a few seconds? Uh, oh! Then the system booted again, telling me that it had a problem :-(. It asked me if I wanted to run a disk check. Do I really have a choice? Ahhhhhrgggg!!!!!!!
I should have taken a picture of that screen on my laptop at that time, but I was simply too shocked and worried that my SSD might be seriously damaged or dead. Heading to a conference with the hands-on material for that conference on that machine, and now the disk is dead?
💡 Yes! I do have backups, but for security reasons I do not trust the cloud enough to put my data there. Maybe I should change?
So I agreed that the system should run a disk check during startup. While it was doing this, I feared the worst (disk dead? What else is wrong? Notebook board failure?), but at least I had a memory stick with all the conference material with me (I usually do this, just as a backup, and I hope I do not need it).
The disk check was going on, slowly, and it took about 30 minutes. The good news was, at the end, the notebook and operating system were up and running again. Right on time when I arrived at the train station at the airport in Zurich.
Next step was to check in for the flight. When I approached the self-service check-in terminals, I noticed that the ones for the airline my flights were booked with were not working. So I moved on to the normal check-in lines, realizing that, hey, the waiting lines were not that bad. Luckily I arrived pretty early. But then, waiting in line, I realized that something was not so good: even with the lines rather short (maybe 20 or 25 travellers waiting in front of a total of 4 check-in desks), they were moving very, very slowly. Others were wondering too. I saw the supervisor moving around to deal with the situation, because time was ticking. When I finally reached the check-in desk, the person at the desk apologized for the slow system, and explained that they had to reboot it. I was thinking to myself that I had already had a similar problem in the train with my own system, nodded and got my boarding passes.
While still thinking about what had happened this morning, I moved on to passport control and had no issues :-). Next was the security check, and although I had removed everything for the metal detector, that system said ‘Beeeeep!!!!!’ about me? Nothing that unusual, and even the security officer said something like “random selection by the system”. Fine, I’m all for safety and security, and passed that test.
So I thought that this phase went pretty well, and went down to the underground train station for the train which should bring me to the terminal with a quick 3 minute ride. Time was still OK to get to the terminal. I was still thinking to myself that it was a good idea to get up at 5am and to take the earlier train. But then the next obstacle: I saw that message above the train entry door system: ‘Out of order, do not enter’ :-(. Ahhhrg!
Well, it happened that an empty (?) train arrived after about 2 minutes. That “do not enter” message got cleared, the doors opened, and we could enter the train cars. Maybe false alarm? Getting paranoid? The doors were closing, but the train did not start moving. I was already saying to myself that this day somehow is not going well, and I’m probably going to miss my flight now. My guesses were going wild: maybe a world-wide virus has infected all electronic systems? After a few minutes (reboot?), the train cars were moving to the terminal :-).
I arrived right on time at the gate, boarding had already started, and I had a boarding card with a seat assigned.
💡 This was really smooth this time. Or should I tell you that in the past I had the pleasure several times that I had no seat allocated (and got it at the last minute), or that seats were double-booked? If that happens to you, it will help you to understand better why air travel is not fun (at least not to me).
In the plane, I had a seat, and I had a space for my hand luggage (oh, I could tell stories about hand luggage too!). A few minutes, and we should be ready for take-off.
That is, until the Captain spoke: he apologized that something was wrong with the computer on the plane. Nothing serious of course, and it should not affect the plane operation at all. But there was that strange error message/report which is not on the exception/approval list. As long as this message is there, he is not allowed to push back from the gate. So he informed us that, well, he has to reboot the system, and we should not be worried about it. The Captain explained to the non-techies in the plane that this is like rebooting the PC at home. Yikes! And at that time I realized what had already happened before: the on-board information monitors had gone black and restarted twice or more already (I did not pay close attention). They had already disconnected power and re-powered the system several times. So again the plane (for the avionics enthusiasts: Boeing 767-300) did the same: monitors and lights went dark, coming back after about 10 seconds. The monitors and audio system were starting the usual program (“Welcome on board”, “we are proud to have you here”, etc). Only to go dark again after 2 or 3 minutes to show yet another reboot. Oh, oh!
“Captain speaking!” after 15 minutes again. He was telling us that despite several reboots, the system still showed that error message, and they could not proceed with that message present. They did not understand why there was a problem, but they were working hard with the mechanical and systems engineers on the ground to get it fixed. And the best thing they could do was to repeat the reboot procedure.
At this time, my imagination was going wild. If I had implemented that system, and the crew had to reboot it, and the ‘solution’ was repeated reboots: should I stay, or should I rather try to leave the plane? The crew was working hard to get the plane off the ground, and safety is the first priority. Is there even an option to leave the plane? I felt the pain of the crew, and I knew they got all the help and support necessary. For a second I was thinking about raising my hand and saying “hey, maybe I could have a look at the metadata .log file?” :-). But nobody asked for an embedded specialist ;-).
The rebooting was repeated, and the cockpit crew informed us that the problem still existed, and that they would continue that way. And that they had now involved the technicians in Tucson (I guess in Arizona) to help them to identify and solve the problem.
There was no panic (well, except for my internal engineering soul, which was in panic mode). But then something happened that was really interesting: the crew did everything necessary to ease our time on the ground, which was now over one hour. They had already offered snacks. But then one of the crew members was saying to the folks in the back of the plane: “Hey, I have a quiz for you: if you tell me the answer, I provide you some champagne!”. I thought to myself: “Oh! Free alcohol! Now she has my attention!” 🙂
💡 Of course for the non-adult passengers on board non-alcoholic beverages were offered.
She started with the first quiz: “Which room has no windows and no doors?”, and let the passengers think about the solution for a few minutes. Next one was: “What is not in the beginning, but in the middle, and not in thousand years?” Again the ones with the correct answers got a reward :-). This was repeated two or three times. I was fascinated by that initiative and approach, and I was wondering if something like this was part of their training (probably it is). It directed the passengers to think about something else, not to worry about the delay, playing a game together.
I was about to propose another quiz: “what goes up the hill with three legs, and comes down with four”, but the Captain was speaking again. This time good news :-): after probably 10 or more reboots, that error message had disappeared and now things were working as they should. They had not changed anything, just rebooted. But there was bad news too: the problem had bubbled up to his management, and they insisted that the problem had to be investigated further, and they did not approve pushing back from the gate. And the cockpit crew cannot push back, leave the gate and take off without that approval. On the one hand my engineering soul was screaming, and on the other hand I was deeply feeling with the crew: safety first. A headline in the newspapers like “Crashed plane already had to reboot several times at the gate” would not be a good one.
Finally, 90 minutes behind schedule, the Captain was speaking from the cockpit again, with positive news: they were now approved to push back, and would be ready for take-off. And indeed, after the usual procedures and the taxi run, we were able to take off, and were in the air :-).
So: I’m excited to be there in Austin, and despite all the travel hassles, I’m sure the stay at FTF, the experience and the discussions there will be absolutely great. If you are in Austin for FTF, I’m looking forward to meeting you in person. Until then:
Happy Rebooting 🙂
PS: if I was able to post this, then at least I made it to New York (JFK). I hope not to be continued 😉
PPS: Guess what? My connecting flight got delayed by 1.5 hours 😦 Will see where this ends today…