Saturday, January 31, 2015

Learning From Mistakes

Here's a good tagline for this post: Mistakes happen. They are not the end of the world. They can be fixed or compensated for. What matters most is learning from them. And, yeah, in software, proper testing is important!

A quote attributed to Thomas John Watson, Sr. reads, "Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?"

I appreciate that philosophy. Mistakes, when they truly are mistakes and are not the result of ill intent or a demonstrated inability or refusal to learn, should be regarded as teaching opportunities. I firmly believe that success and failure together are the best way to get a "well-rounded" education in life.

This week I made a minor mistake at the day job. I renamed all the internal groups in our JIRA (issue and task management software) in preparation for synchronizing groups from our directory server. All seemed to go fine until a coworker appeared at my door and said that everyone on his team had just lost their workflow buttons. They had been there a few hours earlier. Hmmm... Turns out, JIRA refers to groups by name in lots of places, and due to the ultra-generic (flexible?) way big Java applications all seem to be designed these days, it was impossible for me to write a script to find all those corner cases ahead of time. Now that I've figured out where they were, I have been able to deal with the two or three other cases during the week where someone said they couldn't see something they were supposed to be able to see.

Lessons learned? First, seriously, why was I doing this in production first? Second, I learned something about the tool I would never have learned otherwise. I am better for it, and therefore my company can be better for it as well. Third, most mistakes are recoverable. And fourth, even when a mistake is unrecoverable, there is always a way forward from it, and things can get back on track.

Ok, I could end this post right here, and you might judge it to be a nice little piece of possible wisdom that most people already share. But, like I said, this was a minor mistake. I've been the cause of some major headaches, mostly very early in my career. Read on for some entertainment...

What happens when I delete the C:\DOS directory?

I used a student loan and grant money to buy my first computer in college (a 386SX with 2MB RAM and an 80MB hard drive; I think this was 1992). This way I wouldn't have to spend hours upon hours in the computer labs to work on assignments, and could be home with my beautiful wife. Having a computer at home gave me plenty of opportunities to tinker and try things. Curious about what would happen, one day I decided to delete the C:\DOS directory.  Imagine my surprise when the computer wouldn't boot after that. I learned a little more about what an operating system actually does (I was in my first year of computer science at this point). I panicked for a little while, and seeing my wife realize that I might have just thrown away $2,000 only added to the horrible feeling in the pit of my stomach. In desperation, I put in floppy disk #1 and tried rebooting. Relief! It booted and offered to install DOS again. More learning.

Too much whitespace in the code in a 24-line 3270 terminal is a problem

Fast forward a couple of years. I am now working as a student programmer in the university's financial department, assigned to work on the payroll system, among other things. One day, upon arriving at work, the payroll team lead asked me if I had tested my recent changes thoroughly enough. Of course, I thought I had, but... Accounting had called. Somehow, this payroll was double what it should have been.

Here's what the problem was. We accessed the development environment on the mainframe via 3270 terminal emulators from our PCs. When paging up and down, the editor leaves one line from the previous page visible to help provide some continuity between the pages. I knew that at one point in the logic I was updating I would need to make the call to the ApplyPayment subroutine (or whatever it was called). I found the place in the code where I thought it should be called and added the appropriate line. I was right about it being the proper place for the call, because the call was already there; it had simply scrolled off the screen, and I was only barely scanning the code as it scrolled by. The line carried over from the previous page happened to be a blank line, and there were two or three extra blank lines after the call, which made it that much less likely that the line I needed to see would be the one carried over. With the subroutine now called twice, every payment was applied twice.

Luckily the accountants caught it before any real damage was done. Every payroll run gets audited (thankfully) before releasing it to the printers and the direct deposit systems. A quick fix, a re-run of the payroll, and a few hours later, the checks were in the mail.

More learning. Testing is important. Good test data is important. But here's an observation. It's difficult to teach college students how to write good tests when they don't have enough experience to understand all the things that usually go wrong in software. Experience is so important to becoming competent in whatever you're doing in life.

Where did all the buildings go?

Fast forward another couple of years. The payroll mistake didn't cause me to be tossed from the career path, doomed to a life of ___(insert_your_least_favorite_menial_job_here)___. Just before I graduated, the department offered me full-time employment. (I should add that although I'm only highlighting mistakes in this post, I worked my tail off and tried to learn all I could. That overcomes a multitude of sins.)

I was now in charge of the capital equipment inventory system, which is a fancy accounting term that means watching the value of buildings and desks and things go down, usually for tax reasons. I don't even remember what change I was making to the system that day. I coded it up and tested it in the development database, and upon seeing that it did what I expected it to do, promoted it to production.

The next morning, another more experienced developer, the one whose role I had taken over on this part of the system, asked me whether I had tested the changes I had made. I said, Yes. To which he responded, "Oh, maaaan. Something went really wrong last night." More than half of all the capital equipment (the buildings and everything bolted to the floors in them) was missing from the database. It turns out there was a condition in the data I hadn't accounted for, and that wasn't represented in the dev data we had to test with. Crud. I spent 36 straight hours at the office trying to reconstruct the data, but there wasn't enough there to reliably recreate everything. Month-end was coming, and pressure was high.

What's that? Just restore from the backup? Yeah. After a day and a half of trying to avoid doing that, we finally turned to the operations department to request the restored data. After a few hours of closed-door conversations, they came back sheepishly and said that their backup job had been failing every night for the last six months. The operators had been ignoring the messages on the console. After so long, they just started telling each other that that message always happens, so it doesn't mean anything. College students. You could sort of give them a bit of a pass, since most of them didn't really know what those messages meant anyway. But the system admins? Never, in six months, did they bother to review the logs?

At this point, yelling and screaming accomplishes nothing, and the business folks over in the administration building were pretty calm people. The accounting director decided to close the month early, using data manually entered from the last report, two weeks old. I fixed my bug and everything proceeded forward from there. Harm was done. Time was lost. Multiple people contributed to this disaster.

What did I learn? Again, testing is important. And having good test data is critical. And if you have any sysadmin responsibilities, it's a good idea to develop a habit of surveying the logs every day as a sanity check.  And, if you're going to work in a job, be proactive and learn all you can about what you're working with.

These whoppers all happened 20+ years ago. To be sure, I have continued to learn a few things by not doing them right the first time. But nothing has really come close to the levels of disaster that I caused, or potentially caused, in those early years. I suppose this is what we call EXPERIENCE. Life would have been better on those occasions had I not made the mistakes I did. But knowing what can go wrong, really knowing it, makes me more capable now. The only real mistake in life is not learning from the experience when it doesn't go right.





Saturday, January 17, 2015

Monitoring Web Service Performance with Elasticsearch, Logstash and Kibana

Elasticsearch, Logstash and Kibana (www.elasticsearch.org) are fantastic open source products for collecting, storing, monitoring and analyzing events.

Here is one way you can configure Logstash and Kibana to monitor system loads and web application response times.  (Elasticsearch, while powerful, does its basic job so well that I did not need to configure anything for it in this example, other than to set it up so Kibana could connect to it, which Kibana told me how to do the first time I tried.)

Configuring Logstash

Note: This configuration works on Ubuntu servers. Your environment may need slight adjustments.

input {
    # System Load
    exec {
        # We are generating our own events here, so we 
        # get to assign a type of "system-loadavg" so 
        # Logstash can apply a specific filter to them.
        type => "system-loadavg"

        # Produce the system loads on a single line.
        # Example:
        #     0.23  0.18  1.03
        command => "cat /proc/loadavg | awk '{print $1,$2,$3}'"
     
        # Do this every 60s while Logstash is running.
        interval => 60
    }

    # Web Application Response Time
    exec {
        # Again, generating our own events, with type 
        # "ws-ping"
        type => "ws-ping"


        # (Optional) Add a field to our custom event.  If 
        # you monitor more than one service this way, you 
        # will be able to distinguish them using this field.
        add_field => { "service" => "jira" }


        # Execute a couple of commands to get a web 
        # page from JIRA
        #    /usr/bin/time    Uses the Linux time command, 
        #                     not the Bash built-in function
        #    -f '%e'          Specify the output should just 
        #                     be the real time, in seconds.
        #    2>&1             Redirect time's output from 
        #                     stderr to stdout for Logstash
        #
        #    curl             Just google this for an idea 
        #                     of its sheer awesomeness.
        #    -s               Silent mode, don't want the 
        #                     progress bar in the output.
        #    -k               Don't check SSL certificate 
        #                     (this is acceptable for our 
        #                     self-signed internal cert)
        #    -o /dev/null     Send the output to /dev/null. 
        #                     We don't care what the page
        #                     contains, just how long it 
        #                     takes.
        #    -u someuser:somepass 
        #                     We do not want to measure the 
        #                     static login page, so we'll 
        #                     need curl to authenticate with 
        #                     JIRA so it can get to a dashboard.
        #    https://our.jira.host/jira/ 
        #                     The page to load.
        #
        command => "/usr/bin/time -f '%e' curl -sk -o /dev/null -u someuser:somepass https://our.jira.host/jira/ 2>&1"

        # Do this every 60s while Logstash is running
        interval => 60
    }
}


filter {
    # Transform the system-loadavg event by adding names 
    # to the numbers
    if [type] == "system-loadavg" {
        grok {
            # Parse/match the event using the pattern 
            # of three numbers separated by spaces. Map 
            # the numbers to float numbers in Elasticsearch 
            # (otherwise, we won't be able to plot them.)
            match => { "message" => "%{NUMBER:load_avg_1m:float} %{NUMBER:load_avg_5m:float} %{NUMBER:load_avg_15m:float}" }
         
            # Only store matched values that have names 
            # provided for them (this is the default)
            named_captures_only => true
        }
    }
 
    # Transform the ws-ping event by adding a name to 
    # the number created by the input event.
    if [type] == "ws-ping" {
        grok {
            match => { "message" => "%{NUMBER:responsetime:float}" }
        }
    }
}


output {
    # Send all events to Elasticsearch.  Be sure to configure 
    # the correct host.
    elasticsearch { host => "your.elasticsearch.host" }
}


Logstash will now generate performance monitor events every minute and send them to Elasticsearch.
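If you want to see what the exec inputs will actually capture before restarting Logstash, you can run the commands by hand. Here's a quick simulation of the loadavg command using a fixed sample line instead of the live /proc/loadavg (the sample values are made up, but the format matches):

```shell
# A sample line in the format of /proc/loadavg: three load
# averages, the running/total process counts, and the last PID.
sample="0.23 0.18 1.03 1/321 4567"

# The same awk invocation the exec input uses keeps only the
# first three fields, which is what the grok filter expects.
echo "$sample" | awk '{print $1,$2,$3}'
# → 0.23 0.18 1.03
```

You can do the same with the /usr/bin/time + curl command for the ws-ping event; it should print a single number of seconds on stdout.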


Visualizing in Kibana

Here's how to configure your Kibana dashboard to watch and visualize these events.



Add the following queries to your dashboard, as two separate queries.  You can optionally set the color and alias for the queries so they look better on the screen.  This is done by clicking the color dot in the query field in Kibana.

  host:our.jira.host AND type:system-loadavg

  type:ws-ping AND service:jira

Notice that Logstash automatically adds a "host" field to the event.  You can see that our custom event types and fields have made it, too.
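For reference, a ws-ping document stored in Elasticsearch ends up looking roughly like this (the values here are invented for illustration; your timestamp, host, and response time will differ):

```
{
  "@timestamp": "2015-01-17T10:15:00.000Z",
  "host": "our.jira.host",
  "type": "ws-ping",
  "service": "jira",
  "message": "0.42",
  "responsetime": 0.42
}
```

Note that responsetime is a number, not a string, thanks to the :float suffix in the grok pattern. That distinction matters for the charts below.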

To graph the system load average, add a new panel to your dashboard with the following parameters.

Panel type: Histogram
Title: 5 Minute Load Average
Chart Value: mean
Value field: load_avg_5m  (We named this in the Logstash configuration)
Chart option: Line (not bars), not stacked
Queries: selected; then choose the query that corresponds to the system-loadavg search you defined above.  It will be in the list Kibana provides for you to select.

To graph the response time, do the same thing as above, but select the responsetime field for the Value field, and the query that corresponds to the ws-ping event type as defined above.

That's it!

Here are a couple of things I learned along the way.

  • Kibana didn't update my charts at first.  The updating icon on each panel just sat there, spinning.  The problem was that I hadn't told Logstash to send the numbers into Elasticsearch as numbers, so it sent them as strings.  Kibana was getting errors back from Elasticsearch, but wasn't telling us anything about it.  Using the Inspect tool (the "i" icon on the panel) gives you a curl command to see the result of the query for yourself.
  • Kibana didn't give us the option of plotting the actual value, so we use the mean to provide the closest representation of the value over the time between plot points.  It's a good compromise since it smoothes the plot somewhat.
  • We collect more events from our application logs and Linux system logs. Applying a grok pattern to those was more difficult, and the docs could be better.  But there is a Grok Debugger that allows you to paste in a sample of your event data and play with the grok patterns to see what it is able to match.  Once I found this, my work for the week was done in no time at all.
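One more trick that would have helped with both of the first two items: while developing your filters, you can add a stdout output alongside Elasticsearch so Logstash pretty-prints each event, post-grok, to the console. This is a standard Logstash output, not something specific to this setup:

```
output {
    elasticsearch { host => "your.elasticsearch.host" }

    # Pretty-print every event to stdout so you can see
    # exactly which fields grok extracted and whether the
    # numbers really came through as numbers.  Remove this
    # once everything looks right.
    stdout { codec => rubydebug }
}
```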
This is a short intro to a very powerful event management platform.  I hope it helps you get up and running a little faster.

Tuesday, January 6, 2015

Paying the Indie Dues and Simplifying My Life

Let's be honest.  It's lucky for me you happened by my little blog.  I'm nobody of any sort of fame.  I wouldn't say that I'm one of the world's, or the industry's, or even my circle of friends', best thinkers. I'm just a regular person.  A software developer.  I have a day job that more than pays the bills.  I'm really happy there.

Why, then, would I dream of becoming an independent software developer?  It must be because the life of an indie is so glamorous.  I mean, who wouldn't love working from anywhere you want, on whatever you want. And who wouldn't want hundreds of thousands of dollars thrown at you each year as the general population of the world falls all over themselves to pay you a couple of bucks to have the pleasure of using your beloved software creation?

Oh yeah. Apparently, it's not like that.

I am thoroughly impressed by indie developers that can make a living at what they're doing. They work hard and most of them enjoy none of the benefits mentioned a couple of paragraphs ago. But they're making a living.  I listen to many of them on podcasts, and one thing I've come to understand (we all know it, but we don't all understand it) is that their success is the result of years of working at it.

If you look at My Apps, you'll see that I've developed a couple of games for iPhone and iPad.  They're fun, or so I'm told by all my family and friends who have played them (not all my family and friends, mind you, just the ones who have played my games). I admit I had dreams of going viral and retiring on it, or at least paying off my mortgage and sending my kids to college on a TreeDudes scholarship.  The reality is, they're only my second and third attempts at publishing indie software.  My first attempt was about 12 years ago, when I wrote a hex file viewer for Palm OS and sold two copies of it via Handango.  So, between that app, and my two recent iOS games, plus the ad revenue from the games, I have earned so far a grand total of about $40. I suppose the dues to be paid for success are somewhat higher than where I currently am.  It's all good.  I'm not angry or disillusioned.  Of course I'm disappointed, but it's the same kind of disappointment you experience when you buy a raffle ticket and you don't win.  What else would I expect?

Rovio produced 51 titles before Angry Birds made them instantly famous. David Smith (@_DavidSmith) has a whole portfolio of apps he maintains.  Daniel Jalkut of Red Sweater does, too.   His podcasting partner, Manton Reece, is the same.  (Can you tell which podcasts I listen to the most?)  Interestingly, Manton's app business is his side job.  He has mentioned often in his Core Intuition podcast that he has a day job, too.

Still being honest... I did more than just dream of wild success. I prepared for it. I registered my own domain, formed an LLC, got an employer ID number (EIN) from the IRS, opened a business account at the bank, and converted my personal Apple Developer account to a business account.  After all, I had to look legitimate so nobody would realize I was just small potatoes.

Looking back, I can now understand why my wife (she is so patient and understanding) always looked at me and said, "Well, I don't think all that is necessary, but you're the one who knows about these things, so I trust ya."  $40 in revenue doesn't even come close to covering the hosting fees, domain registration fees, Apple Developer fees, and small account fees at the bank.  And, oh yeah, I had to pay extra for Turbo Tax to help me do my small business taxes, since I had a small (very small) business.  But, after operating on a loss, at least my tax bill was in my favor!

I didn't need any of that stuff to publish my software.  I've learned a valuable lesson.  I'm letting my domain expire. It was always a bit of a pain having to explain that it was "cornerstone" but spelled "kor-ner-sto-ane." We did that so the domain name would be unique.  Yep.  The only thing I hosted on that site was this blog, which was easily moved over to Blogger.com and I actually like the platform.  It's nice not having to worry about Wordpress updates anymore.  My LLC has been terminated.  It's not like it was really going to protect me from a lawsuit, anyway.  The bank account is closed, eroded entirely away by low-balance fees.  Oddly enough, the EIN gets to stay around forever.  I'm sure I'll forget about it someday.  And, yes,  I've opened a new Apple Developer account in my own name again since Apple doesn't convert business accounts back to personal accounts, darn it.

I am not giving up on developing my own little portfolio of apps, though.  I've got a few ideas that I'm thinking about.  One of them will soon be moved from the back burner to the front, and I'll be off developing something new again.  But before I start that, there's an awful lot of home repair and yard work that needs to happen!

I am not foolish enough (anymore) to hope for overnight success,  but I am still counting on life providing for me what it does for others who pay the price required for success. Success takes steady, consistent work.  As I consider it now, maybe the reason I have a good day job is because I've worked at it nearly every day for two decades.  Gosh, imagine that!

I really do feel fortunate that you've read my blog post all the way to the bottom.  Thanks for that.  I'd love to hear from you in the comments below.