Sunday, June 15, 2014

The modern developer: Understanding TCP

A couple of years ago, I'm not sure I would have delved into a candidate's knowledge of TCP during an interview.

However, I think I've changed my mind. Why?

Reason one: The Cloud. The marketer tells us that the cloud means we don't need to care where our servers are or where our applications are deployed. But for the developer I think the opposite is true. Suddenly, rather than running over internal high-speed network links, our applications are deployed onto commodity hardware with shoddy network links that regularly go down.

Reason two: Micro-service architecture. As we split our applications into small components, we also add to the number of integration points, and most of those integration points will be over TCP.

So what should a well-rounded developer know about their system and its dependencies?
  1. For any call that leaves your application's process:
    1. What is the underlying protocol?
    2. What is the connection timeout for that protocol?
    3. What is the read timeout for that protocol?
    4. What is the write timeout for that protocol?
  2. Are there any firewalls between your application and its dependencies?
  3. Does the traffic sent between your applications, or between your application and its dependencies, go over the Internet?
  4. What happens if a dependency responds slower than usual at the application level?
  5. What happens if a dependency responds slower, or not at all, at the protocol level?
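
For a plain Java socket, the timeout questions above come down to a couple of calls. Here is a minimal sketch, assuming a hypothetical helper called `open` and a throwaway local listener standing in for a real dependency. The 1000 ms and 500 ms values are purely illustrative.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ExplicitTimeouts {
    // Hypothetical helper: opens a TCP connection with explicit timeouts (milliseconds).
    static Socket open(InetAddress host, int port, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        Socket socket = new Socket(); // unconnected, so connect() can be given a timeout
        try {
            // Fails with SocketTimeoutException if the connection is not established in time.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Applies to blocking reads on this socket; it does not limit writes.
            socket.setSoTimeout(readTimeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // A local listener on a free port stands in for a real dependency.
        try (ServerSocket dependency = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket client = open(dependency.getInetAddress(), dependency.getLocalPort(), 1000, 500)) {
            System.out.println("read timeout in ms: " + client.getSoTimeout());
        }
    }
}
```

Note that a blocking `java.net.Socket` has no equivalent setting for writes, so the write timeout question usually has to be answered by the client library or by the application itself.
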
And I think the best way to know the answers to these questions is to test these scenarios. To do that, a developer needs a good understanding of how TCP works and how it can go wrong, and needs to be able to use tools like tcpdump, Wireshark and netcat. How many "Java developers" do you think fall into this category and would test these scenarios?
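
One of these scenarios, a dependency that accepts the connection but never responds, can be reproduced without any real network. The sketch below is illustrative only: the class and method names are made up, and the "dependency" is a local `ServerSocket` that never writes a byte. It checks that a read timeout actually fires, where the call would otherwise block for as long as the connection stays open.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SilentDependency {
    // Returns true if reading from a server that never replies fails with a read timeout.
    static boolean readTimesOut(int readTimeoutMs) throws IOException {
        // The listener never calls accept() or writes anything; the OS still
        // completes the TCP handshake, so the client connects successfully.
        try (ServerSocket silentServer = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress(silentServer.getInetAddress(), silentServer.getLocalPort()),
                1000);
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read(); // blocks, because no data ever arrives
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The same idea works against a real client library: point it at the silent listener and see how long the call takes to give up. Watching the exchange with tcpdump or Wireshark shows the handshake completing and then nothing coming back.
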

How many would say: "Well, I call a Java method that does the connection, so what do I care?"

As soon as you remember that the people writing these libraries are just as human as you, you might have second thoughts about dismissing these scenarios as "too edge-case" to test. Most people are surprised to realise that a lot of libraries just use the operating system's timeout value for TCP connections, which is usually measured in minutes, not seconds. How do you explain to your users why your application hung for 10 minutes?

I no longer see any of the above scenarios as edge cases; it just might take a few weeks in production for each of them to happen.
