Sometimes sacrificing the little sanity you have to cultivate a bigger organisational change and engineering paradigm is worth it
When I joined $COMPANY as a principal DevOps architect, I was shown around a couple of server cabinets. My boss pointed to these physical things called servers. “That’s an AMQP server. That’s a Zabbix server. That’s a web server”. It is probably accurate to say that, in the DevOps and cloud-first world we now live in, this is what most people new to the industry imagine on-prem to be. However, to the folks who have been doing on-prem for years, this felt a bit like the late 1990s. Historic staffing and investment decisions led to this point, and that’s perfectly normal in a company finding its feet.
The key part here is ‘company finding its feet’. Lots of scale-up technical debt to deal with, not a lot of engineering folks, and a very real desire to adopt best practices going forward while trying to figure out how to keep the lights on in between.
Here are some lessons and observations I’ve had shoehorning monoliths into standalone containers, moving from the 1990s to the 2020s in less than 6 months.
You can’t have enough observability
When you think tacking an APM onto your application is enough, you’re really wrong. When you think you’re dumping enough logs via stdout on the container, you’re also really wrong. It’s almost naive to think that laravel.log (or the built-in driver for that matter) will provide enough to debug most issues without diving deep into containers.
Capturing the logs of the service runtimes is just as important as capturing the output of the application logging driver. Often a dodgy application update in your dev or staging clusters, or a slight misconfiguration, can cause utter frustration when trying to trace the root failure. Ensure every service dependency is getting into your logging driver, whatever that may be. In the case of my Laravel containers, I built a base image using supervisord and ensured error logs, access logs and php-fpm logs all pipe to stdout so my container runtime can pick the logs up (and usually ship them somewhere else). There are lots of ways to skin a cat.
Also, it’s easy to forget the merit of detailed infrastructure monitoring if you’re dealing with on-premise stuff. You can’t just query a CloudWatch dashboard. Prometheus and Telegraf were absolutely key components of my stack.
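To make the supervisord approach a bit more concrete, here is a minimal Python sketch that renders a supervisord config in which every program writes to the container’s stdout and stderr. The program names and commands (nginx and php-fpm) are assumptions for a typical Laravel image rather than my exact setup, and the services themselves still need to be configured to log to stdout/stderr rather than to files.

```python
import configparser
import io

# Hypothetical process list: names and commands are assumptions for a
# typical nginx + php-fpm Laravel image, adjust to your own runtimes.
PROGRAMS = {
    "php-fpm": "php-fpm -F",
    "nginx": 'nginx -g "daemon off;"',
}


def render_supervisord_conf(programs):
    """Build a supervisord config where every program logs to stdout/stderr."""
    conf = configparser.ConfigParser(interpolation=None)
    conf.optionxform = str  # keep option names exactly as written
    conf["supervisord"] = {
        "nodaemon": "true",      # stay in the foreground inside the container
        "logfile": "/dev/null",  # skip supervisord's own log file
        "logfile_maxbytes": "0",
    }
    for name, command in programs.items():
        conf[f"program:{name}"] = {
            "command": command,
            "autorestart": "true",
            "stdout_logfile": "/dev/stdout",
            "stdout_logfile_maxbytes": "0",  # no rotation on a stream
            "stderr_logfile": "/dev/stderr",
            "stderr_logfile_maxbytes": "0",
        }
    buffer = io.StringIO()
    conf.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_supervisord_conf(PROGRAMS))
```

Generating the file from a single list of programs is just one way to do it. The point is that every runtime in the image gets the same logging treatment, so nothing quietly writes to a file inside the container.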
Build your own base images
Lots of huge benefits here (and things to consider). Here are the cliff notes.
Building your own base images allows you to carry a core set of engineering principles across every container image you create in your organisation, giving you those as a layer each time you fork that image for a specific service. There are no easy rules for which way you should make base images, just do it in a pattern that makes sense to your operations. There are lots of dependency specific services in my current organisation, so I tend to focus on those as the first principle.
Security is also a great advantage here, as you can centrally control the versions of dependencies that many services require in one swift go. A well set-up dev/staging environment and quality assurance process is required here, to ensure a patch causes no real issues across an entire estate. In my case, I managed to reduce the time it takes to do dependency and OS patching by about 60%. Be sure to consider what triggers a rebuild and the touch points involved here, as you don’t want to patch a base image only to leave the images built from it unpatched in the wild. A good dependency tracking system will help here.
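As a rough illustration of what dependency tracking can mean at its simplest, the sketch below walks an inventory of images and lists everything that needs rebuilding when a base image is patched. The image names and the `IMAGE_PARENTS` mapping are entirely hypothetical, and a real system would pull this from your registry or build pipeline rather than a hard-coded dictionary.

```python
from collections import defaultdict, deque

# Hypothetical inventory: each image mapped to the base image it builds FROM.
IMAGE_PARENTS = {
    "php-base": "os-base",
    "laravel-base": "php-base",
    "billing-api": "laravel-base",
    "customer-portal": "laravel-base",
    "metrics-agent": "os-base",
}


def images_to_rebuild(patched_base, parents):
    """Return every image that inherits from patched_base, parents first."""
    children = defaultdict(list)
    for image, parent in parents.items():
        children[parent].append(image)

    ordered = []
    seen = {patched_base}  # guards against accidental cycles in the inventory
    queue = deque([patched_base])
    while queue:
        current = queue.popleft()
        for child in sorted(children[current]):
            if child not in seen:
                seen.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


if __name__ == "__main__":
    for image in images_to_rebuild("php-base", IMAGE_PARENTS):
        print("rebuild:", image)
```

The breadth-first order means an image is always listed after the image it is built from, which is the order a pipeline would need to rebuild them in.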
One other security consideration is supply chain attacks, which npm has managed to demonstrate a few times, with attackers managing to inject malicious code into base packages right at the source. Building your own images gives you more control over where you get your runtime dependencies from in terms of container OS and execution services. Go crazy and compile your stuff from signed source repositories in your build pipeline if you feel it’s really necessary (and you have enough build minutes)!
If you do GPU accelerated AI or HPC, you will know the horrific strings of dependencies you end up with. If you do lots of similarly accelerated applications, this really does help.
Put your containers in the right repositories
Let your company’s collaboration setup (if any) help dictate this. If you use something like GitHub’s container registry to share code and images under the same permissions, you have one less set of security controls to manage for collaborative work.
Sometimes identical permissions are not okay, e.g. when your containers are shipped directly to end customers. Again, lots of ways to skin a cat, just consider from the start whether something is likely to have shared permissions. The last thing you want to do is update every chart or stack to point to an image from a different repository. Where your images live should be recorded as part of your information security auditing procedure.
Stateful apps don’t like being forced under stateless principles
This is a buyer-beware warning: know what you’re getting yourself into when you’re shoehorning a monolith into a container to behave as a standalone service. Containerising it does not change its storage requirements in terms of file locking, permissions or other behaviour, and these are often the areas targeted when these monoliths are eventually broken down into microservices over time (where paths are re-routed via a proxy layer or similar).
Think about where your file data lives, your logs and your cache (often overlooked). Cache and sessions are a huge headache when stuck on certain shared filesystems with a dodgy load balancing algorithm in front, and file locking issues make themselves apparent fairly quickly.
Again there’s no overall rule that helps here, as really you’re guided by the constraints and best practices for the application when scaling horizontally. Carefully consider your load balancer configuration, your underlying shared filesystem or storage driver, and your application’s caching driver.
Developers & QA love the simplicity of their local and development environments
If you are mid-change to a microservice-first architecture, where there’s a reliance on, say, API calls to a monolithic app to keep a function alive, the docker-compose files used by these folks are a lot simpler, and easier to read and debug when something goes wrong. Rather than playing hunt the service, all of the stuff they’re looking for is in one place without having to sift through log files.
This was a huge issue while we were defeating monoliths, microservice by microservice, function by function. As you can imagine, 4 containers for one monolithic service became a bit of a behemoth (this was slow for us as the central authentication service was baked into said monolith).
The double-edged sword is that some monoliths then have complex volume mapping requirements, but at least these become quite common. Less volatile than the alternative.
Images can get big, and that’s sometimes ok
Before you go and make big images, consider your deployment targets on the edge and whether there are network or storage constraints in terms of downloading and storing your container images. Well-layered images can ease the pain of updating big images, but this may not be ideal.
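To put a rough number on why layering matters for constrained targets, here is a small Python sketch that estimates how much a target has to download for an update, given the layers it already holds. The layer labels and sizes are made up for illustration. The only behaviour it relies on is that a layer already present on the target is not pulled again.

```python
MB = 1024 * 1024

# Hypothetical layers for a new image version: label -> size in bytes.
NEW_IMAGE_LAYERS = {
    "os-base": 80 * MB,
    "gpu-runtime": 450 * MB,
    "app-dependencies": 300 * MB,
    "app-code-v2": 25 * MB,
}

# Hypothetical layers the edge target already has from the previous version.
CACHED_ON_TARGET = {"os-base", "gpu-runtime", "app-dependencies"}


def bytes_to_pull(image_layers, cached_layers):
    """Sum the size of the layers the target does not already hold."""
    return sum(
        size for layer, size in image_layers.items() if layer not in cached_layers
    )


if __name__ == "__main__":
    update = bytes_to_pull(NEW_IMAGE_LAYERS, CACHED_ON_TARGET)
    fresh = bytes_to_pull(NEW_IMAGE_LAYERS, set())
    print(f"update pull: {update // MB} MB, fresh pull: {fresh // MB} MB")
```

With the app code in its own top layer, an update only costs the small layer. If a lower layer such as the OS or runtime changes, expect to pull most of the image again, which is where the network and storage constraints bite.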
Why wouldn’t you just run the individual runtime components as individual containers in pods?
This becomes a better argument when you involve Kubernetes and stick apps in a pod. Unfortunately it’s all case-specific as to what the least terrible option is, and the context really matters. In my case it was a quick and easy way to migrate apps to some regions or targets that were simply Docker Swarm clusters at the time, and I was keen to just couple core services together for the sake of making the initial migration easier. It’s a pattern I stuck with as deployment cadence is bloody good, and it accidentally satisfied some security compliance questions I was asked by our customers. There are security merits to shipping an image with vetted runtimes baked in too, as I discussed a bit earlier with building your own base images. One last advantage, or excuse: it was a stupidly easy way to ship and test things as ‘production intent’ and find odd issues earlier on in the development process.