Many IT operations engineers have approached me to say they have reached a dead end: their existing developer or sysadmin skills no longer let them work efficiently. Even engineers with DevOps and SRE exposure on advanced platforms hosted in virtualized, cloud, and hybrid environments struggle to cope with this complex, distributed microservice world. The challenges are hidden in the new technology setup, where every business expects scalability, resiliency, and reliability and wants to use the best available stack, such as FaaS (Function as a Service), Kubernetes (k8s), OpenShift, AWS ECS/EKS, and similar container and orchestration technologies on different platforms. Observability and monitoring challenges have also been discussed by the CNCF, the organization that helps the IT industry stay focused by building a sustainable ecosystem and fostering a community around a constellation of high-quality projects that orchestrate containers as part of a microservices architecture.
Ref: CNCF survey
Software applications are increasingly complicated yet sophisticated. Highly integrated systems are the new normal. Cloud software engineering is a hot area that draws the attention of software engineers across the globe. There are public, private, and hybrid clouds, and recently we hear more about edge and fog computing. Traditional IT environments still exist, so it is going to be a hybrid world.
Monitoring an application in a synchronous model is easy because each call is simple to trace. With an asynchronous model that has a message bus in between, it becomes much harder to track whether results arrive on time.
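One common way to make an async flow traceable is to attach a correlation ID to each message and log it at every hop. The sketch below is only an illustration of that idea: the in-memory `queue.Queue` stands in for a real message bus, and the names `publish`, `consume_one`, `record`, and `stages_for` are hypothetical helpers, not part of any tool mentioned in this post.

```python
import queue
import time
import uuid

# Assumption: an in-memory queue stands in for the real message bus.
bus = queue.Queue()
# Collected (correlation_id, stage, timestamp) records; a real setup
# would send these to a logging or tracing backend instead.
trace_log = []


def record(correlation_id, stage):
    """Note that a given message reached a given stage, with a timestamp."""
    trace_log.append((correlation_id, stage, time.monotonic()))


def publish(payload):
    """Producer side: attach a correlation ID before handing off to the bus."""
    correlation_id = str(uuid.uuid4())
    record(correlation_id, "published")
    bus.put({"correlation_id": correlation_id, "payload": payload})
    return correlation_id


def consume_one():
    """Consumer side: reuse the ID carried by the message so both hops line up."""
    message = bus.get()
    correlation_id = message["correlation_id"]
    record(correlation_id, "consumed")
    result = message["payload"].upper()  # placeholder for the real work
    record(correlation_id, "completed")
    return correlation_id, result


def stages_for(correlation_id):
    """Return the ordered stages seen for one message."""
    return [stage for cid, stage, _ in trace_log if cid == correlation_id]
```

Calling `publish("hello")` and then `consume_one()` leaves three records under the same ID, so the time between "published" and "completed" can be measured even though producer and consumer never call each other directly.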
Many monitoring and logging solutions are available to give insight into your infrastructure, such as Nagios, Zabbix, ELK, Dynatrace, NewRelic, SignalFX, CheckMK, Logentries, Splunk, and many more, yet we are still unable to find the actual cause of many incidents. I know some of us will say you can do X or Y to get insight, but believe me, you need fully dedicated staff focused on your observability challenge with a newer troubleshooting tool stack. Even then it is not easy for your team, because every day operations staff handle any number of random requests: decommissioning an old stack, fixing a bug in a script, patching or replacing instances or container images, paying down technical debt, choosing a new monitoring or platform stack, and working on cost reduction projects. In this hustle and bustle, only well-focused engineers and leaders can help: people who give the team confidence, who are always available to offer a new vision, and who assure the team that it can still win the race and beat these technical challenges.
Tools such as Linkerd, OpenTracing, and Zipkin are emerging to trace end-to-end web and API calls, and they are competing to win the hearts of operations engineers so that they can come out of the darkness and shed some light on hybrid and distributed infrastructure problems. We know this is not a mess but a step-by-step technological improvement, and I am positive that the industry is giving us new hope and will open new ways out of the dead end by breaking down the walls of the observability problem.
Some experience with SaaS solution providers:
First, a US-based retail consulting organization had big data analysis challenges in both dedicated and shared delivery models, and cost reduction was one of them. We tried to solve this by hosting each customer on the right cloud environment in a hybrid setup, treating each customer's constraints as a separate challenge. We tried VMware vSphere, Rackspace IaaS, VMware vCloud for raw compute, and AWS, each of which had its own cost benefits.
Second, a US-based tax calculation SaaS provider was working toward reproducible infrastructure to keep its service versions consistent and give customers a better, more stable experience. A blend of DevOps and SRE skills was much needed there.
Third, a US-based SaaS security solution provider has a complex distributed microservice infrastructure with both sync and async services, where monitoring and observability are among the main challenges. The team is addressing this with a service mesh: it has implemented Envoy as the data plane proxy, plans to adopt Istio as the control plane, and is trying to gain insight using an open source tracing solution alongside its existing traditional metrics and logging tools.
If I summarize my experience from above, we see three things:
- Right infrastructure size and model, chosen with cost in mind across multiple cloud providers (IaaS/PaaS/SaaS provider selection)
- Stable reproducible infrastructure (DevOps Skills)
- Monitoring and Observability for reliable services (SRE skills)
So, what is the solution for our fellow engineers in Architect/SRE/DevOps roles? Sit and wait for solutions from the big players like Google, Amazon, and IBM, or from open source projects working on this problem? Or take a different path and, with a focused mind, fix the problems within their scope using custom solutions (Python scripts, third-party tools from Lyft, Netflix, etc.)? The challenges differ with an organization's scale: large organizations and SaaS solution providers face resiliency, reliability, availability, and scalability challenges, whereas small organizations are aiming to grow and would also like to adopt the SaaS model for the benefits it provides.
I prefer to try AI and ML in this domain, because the challenges mentioned above give us the opportunity to apply these new skills toward a more observable, identifiable IT world. Some vendors are already trying AI and ML, such as Loom Systems in centralized logging and SignalFX Microservice APM in monitoring.
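As a first small step in that direction, an operations engineer can start with plain statistics before reaching for a trained model. The sketch below flags latency samples that deviate sharply from a rolling window using a z-score. It is a simple statistical baseline, not an ML model and not how any vendor named above works; the function name `find_anomalies`, the window size, and the threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that sit more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is treated as unusual.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 500, 101]
spikes = find_anomalies(latencies_ms)
```

With this sample data, only the 500 ms reading (index 6) is flagged. The sample right after the spike is not, because the spike itself inflates the window's standard deviation, which is one of the known weaknesses of this kind of baseline.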
I know it is challenging, but “unlearn” is the new word we must try. I am not joking; the reality is that traditional skills will only help you with 60% of the challenges, and the other 40% cannot be handled by a traditional skill set. I am counting DevOps and SRE in that 60%. Yes, again I am not joking, even though you will say that DevOps and SRE are new fields and that organizations use these people to solve their challenges. I don’t think the industry currently includes AI and ML in the job description of an operations engineer (including architects), but I am in favor of companies training their staff in these new skills, focused mainly on IT operations.
I am not leaving you in the dark, so try a programming language like Python and its libraries like
And tools like MetaBase for operations analytics, Istio and Envoy as the control and data planes in a service mesh architecture, and Linkerd for request tracing and telemetry. Hack around with them, train yourself, and devote yourself to fighting the newer challenges.
“A warrior can’t win without his arsenal but can fight boldly.” I hope you got my point, and I am open to hearing how your organization is solving observability, monitoring, and tracing challenges in sync and async distributed microservice models, so that people around us can get a holistic view.