Monitor your apps and infrastructure with WaveFront (beyond APM): Use Cases and Solutions

In this blog, I will give a quick introduction to WaveFront/Tanzu Observability (TO) and cover a few use cases and real challenges that it can solve.

What is WaveFront Tanzu Observability (TO)?

Monitor everything from full-stack applications to cloud infrastructure with metrics, traces, span logs, and analytics. It provides features beyond those of a typical APM tool.

https://tanzu.vmware.com/observability

WaveFront is an APM tool that provides additional features beyond APM: it monitors modern cloud-native microservice applications, infrastructure, VMs, and K8s clusters, and alerts in real time across multi-cloud and on-prem environments at any scale. Traditional tools and environments make it challenging and time-consuming to correlate data and get the single-pane-of-glass visibility needed to resolve incidents quickly in critical production environments. It is a unified solution with analytics (including AI) that ingests, visualizes, and analyzes metrics, traces, histograms, and span logs, so you can resolve incidents faster across cloud applications.

Features:

  • It works with existing open-source monitoring solutions such as Prometheus, Grafana, and Graphite
  • It integrates with almost all popular monitoring solutions on VMs and containers, Spring Boot, Kubernetes, messaging platforms such as RabbitMQ, databases, etc.
  • It monitors container and VM stats
  • It captures traces, usage, and performance of all microservice APIs, with a topology view provided by its powerful service discovery features
  • It maintains versions of charts and dashboards
  • It stores and archives old monitoring data for analytics purposes

High Level Technical Architecture

WaveFront use cases:

  • Multicloud visibility (mostly data center, moving to public cloud)
  • Application monitoring (+ tooling for Dev and Ops visibility)
  • Service performance and reliability optimization (assess-verify)
  • Observability and diagnostics of multi-cloud and on-prem K8s clusters
  • Business service performance & KPIs
  • App metrics: from New Relic, Prometheus and Splunk
  • Multicloud metrics: from vSphere, AWS, Kubernetes
  • All data center metrics: from compute, network, storage
  • Reliability and high availability operations
  • App and infrastructure monitoring, analytics dashboards
  • Auto alerting mechanism for any production bug or high usage of infrastructure (CPU, RAM, Storage)
  • Instrument and monitor your Spring Boot application in Kubernetes
  • Other Tanzu products monitoring
  • System-wide monitoring and incident response – cut MTTR
  • Shared visibility across biz, app, cloud/infra, device metrics
  • IoT optimization with automated analytics on device metrics
  • Microservices monitoring and troubleshooting
  • Accelerated anomaly detection
  • Visibility across Kubernetes at all levels
  • Solving cardinality limitations of Graphite
  • Easy adoption across hundreds of developers
  • AWS infrastructure visibility (cost and performance)
  • Kubernetes monitoring
  • Visualizing serverless workloads
  • Solving Day 2 Operations for production issues and DevOps/DevSecOps
  • Finding hidden problems early and improving SLA compliance for service ticket resolution
  • Application and microservices API monitoring
  • Performance analytics
  • Monitoring CI/CD environments such as Jenkins with WaveFront

Live WaveFront Dashboard

References

Generic Demo Video -1

MicroServices Observability with WaveFront Demo Video -2

Tanzu Service Mesh (TSM) based on Istio : Use Cases & Solutions

In this blog, I will give a quick introduction to TSM and cover a few use cases and real challenges that it can solve.

What is Tanzu Service Mesh (TSM)?

Radically simplify the process of connecting, protecting, and monitoring your microservices across any runtime and any cloud with VMware Tanzu Service Mesh. Provide a common policy and infrastructure for your modern distributed applications and unify operations for Application Owners, DevOps/SREs and SecOps without disrupting developer workflows.

https://www.vmware.com/in/products/tanzu-service-mesh.html

Tanzu Service Mesh is an operator-side microservice orchestration tool for K8s. It manages service discovery, traffic, mTLS-secured payloads, rate limiting, circuit breaking, and telemetry and observability of VMs and microservices across multiple clouds. Open-source service mesh technologies like Istio exist to help overcome some of the challenges of building microservices, such as service discovery, mutual TLS (mTLS), resiliency, and visibility. However, maintaining and managing a service mesh like Istio is challenging, especially at scale.


It provides unified management, global policies, and seamless connectivity across complex, multi-cluster mesh topologies managed by disparate teams. It provides app-level observability across services deployed to different clusters, complementing/integrating into modern observability tools you use or are considering.

TSM Global Namespace Architecture

As of now, only this enterprise product offers a global namespace (GNS) that spans multiple K8s clusters across multiple clouds. Open-source Istio doesn't provide this feature.

TSM Use Cases

  • Service discovery across multiple Kubernetes clusters, namespaces, or clouds using a Global Namespace (GNS)
  • Distributed Microservice Discovery on multi-cloud
  • Traffic Monitoring and API communication tracing
  • Logging and K8s Infra Monitoring with admin dashboard visualization
  • Rate Limiting with the help of Redis
  • Business Continuity (BC)
  • Without a service mesh, the developer is responsible for providing all service-related configuration through boilerplate code
  • Secure Payload
  • Netflix OSS components (Eureka for service discovery, Zuul as API gateway, Ribbon for load balancing, caching, etc.) and Hystrix (circuit breaker) are legacy and have no enterprise support; they are also tightly coupled with application source code
  • Open source Istio has no enterprise support as of now
  • Visibility for DevOps and DevSecOps

References

  1. Doc – https://docs.pivotal.io/pks/1-7/nsxt-service-mesh.html
  2. Public doc- https://tanzu.vmware.com/service-mesh

Demo for Microservices:

Tanzu Mission Control (TMC) for multi-cloud: Use Cases & Solutions

In this blog, I will give a quick introduction to TMC and cover a few use cases and real challenges that it can solve.

What is Tanzu Mission Control (TMC)?

Operate and secure your Kubernetes infrastructure and modern apps across teams and multiple clouds (on-prem, private, public, and hybrid Kubernetes clusters).

https://tanzu.vmware.com/mission-control

VMware Tanzu Mission Control provides a single control plane to easily provision and manage Kubernetes clusters and operate modern, containerized applications across multiple clouds and clusters. It works as a management cluster (Kubernetes control plane) that provisions and manages the worker/data nodes of multiple clusters. This includes deploying and upgrading clusters; setting RBAC, security, and other policies and configurations; monitoring the health of clusters (VMs and K8s); and helping identify the root cause of underlying production issues.

TMC Use Cases

  • Multi-cloud management of on-prem, public, hybrid cloud
  • Centralized Control Plane for provisioning K8s cluster for public cloud and on-prem
  • Centrally operates and manages all your Kubernetes clusters and applications at scale
  • App and service management
  • Enables developers with self-service access to Kubernetes for running and deploying applications
  • Manage security and configuration easily and efficiently through a powerful policy engine covering RBAC and cluster inspection

References

Demo Video

Scale Spring Batch, comparison with Spring Cloud Task & best practices of Spring Batch!

Disclaimer: This blog content has been taken from my latest book:

“Cloud Native Microservices with Spring and Kubernetes”

Comparison: Spring Cloud Task vs Spring Batch

  • Spring Cloud Task is complementary to Spring Batch.
  • A Spring Batch job can be exposed as a Spring Cloud Task.
  • Spring Cloud Task makes it easy to run any Java/Spring microservice application that does not need the robustness of the Spring Batch APIs.
  • Spring Cloud Task has good integration with Spring Batch and Spring Cloud Data Flow (SCDF). SCDF provides batch orchestration and a UI dashboard to monitor Spring Cloud Tasks.
  • In a nutshell, all Spring Batch services can be exposed/registered as Spring Cloud Tasks for better control, monitoring, and manageability.

Best practices for Spring Batch:

  1. Use an external file system (Volume Services) to persist large files on PCF/PAS because of its file system limitations. Refer to this link.
  2. Always use SCDF abstraction layer with UI dashboard to manage, orchestrate, and monitor Spring Batch applications.
  3. Always use Spring Cloud Task with Spring Batch for additional batch functionality.
  4. Always register and implement vanilla Spring Batch applications as Spring Cloud Task in SCDF.
  5. Use Spring Cloud Task when you need to run a finite workload via a simple Java micro-service.
  6. For High Availability (HA), implement the horizontal scaling technique that best suits your use case on containers (K8s), chosen from the scaling techniques listed below.
  7. For large PROD systems, use SCDF as an orchestration layer with Spring Cloud Task to manage a large number of batches over large data sets.
  8. App data and batch repo should live in the same schema for transaction synchronization.

 Spring Batch Auto-scaling (both vertically and horizontally)

  • Vertical Scaling: No issue with that. Hardware or pod size can be increased at any time based on CPU and RAM usage for better performance and reliability. As you give the process more RAM, you can typically increase the chunk size, which typically increases overall throughput, but this doesn't happen automatically.
  • Horizontal Scaling: There are five popular techniques (watch this YouTube video for details and refer to this GitHub code):
  1. Multi-threaded steps: Each transaction/chunk is executed by its own thread. State is not persisted, so this is only an option if you don't need restartability.
  2. Parallel steps: Multiple independent steps run in parallel via threads.
  3. Async ItemProcessor/ItemWriter (single JVM): ItemProcessor calls are executed within a Java Future. The AsyncItemWriter unwraps the result of the Future and passes it to a configured delegate to write.
  4. Partitioning: Data is partitioned and then assigned to n workers that are executed either within the same JVM via threads or in external JVMs launched dynamically when using Spring Cloud Task's partition extensions. A good option when restartability is needed.
  5. Remote chunking: Most batch jobs are I/O bound, but sometimes you need more processing power than a single JVM provides. This technique sends the actual data to remote workers, so it is only useful when processing is the bottleneck. Durable middleware is required for this option.
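To make the partitioning technique more concrete, here is a minimal sketch in plain JDK Java. It is not Spring Batch's Partitioner API; the class name PartitionSketch, the methods partition and processAll, and the gridSize parameter are hypothetical names chosen for illustration. It splits a range of record ids into near-equal slices and hands each slice to its own worker thread, assuming gridSize is greater than zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of partitioning; not the Spring Batch API.
public class PartitionSketch {

    /**
     * Splits the id range [0, total) into at most gridSize contiguous
     * half-open ranges {start, end} of near-equal size.
     */
    public static List<int[]> partition(int total, int gridSize) {
        List<int[]> ranges = new ArrayList<>();
        int base = total / gridSize;
        int remainder = total % gridSize;
        int start = 0;
        for (int i = 0; i < gridSize; i++) {
            // The first 'remainder' partitions take one extra record.
            int size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // fewer records than workers: skip empty partitions
            }
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    /**
     * Runs one worker per partition on a thread pool and returns the
     * total number of records the workers report as processed.
     */
    public static int processAll(int total, int gridSize) {
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int[] range : partition(total, gridSize)) {
                // Stand-in for real work: each worker just counts its records.
                Callable<Integer> worker = () -> range[1] - range[0];
                results.add(pool.submit(worker));
            }
            int processed = 0;
            for (Future<Integer> result : results) {
                processed += result.get();
            }
            return processed;
        } catch (Exception e) {
            throw new IllegalStateException("A partition worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real job, each worker would read, process, and write only the records in its own range, which is what lets a failed partition be restarted on its own.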

 Spring Batch Orchestration and Composition

SCDF doesn’t watch the jobs. It just shares the same DB as the batch job so that you can view the results. Once a job is launched via SCDF, SCDF itself has no interaction with the job. You can compose and orchestrate jobs by drag and drop and set dependencies between them: which jobs should run in parallel and which in sequence. The execution order can also be set when scheduling multiple jobs.
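The idea of composing jobs in sequence and in parallel can be sketched with plain JDK CompletableFuture. This is only an illustration of the dependency graph, not SCDF's API or DSL; the class JobFlowSketch and the job names extract, transformA, transformB, and load are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration of job composition; not the SCDF API.
public class JobFlowSketch {

    /** Stand-in for launching a job: records the job name in a shared log. */
    static void runJob(String name, List<String> log) {
        log.add(name);
    }

    /**
     * Runs "extract" first, then "transformA" and "transformB" in parallel,
     * then "load" once both have finished. Returns the jobs in the order
     * in which they ran.
     */
    public static List<String> runFlow() {
        List<String> log = new CopyOnWriteArrayList<>();

        // Sequence: the transforms depend on extract.
        CompletableFuture<Void> extract =
                CompletableFuture.runAsync(() -> runJob("extract", log));

        // Parallel: both transforms may start as soon as extract completes.
        CompletableFuture<Void> transformA =
                extract.thenRunAsync(() -> runJob("transformA", log));
        CompletableFuture<Void> transformB =
                extract.thenRunAsync(() -> runJob("transformB", log));

        // Join: load runs only after both transforms are done.
        CompletableFuture.allOf(transformA, transformB)
                .thenRun(() -> runJob("load", log))
                .join();

        return log;
    }
}
```

The order of transformA and transformB in the returned list can vary between runs, which is the point: only the dependencies are fixed, not the interleaving of parallel jobs.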

 Achieve Active-Active operation for High Availability(HA) between two Data Centers/AZs

There are two standard ways:

  1. Place a shared Spring Batch job repository between the two active-active DCs/AZs. Parallel sync happens in the job repository database. App data and the batch repo should live in the same schema for better synchronization, as noted above. The transaction isolation level is set by default so that only one active DC can run the job; the other fails when it tries to run the same job with the same parameters.
  2. Spring Cloud Task has built-in functionality to restrict a task to a single running instance (the spring.cloud.task.single-instance-enabled property): https://docs.spring.io/spring-cloud-task/docs/2.2.3.RELEASE/reference/#features-single-instance-enabled
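The first option relies on the shared repository treating "same job name plus same parameters" as one job instance. The sketch below imitates that rule with an in-memory map, assuming a single process stands in for the shared database. The class SharedJobRepositorySketch, its methods, and the dataCenter argument are hypothetical; the real Spring Batch JobRepository does this through database transactions and signals the conflict with an exception instead of a boolean.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a shared job repository; not the Spring Batch API.
public class SharedJobRepositorySketch {

    // Key: job name plus sorted parameters. Value: the data center that claimed it.
    private final ConcurrentHashMap<String, String> instances = new ConcurrentHashMap<>();

    /**
     * Builds a deterministic key so that the same job with the same
     * parameters always maps to the same job instance, regardless of
     * the order in which the parameters were supplied.
     */
    static String jobKey(String jobName, Map<String, String> params) {
        return jobName + new TreeMap<>(params);
    }

    /**
     * Atomically claims the job instance for the given data center.
     * Returns true if this caller may run the job, false if another
     * caller already claimed the same job with the same parameters.
     */
    public boolean tryLaunch(String dataCenter, String jobName, Map<String, String> params) {
        return instances.putIfAbsent(jobKey(jobName, params), dataCenter) == null;
    }
}
```

For example, if both data centers call tryLaunch for the same job with the same parameters, only the first call returns true. Changing a parameter, such as a run date, creates a new instance that either data center may claim.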

 Alerts and Monitoring of Spring Cloud Task and Spring Batch

  • Spring Cloud Task includes Micrometer health checks and a metrics API out of the box.
  • Plain Prometheus is not well suited to batch jobs because it uses a pull mechanism: a short-lived job may finish before it is scraped, so Prometheus cannot tell you when a job has finished or run into issues. If you want to use Prometheus for application metrics with Grafana visualization, use the Prometheus RSocket proxy: https://github.com/micrometer-metrics/prometheus-rsocket-proxy

More References: