Wavefront is an APM tool that also goes beyond APM: it provides real-time monitoring and alerting for your modern cloud-native microservice applications, infrastructure, VMs, and K8s clusters, across multi-cloud and on-prem environments at any scale. Traditional tools and environments make it challenging and time-consuming to correlate data and get the visibility through a single pane of glass, or dashboard, that is needed to resolve incidents in seconds in a critical production environment. Wavefront is a unified solution with analytics (including AI) that ingests, visualizes, and analyzes metrics, traces, histograms, and span logs, so you can resolve incidents faster across cloud applications.
It can work with existing open-source monitoring solutions like Prometheus, Grafana, and Graphite
It integrates with almost all popular monitoring solutions on VMs and containers, Spring Boot, Kubernetes, messaging platforms such as RabbitMQ, databases, etc.
It monitors container and VM stats
It captures traces, usage, and performance of all microservice APIs, with a topology view provided by its powerful service discovery features
In this blog, I will give a quick introduction to TSM and cover a couple of use cases and real-world challenges that it can solve:
What is Tanzu Service Mesh (TSM)?
Radically simplify the process of connecting, protecting, and monitoring your microservices across any runtime and any cloud with VMware Tanzu Service Mesh. Provide a common policy and infrastructure for your modern distributed applications and unify operations for Application Owners, DevOps/SREs and SecOps without disrupting developer workflows.
Tanzu Service Mesh is a K8s operator-side microservice orchestration tool that manages service discovery, traffic, mTLS-secured payloads, rate limiting, circuit breaking, telemetry, and observability of VMs and microservices across multiple clouds. Open-source service mesh technologies like Istio exist to help overcome some of the challenges around building microservices, such as service discovery, mutual TLS (mTLS), resiliency, and visibility. However, maintaining and managing a service mesh like Istio is challenging, especially at scale.
It provides unified management, global policies, and seamless connectivity across complex, multi-cluster mesh topologies managed by disparate teams. It provides app-level observability across services deployed to different clusters, complementing/integrating into modern observability tools you use or are considering.
TSM Global Namespace Architecture
As of now, only this enterprise product offers a global namespace that spans multiple K8s clusters across multiple clouds. Open-source Istio doesn’t provide this feature.
TSM Use Cases
Service discovery across multiple Kubernetes clusters, namespaces, or clouds using a global namespace (GNS)
Distributed Microservice Discovery on multi-cloud
Traffic Monitoring and API communication tracing
Logging and K8s Infra Monitoring with admin dashboard visualization
Rate Limiting with the help of Redis
Business Continuity (BC)
Developers are responsible for providing all service-related configuration through boilerplate code
Netflix OSS APIs (Eureka service discovery, Zuul API gateway, Ribbon load balancing, caching, etc.) and Hystrix (circuit breaker) are legacy and have no enterprise support; they are also tightly coupled with the application source code
Open-source Istio has no enterprise support as of now
VMware Tanzu Mission Control provides a single pane of glass to easily provision and manage Kubernetes clusters and operate modern, containerized applications across multiple clouds and clusters. It works as a management cluster, or Kubernetes control plane, that provisions and manages the worker/data nodes of multiple clusters. This includes deploying and upgrading clusters; setting RBAC, security, and other policies and configurations; monitoring the health of clusters (VMs and K8s); and helping to find the root cause of underlying production issues.
TMC Use Cases
Multi-cloud management of on-prem, public, hybrid cloud
Centralized Control Plane for provisioning K8s cluster for public cloud and on-prem
Centrally operates and manages all your Kubernetes clusters and applications at scale
App and service management
Enables developers with self-service access to Kubernetes for running and deploying applications
Manage security and configuration easily and efficiently through a powerful policy engine, including RBAC and inspection
Spring Cloud Task is complementary to Spring Batch.
A Spring Batch job can be exposed as a Spring Cloud Task.
Spring Cloud Task makes it easy to run any Java/Spring microservice application that does not need the robustness of the Spring Batch APIs.
Spring Cloud Task integrates well with Spring Batch and Spring Cloud Data Flow (SCDF). SCDF provides batch orchestration and a UI dashboard to monitor Spring Cloud Tasks.
In a nutshell, all Spring Batch services can be exposed/registered as Spring Cloud Tasks for better control, monitoring, and manageability.
Best practices for Spring Batch:
Use an external file system (Volume Services) to persist large files on PCF/PAS, because of its file system limitations. Refer to this link.
Always use the SCDF abstraction layer with its UI dashboard to manage, orchestrate, and monitor Spring Batch applications.
Always use Spring Cloud Task with Spring Batch for additional batch functionality.
Always register and implement vanilla Spring Batch applications as Spring Cloud Tasks in SCDF.
Use Spring Cloud Task when you need to run a finite workload via a simple Java microservice.
For High Availability (HA), implement the horizontal scaling technique best suited to your use case on containers (K8s), chosen from the top scaling techniques.
For large production systems, use SCDF as an orchestration layer with Spring Cloud Task to manage a large number of batch jobs over large data sets.
App data and batch repo should live in the same schema for transaction synchronization.
Spring Batch Auto-scaling (both vertically and horizontally)
Vertical Scaling: This is straightforward. Hardware or pod size can be increased at any time, based on CPU and RAM usage, for better performance and reliability. As you give the process more RAM, you can typically increase the chunk size, which will typically increase overall throughput, but this doesn’t happen automatically.
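To see why a larger chunk size tends to help, here is a minimal sketch in plain Java. It is not the Spring Batch API; `ChunkSketch`, `processInChunks`, and the doubling "processor" are hypothetical stand-ins. It only counts how many write/commit cycles a run needs for a given chunk size, on the assumption that each chunk is written and committed as one transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the Spring Batch API.
class ChunkSketch {

    // Processes the items chunk by chunk and returns how many "commits"
    // (one per chunk) were needed for the whole run.
    static int processInChunks(List<Integer> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int commits = 0;
        List<Integer> chunk = new ArrayList<>(chunkSize);
        for (Integer item : items) {
            chunk.add(item * 2);      // stand-in for the item processor
            if (chunk.size() == chunkSize) {
                commits++;            // stand-in for write + transaction commit
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            commits++;                // final, partially filled chunk
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            items.add(i);
        }
        System.out.println("chunk size 10  -> commits: " + processInChunks(items, 10));
        System.out.println("chunk size 100 -> commits: " + processInChunks(items, 100));
    }
}
```

With 1000 items, a chunk size of 10 needs 100 commits while a chunk size of 100 needs only 10. The trade-off is that each chunk is held in memory until it is written, which is why more RAM is what makes a larger chunk size practical.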
Horizontal Scaling: There are five popular techniques. Watch this YouTube video for details and refer to this GitHub code –
Multi-threaded Steps – Each transaction/chunk is executed by its own thread. State is not persisted, so this is only an option if you don’t need restartability.
Parallel Steps – Multiple independent steps run in parallel via threads.
Single-JVM AsyncItemProcessor/AsyncItemWriter – ItemProcessor calls are executed within a Java Future. The AsyncItemWriter unwraps the result of the Future and passes it to a configured delegate to write.
Partitioning – Data is partitioned then assigned to n workers that are being executed either within the same JVM via threads or in external JVMs launched dynamically when using Spring Cloud Task’s partition extensions. A good option when restartability is needed.
Remote Chunking – Most batch jobs are I/O bound, but sometimes you need more processing power than a single JVM can offer. Remote chunking sends the actual data to remote workers, so it is only useful when processing is the bottleneck. Durable middleware is required for this option.
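The partitioning technique from the list above can be sketched in plain Java. This is not Spring Batch's own partitioning API. The names `PartitionSketch`, `partition`, `processRange`, and `runPartitioned` are hypothetical, and the "work" is just summing IDs. It only covers the same-JVM variant: the ID range is split into contiguous partitions and each partition is handed to a worker thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not the Spring Batch API.
class PartitionSketch {

    // Splits the inclusive ID range [min, max] into at most gridSize contiguous partitions.
    static List<long[]> partition(long min, long max, int gridSize) {
        List<long[]> ranges = new ArrayList<>();
        long total = max - min + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        for (long start = min; start <= max; start += size) {
            ranges.add(new long[] {start, Math.min(start + size - 1, max)});
        }
        return ranges;
    }

    // Stand-in for a worker step: "processes" every ID in its own range.
    static long processRange(long from, long to) {
        long sum = 0;
        for (long id = from; id <= to; id++) {
            sum += id;
        }
        return sum;
    }

    // Runs one worker per partition on a thread pool and combines the results.
    static long runPartitioned(long min, long max, int gridSize) {
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] range : partition(min, max, gridSize)) {
                results.add(pool.submit(() -> processRange(range[0], range[1])));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("total = " + runPartitioned(1, 1000, 4));
    }
}
```

Because each worker owns a disjoint range, a failed partition can in principle be re-run on its own without touching the others, which is one way to read the restartability advantage mentioned in the list.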
Spring Batch Orchestration and Composition
SCDF doesn’t watch the jobs. It just shares the same DB as the batch jobs, so you can view the results. Once a job is launched via SCDF, SCDF itself has no interaction with the job. You can compose and orchestrate jobs by drag and drop, set dependencies between jobs, choose which jobs run in parallel and which in sequence, and set the execution order when scheduling multiple jobs.
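The idea of sequential and parallel dependencies between jobs can be illustrated with a small sketch in plain Java. This is not how SCDF implements composition. The job names (`extract`, `load-a`, `load-b`, `report`) and the class `CompositionSketch` are hypothetical. It runs one job first, two jobs in parallel after it, and a final job only once both parallel jobs have finished.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration only; this is not the SCDF API.
class CompositionSketch {

    // Runs "extract", then "load-a" and "load-b" in parallel, then "report".
    // Returns the order in which the jobs actually ran.
    static List<String> runFlow() {
        List<String> log = new CopyOnWriteArrayList<>();

        // Sequential dependency: nothing else starts before "extract" finishes.
        CompletableFuture<Void> extract =
                CompletableFuture.runAsync(() -> log.add("extract"));

        // Parallel split: both loads depend only on "extract".
        CompletableFuture<Void> loadA = extract.thenRunAsync(() -> log.add("load-a"));
        CompletableFuture<Void> loadB = extract.thenRunAsync(() -> log.add("load-b"));

        // Join: "report" runs only after both loads have completed.
        CompletableFuture.allOf(loadA, loadB)
                .thenRun(() -> log.add("report"))
                .join();

        return log;
    }

    public static void main(String[] args) {
        System.out.println(runFlow());
    }
}
```

The two load jobs may appear in either order in the result, but `extract` is always first and `report` is always last.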
Achieve Active-Active operation for High Availability (HA) between two Data Centers/AZs
There are two standard ways:
Place a shared Spring Batch job repository between the two active-active DCs/AZs. Synchronization happens in the job repository database. App data and the batch repo should live in the same schema for better synchronization, as noted above. With the default transaction isolation level, only one of the active DCs can run a given job; the other DC’s attempt to run the same job with the same parameters fails.
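The "only one data center wins" behavior can be sketched in plain Java under a simplifying assumption: the shared repository is modeled as an in-memory map, and its atomic `putIfAbsent` plays the role of the database enforcing uniqueness on job name plus parameters. The class `SharedRepositorySketch`, its methods, and the job and data center names are all hypothetical. Spring Batch's real job repository does this through database transactions rather than a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; this is not the Spring Batch JobRepository.
class SharedRepositorySketch {

    // Stand-in for the shared job repository table.
    // Key = job name + identifying parameters, value = data center that owns the run.
    private final Map<String, String> jobInstances = new ConcurrentHashMap<>();

    // Returns true if this data center won the right to run the job instance,
    // false if another data center has already registered the same job + parameters.
    boolean tryLaunch(String jobName, String parameters, String dataCenter) {
        String key = jobName + "|" + parameters;
        // putIfAbsent is atomic, like a unique constraint checked inside a transaction.
        return jobInstances.putIfAbsent(key, dataCenter) == null;
    }

    // Returns the data center that owns the given job instance, or null if none does.
    String owner(String jobName, String parameters) {
        return jobInstances.get(jobName + "|" + parameters);
    }

    public static void main(String[] args) {
        SharedRepositorySketch repo = new SharedRepositorySketch();
        System.out.println("dc-east launches: " + repo.tryLaunch("settlement", "date=2024-01-01", "dc-east"));
        System.out.println("dc-west launches: " + repo.tryLaunch("settlement", "date=2024-01-01", "dc-west"));
        System.out.println("owner: " + repo.owner("settlement", "date=2024-01-01"));
    }
}
```

In this sketch the second launch with the same job name and parameters is rejected, while a launch with different parameters would be accepted as a new job instance.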
Alerts and Monitoring of Spring Cloud Task and Spring Batch
Spring Cloud Task includes Micrometer health checks and a metrics API out of the box.
Plain Prometheus is not well suited to batch jobs, because it uses a pull mechanism and so cannot tell when a job has finished or has run into issues. If you want to use Prometheus for application metrics with Grafana visualization, use the Prometheus RSocket Proxy: https://github.com/micrometer-metrics/prometheus-rsocket-proxy