Infrastructure Hosts Skill

Monitor and manage host and process infrastructure, including CPU, memory, disk, network, and technology inventory.

What This Skill Does

Discover and inventory hosts across cloud and on-premise environments
Monitor host resource utilization (CPU, memory, disk, network)
Track process resource consumption and lifecycle
Analyze container and Kubernetes infrastructure
Discover services via listening ports
Manage technology stack versions and compliance
Attribute infrastructure costs by cost center and product
Validate data quality and metadata completeness
Plan capacity and detect resource saturation
Correlate infrastructure health across layers

When to Use This Skill

Use this skill when the user needs to:

Inventory: "Show me all Linux hosts in AWS us-east-1"
Monitor: "What hosts have high CPU usage?"
Troubleshoot: "Which processes are consuming the most memory?"
Discover: "What databases are running in production?"
Plan: "Track Kubernetes version distribution for upgrade planning"
Cost: "Calculate infrastructure costs by cost center"
Security: "Find all processes listening on port 22"
Compliance: "Identify hosts running EOL Java versions"
Quality: "Check data completeness for AWS hosts"
Optimize: "Find rightsizing candidates based on utilization"

Core Concepts

Entities

HOST - Physical or virtual machines (cloud or on-premise)
PROCESS - Running processes and process groups
CONTAINER - Kubernetes containers
NETWORK_INTERFACE - Host network interfaces
DISK - Host disk volumes

Metrics Categories

Host Metrics - dt.host.cpu.*, dt.host.memory.*, dt.host.disk.*, dt.host.net.*
Process Metrics - dt.process.cpu.*, dt.process.memory.*, dt.process.io.*, dt.process.network.*
Inventory - OS type, cloud provider, technology stack, versions
Cost - dt.cost.costcenter, dt.cost.product
Quality - Metadata completeness, version compliance

Alert Thresholds

CPU/Memory/Disk: 80% warning, 90% critical
Network: >70% high, >85% saturated
Disk Latency: >20ms bottleneck
Network Errors: Drop rate >1%, error rate >0.1%
Swap: >30% warning, >50% critical
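
As a rough illustration of how the two-level thresholds above could be applied to values exported from a query, here is a minimal Python sketch. It is not part of the skill: the THRESHOLDS table layout, the function name classify, and the generic labels "ok", "elevated", and "severe" are assumptions made for the example (the list above uses warning/critical for most resources and high/saturated for network). Comparisons are strict, matching the ">" notation used for network and swap.

```python
# Hypothetical threshold table: resource -> (lower level, upper level), in percent.
# The numbers come from the Alert Thresholds list; the structure is illustrative.
THRESHOLDS = {
    "cpu": (80.0, 90.0),
    "memory": (80.0, 90.0),
    "disk": (80.0, 90.0),
    "network": (70.0, 85.0),
    "swap": (30.0, 50.0),
}


def classify(resource: str, percent: float) -> str:
    """Return "ok", "elevated", or "severe" for a utilization percentage."""
    lower, upper = THRESHOLDS[resource]
    if percent > upper:
        return "severe"
    if percent > lower:
        return "elevated"
    return "ok"


if __name__ == "__main__":
    for resource, value in [("cpu", 85.0), ("network", 90.0), ("swap", 10.0)]:
        print(resource, value, classify(resource, value))
```

Keeping the levels in one table means a per-resource override only touches data, not the comparison logic.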

Key Workflows

1. Host Discovery and Classification

Discover hosts, classify by OS/cloud, inventory resources.

smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, host.logical.cpu.cores, host.physical.memory
| summarize host_count = count(), by: {os.type, cloud.provider}
| sort host_count desc

OS Types: LINUX, WINDOWS, AIX, SOLARIS, ZOS

→ For cloud-specific attributes, see references/inventory-discovery.md

2. Resource Utilization Monitoring

Monitor CPU, memory, and disk utilization across hosts.

timeseries {
  cpu = avg(dt.host.cpu.usage),
  memory = avg(dt.host.memory.usage),
  disk = avg(dt.host.disk.used.percent)
}, by: {dt.smartscape.host}
| fieldsAdd host_name = getNodeName(dt.smartscape.host)
| filter arrayAvg(cpu) > 80 or arrayAvg(memory) > 80
| sort arrayAvg(cpu) desc

High utilization threshold: 80% warning, 90% critical

→ For detailed CPU analysis and memory breakdown, see references/host-metrics.md

3. Process Resource Analysis

Identify top resource consumers at process level.

timeseries {
  cpu = avg(dt.process.cpu.usage),
  memory = avg(dt.process.memory.usage)
}, by: {dt.smartscape.process}
| fieldsAdd process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(cpu) > 50
| sort arrayAvg(cpu) desc
| limit 20

→ For process I/O analysis and process network metrics, see references/process-monitoring.md

4. Technology Stack Inventory

Discover and track software technologies and versions.

smartscapeNodes "PROCESS"
| fieldsAdd process.software_technologies
| expand tech = process.software_technologies
| fieldsAdd tech_type = tech[type], tech_version = tech[version]
| summarize process_count = count(), by: {tech_type, tech_version}
| sort process_count desc

Common Technologies: Java, Node.js, Python, .NET, databases, web servers, messaging systems

→ For version compliance checks, see references/inventory-discovery.md

5. Service Discovery via Ports

Map listening ports to services for security and inventory.

smartscapeNodes "PROCESS"
| fieldsAdd process.listen_ports, dt.process_group.detected_name
| filter isNotNull(process.listen_ports) and arraySize(process.listen_ports) > 0
| expand port = process.listen_ports
| summarize process_count = count(), by: {port, dt.process_group.detected_name}
| sort toLong(port) asc
| limit 50

Well-known ports: 80 (HTTP), 443 (HTTPS), 22 (SSH), 3306 (MySQL), 5432 (PostgreSQL)

→ For comprehensive port mapping, see references/inventory-discovery.md

6. Container and Kubernetes Monitoring

Track container distribution and K8s workload types.

smartscapeNodes "CONTAINER"
| fieldsAdd k8s.cluster.name, k8s.namespace.name, k8s.workload.kind
| summarize container_count = count(), by: {k8s.cluster.name, k8s.workload.kind}
| sort k8s.cluster.name, container_count desc

Workload Types: deployment, daemonset, statefulset, job, cronjob

Note: Container image names and versions are NOT available in smartscape.

→ For K8s version tracking and container lifecycle, see references/container-monitoring.md

7. Cost Attribution and Chargeback

Calculate infrastructure costs by cost center.

smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter, host.logical.cpu.cores, host.physical.memory
| filter isNotNull(dt.cost.costcenter)
| fieldsAdd memory_gb = toDouble(host.physical.memory) / 1024 / 1024 / 1024
| summarize 
    host_count = count(),
    total_cores = sum(toLong(host.logical.cpu.cores)),
    total_memory_gb = sum(memory_gb),
    by: {dt.cost.costcenter}
| sort total_cores desc

→ For product-level cost tracking, see references/inventory-discovery.md
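
The query above returns capacity totals per cost center rather than monetary amounts. One possible way to turn those totals into a chargeback split, sketched here in Python under the assumption that cost is allocated in proportion to CPU core count, is shown below. The allocation rule, the function name allocate_by_cores, and the sample figures are all hypothetical and not taken from the skill.

```python
def allocate_by_cores(total_cost: float, cores_by_center: dict[str, int]) -> dict[str, float]:
    """Split total_cost across cost centers in proportion to their core counts."""
    total_cores = sum(cores_by_center.values())
    if total_cores == 0:
        # Nothing to allocate against; avoid dividing by zero.
        return {center: 0.0 for center in cores_by_center}
    return {
        center: round(total_cost * cores / total_cores, 2)
        for center, cores in cores_by_center.items()
    }


if __name__ == "__main__":
    # Made-up totals standing in for the total_cores column of the query result.
    sample = {"platform": 64, "payments": 32, "analytics": 32}
    print(allocate_by_cores(10_000.0, sample))
```

A memory-weighted or blended rule would follow the same shape; which weighting is appropriate depends on how the underlying infrastructure is billed.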

8. Infrastructure Health Correlation

Correlate host and process metrics for cross-layer analysis.

timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  host_memory = avg(dt.host.memory.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
| fieldsAdd
    host_name = getNodeName(dt.smartscape.host),
    process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(host_cpu) > 70
| sort arrayAvg(host_cpu) desc

Health scoring: Critical if any resource >90%, warning if >80%

→ For multi-resource saturation detection, see references/host-metrics.md

Common Query Patterns

Pattern 1: Smartscape Discovery

Use smartscapeNodes to discover and classify entities.

smartscapeNodes "HOST"
| fieldsAdd <attributes>
| filter <conditions>
| summarize <aggregations>

Pattern 2: Timeseries Performance

Use timeseries to analyze metrics over time.

timeseries metric = avg(dt.host.<metric>), by: {dt.smartscape.host}
| fieldsAdd <calculations>
| filter <thresholds>

Pattern 3: Cross-Layer Correlation

Correlate host and process metrics.

timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}

Pattern 4: Entity Enrichment with Lookup

Enrich data with entity attributes. After the lookup, reference the joined fields with the lookup. prefix.

timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| lookup [
    smartscapeNodes "HOST"
    | fields id, host.logical.cpu.cores, host.physical.memory
  ], sourceField:dt.smartscape.host, lookupField:id
| fieldsAdd cores = lookup.host.logical.cpu.cores, mem_gb = toDouble(lookup.host.physical.memory) / 1024 / 1024 / 1024

Tags and Metadata

Important Notes

Generic tags field is NOT populated in smartscape queries
Use specific tag fields: tags:azure[*], tags:environment
Use custom metadata: host.custom.metadata[*]

Available Tags

Azure Tags: tags:azure[dt_owner_team], tags:azure[dt_cloudcost_capability]
Environment: tags:environment
Custom Metadata: host.custom.metadata[OperatorVersion], host.custom.metadata[Cluster]
Cost: dt.cost.costcenter, dt.cost.product

→ For complete tag reference, see references/inventory-discovery.md

Cloud-Specific Attributes

AWS

cloud.provider == "aws"
aws.region, aws.availability_zone, aws.account.id
aws.resource.id, aws.resource.name
aws.state (running, stopped, terminated)

Azure

cloud.provider == "azure"
azure.location, azure.subscription, azure.resource.group
azure.status, azure.provisioning_state
azure.resource.sku.name (VM size)

Kubernetes

k8s.cluster.name, k8s.cluster.uid
k8s.namespace.name, k8s.node.name, k8s.pod.name
k8s.workload.name, k8s.workload.kind

→ For multi-cloud analysis, see references/inventory-discovery.md

Best Practices

Alerting

Use percentiles (p95, p99) for latency metrics
Use max() for resource limits
Use avg() for utilization trends
Set multi-level thresholds (warning at 80%, critical at 90%)
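
To see why the aggregation choice matters, consider a short series with a single spike. The Python sketch below compares the mean, the maximum, and a nearest-rank 95th percentile on made-up sample values. The helper name percentile and the nearest-rank method are choices made for this example, and a monitoring backend may interpolate percentiles differently.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = [10.0, 12.0, 11.0, 13.0, 95.0]  # one spike among steady readings
    print("avg:", mean(samples))            # the trend; the spike is diluted
    print("max:", max(samples))             # the peak; relevant for hard limits
    print("p95:", percentile(samples, 95))  # tail behavior
```

On this sample the mean stays far below the spike while the maximum and the high percentile expose it, which is the reason for pairing each aggregation with the kind of question listed above.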

Time Windows

Real-time: 5-15 minute windows
Trends: 24 hours to 7 days
Capacity planning: 30-90 days

Query Optimization

Use filters early in the pipeline
Limit results with | limit N
Use specific entity types in smartscapeNodes
Aggregate before enrichment (lookup)

Data Quality

Validate metadata completeness (target >90%)
Check for duplicate host names
Ensure cost tag coverage
Monitor data freshness (lifetime.end)
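
A completeness check of the kind described above could be computed as follows once host records have been exported. This Python sketch assumes each host is a dictionary keyed by attribute name. The function name completeness and the sample records are hypothetical, while the attribute names and the 90% target come from this document.

```python
def completeness(hosts: list[dict], field: str) -> float:
    """Percentage of hosts where `field` is present and non-empty."""
    if not hosts:
        return 0.0
    populated = sum(1 for host in hosts if host.get(field) not in (None, ""))
    return 100.0 * populated / len(hosts)


if __name__ == "__main__":
    # Made-up records; real data would come from a smartscapeNodes query export.
    hosts = [
        {"os.type": "LINUX", "cloud.provider": "aws", "dt.cost.costcenter": "cc-1"},
        {"os.type": "LINUX", "cloud.provider": "aws", "dt.cost.costcenter": None},
        {"os.type": "WINDOWS", "cloud.provider": "azure"},
        {"os.type": "LINUX", "cloud.provider": "aws", "dt.cost.costcenter": "cc-2"},
    ]
    for field in ("os.type", "cloud.provider", "dt.cost.costcenter"):
        pct = completeness(hosts, field)
        status = "ok" if pct > 90 else "below target"
        print(f"{field}: {pct:.0f}% ({status})")
```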

Limitations and Notes

Smartscape Limitations

Container image names/versions NOT available in smartscape
Generic tags field NOT populated (use specific tag namespaces)
Process metadata varies by process type

Platform-Specific

dt.host.cpu.iowait available on Linux only
AIX has specific CPU metrics (entitlement, physc)
Inode metrics available on Linux only

Best Practices

Use getNodeName() to get human-readable names
Convert bytes to GB for readability: / 1024 / 1024 / 1024
Round aggregated values: round(value, decimals: 1)
Use isNotNull() checks before array operations
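
The byte conversion and rounding tips can be mirrored outside a query as well. The Python sketch below uses a hypothetical helper bytes_to_gb. Note that dividing by 1024 three times yields binary gigabytes (GiB), which is the convention this document's queries follow. The None check plays the role of the isNotNull() guard.

```python
from typing import Optional


def bytes_to_gb(value: Optional[int], decimals: int = 1) -> Optional[float]:
    """Convert bytes to binary gigabytes, rounded; pass None through unchanged."""
    if value is None:
        return None
    return round(value / 1024 / 1024 / 1024, decimals)


if __name__ == "__main__":
    print(bytes_to_gb(16 * 1024**3))  # a host with 16 GiB of memory
    print(bytes_to_gb(None))          # missing metadata stays missing
```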

When to Load References

This skill uses progressive disclosure: start with this file, which covers roughly 80% of use cases, and load a reference file only when you need its detailed specifications.

Load host-metrics.md when:

Analyzing CPU component breakdown (user, system, iowait, steal)
Investigating memory pressure and swap usage
Troubleshooting disk I/O latency
Diagnosing network packet drops or errors

Load process-monitoring.md when:

Analyzing process-level I/O patterns
Investigating TCP connection quality
Detecting resource exhaustion (file descriptors, threads)
Tracking GC suspension time

Load container-monitoring.md when:

Analyzing container lifecycle and churn
Tracking Kubernetes version distribution
Managing OneAgent operator versions
Planning K8s cluster upgrades

Load inventory-discovery.md when:

Performing security audits via port discovery
Implementing cost attribution and chargeback
Validating data quality and metadata completeness
Managing multi-cloud infrastructure

References

host-metrics.md - Detailed host CPU, memory, disk, and network monitoring
process-monitoring.md - Process-level CPU, memory, I/O, and network analysis
container-monitoring.md - Container inventory, Kubernetes versions, and operator management
inventory-discovery.md - Host/process discovery, technology inventory, cost attribution, and data quality

