ASM Health Checker Found 1 New Failures
The message "ASM Health Checker found 1 new failures" is a critical warning often found in Oracle Automatic Storage Management (ASM) alert logs. It typically signals that the system has detected a significant issue—such as disk corruption or a communication breakdown—that could lead to a diskgroup being forcibly dismounted.
Here is a story of a "typical" Friday night in the life of a Database Administrator (DBA) facing this error.
The Friday Night Ghost in the Machine
It was 4:45 PM on a Friday. The office was thinning out, and Leo was already thinking about his weekend plans when his terminal began to scroll with red text. The monitoring system had just spat out a single, chilling line: ASM Health Checker found 1 new failures
Leo’s heart sank. In the world of Oracle ASM, "1 new failure" is rarely just one thing; it's the tip of an iceberg.
The Investigation Begins
He dove into the alert logs. Just seconds before the health checker tripped, he saw a flurry of ORA-15130 errors: diskgroup "DATA" is being dismounted. This was the DBA equivalent of a ship taking on water.
He checked the shared storage. "It's always the hardware," he muttered. But the storage arrays looked green. He then checked the ASM Filter Driver, remembering a bug involving 4k sector drives that had caused similar headaches for peers in the past.
The Discovery
Leo ran a quick check of the diskgroup status:
Diskgroup: DATA
Status: DISMOUNTED
Cause: "Insufficient number of disks discovered"
It turned out a routine disk add operation from earlier that morning had gone sideways. A subtle corruption on metadata block 40 had been lying in wait. When the ASM rebalance operation hit that specific block, the Health Checker—a silent guardian that usually stays in the background—spotted the anomaly and pulled the emergency brake to prevent further data loss.
The Resolution
The "1 new failure" wasn't a death sentence, but it required surgery. Leo had to:
The alert "ASM Health Checker found 1 new failures" is a critical notification from Oracle's Automatic Storage Management (ASM) health monitoring system. It typically appears in the ASM alert logs or via automated email notifications when a storage-related incident is detected.
Failure Overview
This specific message indicates that the Fault Diagnosability Infrastructure has identified a new incident in the Automatic Diagnostic Repository (ADR). While "1 new failure" is a generic count, it often points to one of the following underlying issues:
- Disk Group Instability: A disk may have failed, leading to a loss of redundancy or a disk group being forced to dismount.
- Metadata Corruption: Corruption in ASM metadata blocks (typically within the first 250 blocks) detected during routine operations or rebalancing.
- Rebalance Failures: An error occurring during the addition or removal of disks, often accompanied by background process (ARB0) alerts.
- Resource State Changes: CRS (Cluster Ready Services) resources moving to an INTERMEDIATE or OFFLINE state due to storage latency or connectivity issues.
Immediate Diagnostic Actions
To identify the exact cause, execute the following steps within your environment:
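Check the ADRCI Utility: Use the ADR Command Interpreter (ADRCI) to list the problems and incidents recorded for the ASM home. ADRCI has no list failure command (that syntax belongs to RMAN), so use show problem and show incident instead. If several ADR homes exist, select the ASM one with set homepath first.
adrci> show problem
adrci> show incident
Each incident is listed with a unique incident ID, its problem key, and its creation time.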
Inspect ASM Alert Logs: Locate the log file (usually in the trace directory under your Oracle Base) to see the events leading up to the "1 new failure" message. Look for:
- ORA-15xxx errors (ASM-specific).
- SUCCESS: ALTER DISKGROUP... followed by immediate GMON dumping or failure notes.
Run Data Recovery Advisor: If the failure involves data loss or disk group mounting issues, use RMAN to get a repair recommendation:
RMAN> list failure;
RMAN> advise failure;
Query V$ Views: Verify the status of your disks and current operations:
- Disk Status: SELECT name, path, mount_status, header_status, state FROM v$asm_disk;
- Active Operations: SELECT operation, state, est_minutes FROM v$asm_operation;
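The V$ASM_DISK columns selected in the disk-status query can be triaged with a few simple rules. The sketch below is one possible way to do that in Python once the rows have been fetched; the function name, the rule set, and the sample rows are illustrative assumptions, not an Oracle-defined procedure. It flags any disk whose state is not NORMAL, any disk reported as MISSING, and any MEMBER disk that is CLOSED (which usually means its disk group is not mounted on this instance).

```python
# Hypothetical triage rules for rows returned by the V$ASM_DISK query above.
# Each row is a dict keyed by the selected column names.

def suspect_disks(rows):
    """Return (label, reasons) pairs for disks that deserve a closer look."""
    flagged = []
    for row in rows:
        reasons = []
        if row["state"] != "NORMAL":
            reasons.append("state=" + row["state"])
        if row["mount_status"] == "MISSING":
            reasons.append("mount_status=MISSING")
        elif row["mount_status"] == "CLOSED" and row["header_status"] == "MEMBER":
            reasons.append("member disk is CLOSED")
        if reasons:
            # NAME can be empty for disks that are not mounted, so fall back to PATH.
            label = row["name"] or row["path"] or "<unnamed>"
            flagged.append((label, reasons))
    return flagged


# Made-up rows for illustration only; real values come from the query.
sample_rows = [
    {"name": "DATA_0000", "path": "/dev/mapper/data0", "mount_status": "CACHED",
     "header_status": "MEMBER", "state": "NORMAL"},
    {"name": "DATA_0001", "path": None, "mount_status": "MISSING",
     "header_status": "UNKNOWN", "state": "NORMAL"},
    {"name": None, "path": "/dev/mapper/data2", "mount_status": "CLOSED",
     "header_status": "MEMBER", "state": "NORMAL"},
]

for label, reasons in suspect_disks(sample_rows):
    print(label, "->", ", ".join(reasons))
```

The rules are deliberately conservative: a flagged disk is a prompt to read the alert log around the same timestamp, not proof of a fault.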
ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It
The Automatic Storage Management (ASM) health checker is a crucial tool in Oracle databases that monitors the health and integrity of the storage infrastructure. When the ASM health checker reports a new failure, it's essential to understand the implications and take corrective actions to prevent data loss or system downtime. In this blog post, we'll discuss what an ASM health checker failure means, how to investigate the issue, and steps to resolve it.
What does an ASM health checker failure mean?
When the ASM health checker detects a problem, it logs an error message indicating that a failure has been detected. The message may look like this:
"ASM Health Checker found 1 new failures"
This message indicates that the ASM health checker has detected a single failure in the storage system. The failure could be related to various issues, such as:
- Disk errors or corruption
- Connectivity problems between the database server and storage
- Insufficient disk space or quota issues
- ASM configuration errors
Investigating the ASM health checker failure
To investigate the failure, follow these steps:
- Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message, timestamp, and affected disk group. You can find the alert log in the $ORACLE_BASE/diag/asm/+asm/<instance_name>/trace directory.
- Run the asmcmd command: The asmcmd command-line tool provides a view of the ASM configuration and status. Run asmcmd with the lsdg command to list the disk groups and their status: asmcmd lsdg
- Check the disk group status: Pass the disk group name to lsdg to restrict the output to the affected disk group: asmcmd lsdg <disk_group_name>
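Reading the alert log by hand works, but the first step above can also be scripted. The following is a minimal Python sketch, assuming the alert log has already been read into a list of lines; the function name, the 50-line look-back window, and the sample excerpt are assumptions made for illustration. It finds each health-checker message and collects the distinct ORA-15xxx codes logged shortly before it.

```python
import re

# Matches the health-checker line and captures the reported failure count.
HEALTH_CHECK = re.compile(r"ASM Health Checker found (\d+) new failures?", re.IGNORECASE)
# ASM-specific errors fall in the ORA-15000 to ORA-15999 range.
ORA_ASM = re.compile(r"ORA-15\d{3}")


def errors_before_health_check(lines, window=50):
    """For each health-checker line, list ORA-15xxx codes in the preceding `window` lines."""
    findings = []
    for idx, line in enumerate(lines):
        match = HEALTH_CHECK.search(line)
        if not match:
            continue
        codes = []
        for earlier in lines[max(0, idx - window):idx]:
            for code in ORA_ASM.findall(earlier):
                if code not in codes:  # keep first-seen order, drop repeats
                    codes.append(code)
        findings.append({
            "line": idx + 1,  # 1-based line number of the health-checker message
            "failures": int(match.group(1)),
            "codes": codes,
        })
    return findings


# Made-up excerpt for illustration; a real alert log has timestamps and more context.
sample_log = [
    "NOTE: some earlier message",
    'ORA-15130: diskgroup "DATA" is being dismounted',
    "ASM Health Checker found 1 new failures",
]

print(errors_before_health_check(sample_log))
```

A fixed line window is a simplification; on a busy instance it may be more reliable to bound the search by timestamp instead.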
Resolving the ASM health checker failure
Once you've identified the root cause of the failure, take corrective actions to resolve the issue:
- Replace a failed disk: If the failure is due to a disk error, replace the disk and re-add it to the ASM disk group.
- Check and correct connectivity: Verify that the storage connections are stable and functioning correctly.
- Free up disk space: If the failure is due to insufficient disk space, free up space by deleting unnecessary files or expanding the disk group.
- Reconfigure ASM: If the failure is due to an ASM configuration error, reconfigure ASM with the correct settings.
Best practices to prevent ASM health checker failures
To minimize the likelihood of ASM health checker failures:
- Regularly monitor ASM alerts: Check the ASM alert log on a schedule and respond promptly to any errors or warnings.
- Perform routine maintenance: Check disk space and replace failed disks as part of scheduled maintenance.
- Test and validate ASM configurations: Test and validate ASM configurations to ensure they are correct and optimal.
By understanding the causes of ASM health checker failures and taking proactive steps to prevent them, you can ensure the reliability and performance of your Oracle database storage infrastructure.
An "ASM Health Checker found 1 new failures" message in an Oracle ASM alert log signals a logged incident in the Automatic Diagnostic Repository (ADR), often caused by disk connectivity issues, failed rebalances, or metadata corruption. Immediate investigation requires using ADRCI to identify the specific incident and checking V$ASM_DISK for failed or dropped disks. Detailed diagnostic procedures are available from the Oracle Help Center.
Recommended Troubleshooting Steps
Operations teams are advised to follow this runbook. The command names and log path in it (asm-healthcheck, /var/log/asm/healthcheck.log, asm-service) are site-specific placeholders rather than utilities shipped with Oracle Grid Infrastructure, so substitute your own tooling:
- Isolate the Failed Component: Run the health checker in verbose mode to identify exactly which assertion failed.
  - Command Example: asm-healthcheck --verbose --target=all
- Check System Logs: Grep the health-check log for the most recent errors around the time of the failure.
  - Command Example: grep "ERROR" /var/log/asm/healthcheck.log | tail -n 20
- Verify Network Connectivity: Ensure the ASM instance can reach its dependent services (LDAP, Database, API Gateway). A single ping failure can trigger this alert.
- Restart Services (If Necessary): If the failure persists and impacts user traffic, initiate a graceful restart of the ASM service node.
  - Command Example: systemctl restart asm-service
1. Stale or Offline ASM Disks
The most frequent culprit. One disk in a disk group has been taken offline due to:
- OS path change (e.g., /dev/sdd becomes /dev/sde after reboot)
- Multipath failure (e.g., multipath -ll shows faulty path)
- Physical disk or controller failure
Scenario D: Compatibility Mismatch
Error example: Attribute 'compatible.asm' value '19.0.0.0.0' higher than software version '12.2.0.1.0'
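Fix: Disk group compatibility attributes can only be advanced, never lowered, so a statement such as
ALTER DISKGROUP DATA SET ATTRIBUTE 'compatible.asm' = '12.2';
will not succeed against a disk group whose compatible.asm is already 19.0.0.0.0. Either mount the disk group from a Grid Infrastructure home whose version is at or above the attribute value, or create a new disk group with the lower compatibility setting and restore the data into it.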
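The mismatch in the error example comes down to comparing two dotted version strings component by component. As a rough illustration, the Python sketch below performs that comparison; the function names are hypothetical and the logic is a plain numeric comparison written for this example, not Oracle's internal check.

```python
def parse_version(text):
    """Turn a dotted version such as '12.2.0.1.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def attribute_exceeds_software(compatible_asm, software_version):
    """Return True when the compatible.asm value is higher than the software version."""
    attr = parse_version(compatible_asm)
    soft = parse_version(software_version)
    # Pad the shorter tuple with zeros so '12.2' compares like '12.2.0.0.0'.
    width = max(len(attr), len(soft))
    attr += (0,) * (width - len(attr))
    soft += (0,) * (width - len(soft))
    return attr > soft


# Values taken from the error example above.
print(attribute_exceeds_software("19.0.0.0.0", "12.2.0.1.0"))
```

Comparing integer tuples rather than raw strings matters here: as plain text, "9.0" would sort after "12.2", which is the wrong answer for version numbers.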
Executive Summary
At [Insert Timestamp], the ASM (Automatic Storage Management) Health Checker routine completed its scheduled run. The monitoring utility flagged 1 new failure within the environment.
Unlike transient warnings, this specific failure indicates a state change from "Healthy" to "Unhealthy" for a specific component, requiring immediate triage to prevent potential service disruption or data integrity issues.
