Primary DB Node Very High CPU Usage

While integrating Cloud Director with Aria Operations to provide customers with resource usage reports and billing, we noticed that the CPU load on the primary node of the Postgres DB cluster was increasing. It rose gradually and then stayed high for days, at times reaching 85 GHz of CPU usage. Looking at the load with the top command shows an abnormal number of SELECT queries that stay open all the time.
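As a complement to top, the long-running statements can also be listed from inside Postgres. The following is a minimal sketch, not something VMware support provided: it only prints a query against the standard pg_stat_activity view. The function name long_running_query and the default 5-minute threshold are illustrative assumptions; the database name vcloud is the one used later in this post.

```shell
# Sketch: print a query that lists non-idle statements older than a threshold.
# long_running_query and the "5 minutes" default are hypothetical choices.
long_running_query() {
  min_age="${1:-5 minutes}"
  cat <<SQL
SELECT pid, state, now() - query_start AS running_for, left(query, 80) AS query_text
FROM pg_stat_activity
WHERE datname = 'vcloud'
  AND state <> 'idle'
  AND now() - query_start > interval '${min_age}'
ORDER BY running_for DESC;
SQL
}
```

The function only prints SQL and does not touch the database. The printed query can be pasted into a psql session on the primary DB node to see which statements have been open the longest.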

The temporary solution to the problem is to restart the VCD service on all nodes.

systemctl restart vmware-vcd

The high CPU utilization is caused by excessive stats-activity-pool threads, stemming from an overload of activity-related records in the database tables. It is suspected that this is caused by Aria Operations.

When we contacted VMware support, they suggested the following solution.

Resolution

NOTE: The following requires services to be stopped on all cells and data to be removed from the database.

  1. Take a backup of the Cloud Director database. For more details, see Backup and Restore of VMware Cloud Director Appliance in the Cloud Director documentation.
  2. Stop the VCD service on all the cells.
    • /opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s
  3. Connect to the database.
    • sudo -i -u postgres psql vcloud
  4. Run the below commands to clean up the activity tables.
    • truncate table activity;
    • truncate table scheduled_activity_jobs;
    • truncate table activity_pc_queue;
    • truncate table activity_pc_event_queue;
    • truncate table fifo_activity_queue;
    • truncate table task_activity_queue;
    • truncate table vc_activity_queue;
    • truncate table activity_stats_queue;
    • truncate table activity_vsm_listener_queue;
  5. Start the VCD service on all the cells.
    • systemctl start vmware-vcd.service
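
Before truncating, it can be useful to see how many rows each of these tables actually holds, so the growth can be compared over time. The sketch below is an assumption-laden helper rather than part of the VMware procedure: it only generates one count query per table from step 4. The names ACTIVITY_TABLES and row_count_sql are hypothetical.

```shell
# The nine tables named in step 4 of the procedure above.
ACTIVITY_TABLES="activity scheduled_activity_jobs activity_pc_queue
activity_pc_event_queue fifo_activity_queue task_activity_queue
vc_activity_queue activity_stats_queue activity_vsm_listener_queue"

# Sketch: emit one SELECT per table that reports its row count.
# row_count_sql is a hypothetical helper; it prints SQL and runs nothing.
row_count_sql() {
  for t in $ACTIVITY_TABLES; do
    printf "SELECT '%s' AS table_name, count(*) AS row_count FROM %s;\n" "$t" "$t"
  done
}
```

The generated statements can be saved to a file and run with psql -f against the vcloud database, or pasted into the psql session opened in step 3.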

5 responses to “Primary DB Node Very High CPU Usage”

  1. Justinas

    Hi,

    I hope you’re doing well. I wanted to check if truncating the tables resolved your issue permanently. We’re experiencing a similar problem, but we don’t have the VCD and Aria Operations integration. Unfortunately, truncating the tables only reduces the primary VCD node's CPU load for 1-2 months before it gradually returns to 100%.

    We’ve had this issue since 2023, I believe after a VCD update.

    Any insights or suggestions would be greatly appreciated.

    1. Aigars

      You are right, it was a temporary solution and the problem has not gone away. We have had countless requests to technical support, but VMware has not been able to fix this problem so far. We are currently on the latest version 10.6.1 and I can confirm that the problem is still here. When running the top command, you can see an abnormal number of SQL requests that just hang here and do nothing. At the moment, the only workaround is to restart the vcd service from time to time.

    2. Aigars

      What version of VCD are you using?

      1. Justinas

        We are also on the latest version, 10.6.1. We’ve had multiple support cases with VMware, and currently have one active. This time, they acknowledged that something seems to be wrong and have escalated our ticket to their engineering team. Hopefully, they’ll be able to provide some useful insights or a resolution.

  2. Nhut Phan

    Dear @Aigars

    >When looking at the load with the TOP command, you can see an abnormal number of selects hanging in the air all the time.

    Could you please share a screenshot of the abnormal tasks?
    I also have very high CPU usage, but only on the application cells, and I have to restart the cell from time to time. (vCD version 10.5.1)


I’m Aigars

Welcome to Virtualisation Alley, my cozy corner of the internet dedicated to VMware. Here, I invite you to join me on a journey into the virtual world. Let’s go.