Failover Fails

  • Clearing a Non-Critical disk from OpenManage or: The Stupid Case of Installing a non-vendor Disk in an EOL Server

    Once upon a time, there was a client with a very EOL PowerEdge ESXi host. This host had a great big RAID10 array. This machine had served them well for years and years. Until one day, a big, bad wolf came by and crashed a disk!

    “Oh, my!” The client gasped. “We can’t have any data loss! Whatever shall we do?”

    “Not to fear!” A voice boomed from sight unseen.

    Suddenly, a figure with bad posture appeared. Lumbering along, shoulder bag in tow, his eye bags dragging across the floor.

    The client stared in awe at the half-ape half-man creature standing before them. “Who- Who are you?” They asked, barely able to stand upright.

    “I am the savior of your servers, the bane of vendors. I am… your sysadmin!” He attempted to strike a pose, but merely belched.

    Arms flailing, he worked with impeccable speed. Logging onto the iDRAC, confirming the failed disk, unsheathing it from its backplane like Excalibur from its stone. In the blink of an eye (10 minutes), the new disk was installed, and the RAID was rebuilding.

    The sysadmin, content with his work, signed into OpenManage, which showed the rebuild already at 20% complete!

    Later that night, after finishing some scheduled work, he decided to check on the progress of the rebuild. It had gone incredibly fast! Until the true villain of this story appeared.

    “YOU THOUGHT YOU COULD FOOL ME?” The server hardware shouted. “THIS DISK WAS NOT PURCHASED FROM DELL.” With the speed of a bullet, the hardware ejected its disk directly through the sysadmin’s monitor, knocking him to the ground.

    “Oh golly gee!” The sysadmin startled. This was the most emotion that he had shown in years. Was our hero down for the count? Not remotely, but he couldn’t do this alone.

    The sysadmin turned to his only friend, the one who was there for him at all hours. Day or night, this was truly his closest companion. A true scholar of all fields. Jack of all trades, master of all.

    “openmanage disk non-critical” He furiously typed into Google.

    “Not a genuine part!?” He exclaimed.

    “MUAHAHAHAHAHAHA” The hardware was mocking him. The sysadmin flashed back to kindergarten. Would he allow this to stand? Not remotely.

    “openmanage clear non-critical status” He posed the question to his friend.

    “Fear not, weary sysadmin! Simply edit stsvc.ini and change NonDellCertifiedFlag=yes to no!” Reddit replied to him.

    “Excellent, where do I find it?”

    “Here, take this!” Reddit, Google, and ToughTechSite threw elixirs of filepaths at him.

    But there was a problem. The filepaths provided were for Windows, and CentOS, and different versions of ESXi, oh my! Google was his closest friend, but there are some journeys that a sysadmin needs to make alone.

    He cracked his knuckles. He leaned towards his laptop screen. He opened PuTTY, and he did what he does best. He looked around and had no idea what he was doing.

    “grep, cd, ls, find, cat.” He typed into the console. Getting nowhere, he searched harder and harder for solutions from his friend. Any crumb of knowledge that could sustain him for this journey, he would take.

    The sysadmin noticed a small cave entrance off to the side. As he entered, a sign read “/etc/cim/dell/srvadmin/srvadmin-storage/”. Inside, he finally found it. The ever-elusive stsvc.ini.

    As he reached for the file, quill in hand, he was faced with an ancient language. In order to edit the file, the sysadmin could not rely on the modern comforts of nano. He had to reach deep within, and use vim.

    “UGH!” The sysadmin cried as he pulled up a cheat sheet. Navigating the cursor to the line, he deleted three letters of Y, E, and S, and replaced them with two of his favorite. N. O.

    “GAAAAAAAAAAAH” He heard the hardware scream in the distance. “BUT YOU HAVE NOT DEFEATED ME YET, YOU MUST RESTART THE SERVICE!”

    The sysadmin winced and reviewed the sacred texts provided by his companion Google. He searched high and low in the dense forest of the ESXi CLI, but srvadmin-services.sh was nowhere in sight.

    He began doubting himself. Had he installed OpenManage wrong? But it was no matter. The client needed him. And he needed to go to bed.

    He called to his companion, and after searching and searching the scholar’s knowledge, he found only one solution. The sysadmin sighed and resigned himself to his fate. After saying a prayer, he spoke the command that would fell this foul beast once and for all.

    “SERVICES.SH RESTAAAAAAAAAAAAAAAAAAAAAAAAAART!”

    The world went black. There was nothing. There was no hardware. There was no shell. There was no OpenManage. All surrounding him was darkness.

    The sysadmin quivered in fear. He feared that he had inadvertently taken down production at 9PM on a Friday. How could he have forgotten the most basic of his clan’s principles? No changes on Fridays, or you will spend all weekend fixing them.

    “NOOOOOOOOOOOOOOOOOOOO!!!!!!!!!” The cry rang out across the earth and the heavens.

    A realization hit.

    The sysadmin opened his eyes. The VM he was connected to was still active, so the host had indeed not gone down.

    Drenched in sweat, he was now fully awake. His eye bags had retracted to a normal state, bouncing against his knees. He trudged out of the forest, and returned to OpenManage.

    Disk status: Normal.

    Breathing a sigh of relief, he cursed the vendor and hung up his shoulder bag. Another day in the life of a sysadmin, and he remained undefeated.

    The actual tech stuff

    After replacing the disk, I saw the RAID was rebuilding and let it rebuild.

    It turned out the disk was showing a Non-Critical status. Some research indicated that this was because the disk was not a genuine Dell part. A genuine part was not something we could get through an RMA, because the server was EOL.

    In my previous escapades with this server, I had installed OpenManage on the host. My research led me to ToughTechSite, who had run into this exact same issue before.

    https://toughtechsite.wordpress.com/2017/12/03/the-case-of-non-certified-physical-drives-causing-warnings-in-dell-openmanage-omsa/comment-page-1/

    It was their instruction that set me on the right path, but sadly they only had instructions for Windows and CentOS, not ESXi.

    The file in this instance was located at /etc/cim/dell/srvadmin/srvadmin-storage/stsvc.ini.
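
    Since the path differs between Windows, CentOS, and ESXi versions, it may be easier to search for the file than to trust any one guide. A minimal sketch, assuming a POSIX-style shell with find available; the helper name find_by_name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenManage): print every path named $1
# under the directory $2, which defaults to / when omitted.
# Permission errors are discarded so that only matches are printed.
find_by_name() {
    name="$1"
    root="${2:-/}"
    find "$root" -name "$name" 2>/dev/null
}
```

    On the host, find_by_name stsvc.ini would search the whole filesystem. No output means nothing by that name was found.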

    The file needs to have the following line changed from yes to no:

    ;nonDellCertified flag for blocking all non-dell certified alerts.

    NonDellCertifiedFlag=yes
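
    For anyone who would rather not fight the editor, the same change can be sketched with cp and sed. This assumes the flag line looks exactly like the one above (starting at the beginning of the line, no spaces around the equals sign). The function name set_non_dell_flag_no is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: set NonDellCertifiedFlag to "no" in the given stsvc.ini.
# The original file is first copied to <file>.bak so the change can be undone.
set_non_dell_flag_no() {
    ini="$1"
    [ -f "$ini" ] || { echo "no such file: $ini" >&2; return 1; }
    cp "$ini" "$ini.bak" || return 1
    # Rewrite only the line that starts with the flag name. The comment line
    # above it starts with ';' and is left untouched.
    sed 's/^NonDellCertifiedFlag=.*/NonDellCertifiedFlag=no/' "$ini.bak" > "$ini" || return 1
    # Print the resulting flag line so the change can be checked by eye.
    grep '^NonDellCertifiedFlag=' "$ini"
}
```

    On this host the call would be set_non_dell_flag_no /etc/cim/dell/srvadmin/srvadmin-storage/stsvc.ini. Copying the .bak file back over the original restores the previous setting.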

    This started a whole other adventure, as the next step is to restart the OpenManage services. The installation on this host just didn’t have srvadmin-services.sh. Anywhere. At all. I scanned the whole filesystem as root. Nada.

    Maybe I messed up the installation; I’m not sure.

    Instead, you can run the command below to restart all of the management services on the host without a reboot:

    services.sh restart

    After I stared at the PuTTY console for a while, the command finished. My OpenManage session was killed, and then it lived! The disk was showing as good, and the RAID was no longer showing as degraded.