ZDM troubleshooting part 1: VM causes ZDM service to crash (plus fix)


Intro

Zero Downtime Migration (ZDM) is the ultimate solution for migrating your Oracle database to Oracle Cloud. I recently started using it quite a lot during on-prem to Exadata at Customer migrations. In my last blog post, I shared tips about a ZDM installation error related to MySQL. This time, I’ll describe why my environment crashed the ZDM service every time an eval command was run, and provide the fix. This is not a bug, just unexpected behaviour caused by a not-so-clean VM host.

Acknowledgement

I’d like to thank the ZDM dev team, who chimed in to tackle this tough one after I opened an SR. The issue had never been seen before, which is why I decided to write about it.

1. My ZDM environment

VM

ZDM: 21.3 build

OS: Oracle Linux 8.4 kernel 5.4.17-2102.201.3.el8uek.x86_64

Prerequisites

After the installation, I just made sure that connectivity was in place between the ZDM host and the source/target systems (SSH and SQL*Net).
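A connectivity check like the one described above could be scripted roughly as follows. This is only a sketch: the helper names `port_open` and `report` are made up for illustration, and ports 22 (SSH) and 1521 (SQL*Net listener) are the usual defaults, not values taken from my environment. It only proves that the TCP port answers, not that authentication works.

```shell
#!/usr/bin/env bash
# Sketch: check that a TCP port is reachable from the ZDM host.
# Relies on bash's built-in /dev/tcp redirection, so no extra tools are needed.

# port_open HOST PORT [TIMEOUT_SECONDS]
# Exit status 0 if the port accepts a TCP connection within the timeout.
port_open() {
  local host=$1 port=$2 wait=${3:-5}
  timeout "$wait" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# report HOST PORT LABEL
# Prints one OK/FAIL line per check.
report() {
  if port_open "$1" "$2"; then
    echo "OK   $3 $1:$2"
  else
    echo "FAIL $3 $1:$2"
  fi
}

# Example (host names from this post; ports are assumed defaults):
# report srcHost 22   ssh
# report srcHost 1521 sqlnet
# report tgtNode 22   ssh
# report tgtNode 1521 sqlnet
```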

Steps to Reproduce the error

Prepare a response file for a physical online migration with the required parameters to reproduce the behaviour.
The parameters themselves are not important in our case.
Responsefile

$ cat physical_online_demo.rsp | grep -v ^#
TGT_DB_UNIQUE_NAME=TGTCDB
MIGRATION_METHOD=ONLINE_PHYSICAL
DATA_TRANSFER_MEDIUM=DIRECT
PLATFORM_TYPE=EXACC
..More

ZDM service started:

$ zdmservice status
---------------------------------------
        Service Status
---------------------------------------
 Running:       true
 Tranferport:
 Conn String:   jdbc:mysql://localhost:8897/
 RMI port:      8895
 HTTP port:     8896
 Wallet path:   /u01/app/oracle/zdmbase/crsdata/velzdm2prm/security


Run ZDMCLI listphases

  • So far so good: no error is thrown, because nothing is really processed in terms of checks.

$ZDM_HOME/bin/zdmcli migrate database -sourcedb SRCDB -sourcenode srcHost -srcauth zdmauth -srcarg1 user:zdmuser -targetnode tgtNode -tgtauth zdmauth -tgtarg1 user:opc -rsp ./physical_online_demo.rsp -listphases
zdmhostname: 2022-08-30T19:15:00.499Z : Processing response file ...
pause and resume capable phases for this operation: "
ZDM_GET_SRC_INFO
ZDM_GET_TGT_INFO
ZDM_PRECHECKS_SRC
ZDM_PRECHECKS_TGT
ZDM_SETUP_SRC
ZDM_SETUP_TGT
ZDM_PREUSERACTIONS
ZDM_PREUSERACTIONS_TGT
ZDM_VALIDATE_SRC
ZDM_VALIDATE_TGT
ZDM_DISCOVER_SRC
ZDM_COPYFILES
ZDM_PREPARE_TGT
ZDM_SETUP_TDE_TGT
ZDM_RESTORE_TGT
ZDM_RECOVER_TGT
ZDM_FINALIZE_TGT
ZDM_CONFIGURE_DG_SRC
ZDM_SWITCHOVER_SRC
ZDM_SWITCHOVER_TGT
ZDM_POST_DATABASE_OPEN_TGT
ZDM_DATAPATCH_TGT
ZDM_NONCDBTOPDB_PRECHECK
ZDM_NONCDBTOPDB_CONVERSION
ZDM_POST_MIGRATE_TGT
ZDM_POSTUSERACTIONS
ZDM_POSTUSERACTIONS_TGT
ZDM_CLEANUP_SRC
ZDM_CLEANUP_TGT"

Run ZDMCLI Eval command

  • The eval command runs critical prechecks that validate migration readiness, so the ZDM service is more involved here. The job is first scheduled, then the eval operation starts executing.

$ZDM_HOME/bin/zdmcli migrate database -sourcedb SRCDB -sourcenode srcHost -srcauth zdmauth -srcarg1 user:zdmuser -targetnode tgtNode -tgtauth zdmauth -tgtarg1 user:opc -rsp ./physical_online_demo.rsp -eval

Enter source database SRCDB SYS password:
zdmhostname: 2022-08-30T20:15:00.499Z : Processing response file ...
Operation "zdmcli migrate database" scheduled with the job ID "1".
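If you script these runs, the job ID can be pulled out of the "scheduled with the job ID" message shown above. The helper below is a small sketch; `extract_job_id` is a name I made up, and it assumes the message keeps the exact wording seen in this output.

```shell
# extract_job_id: read zdmcli output on stdin and print the job ID found in the
# 'scheduled with the job ID "N"' line. Prints nothing if no such line exists.
extract_job_id() {
  sed -n 's/.*scheduled with the job ID "\([0-9][0-9]*\)".*/\1/p'
}

# Example, assuming the eval output was saved to a file named eval.out:
# job_id=$(extract_job_id < eval.out)
# zdmcli query job -jobid "$job_id"
```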

ZDM service crashing

Error:
The eval command ends up crashing the service as soon as the execution kicks in.

Querying job status 

$ zdmcli query job -jobid 1
PRGT-1038: ZDM service is not running.
Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host:zdmhost; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused)]

ZDMService status: down

$ zdmservice status | grep Running
Running:   false
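Since the service kept going down silently, it helps to turn the status output into a yes/no answer that a script can act on. A minimal sketch, assuming the "Running: true/false" line format shown in this post; `zdm_running` is a hypothetical helper name.

```shell
# zdm_running: read `zdmservice status` output on stdin.
# Succeeds (exit 0) only if it contains a "Running: true" line.
zdm_running() {
  grep -Eq '^[[:space:]]*Running:[[:space:]]*true[[:space:]]*$'
}

# Example:
# "$ZDM_HOME"/bin/zdmservice status | zdm_running || echo "ZDM service is down"
```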


Troubleshooting

Trace the ZDM service  

Many things were tried to find out where this behaviour came from, including tracing the ZDM service.

export SRVM_TRACE=TRUE
export GHCTL_TRACEFILE=$ZDM_BASE/srvm.trc
$ZDM_HOME/bin/zdmservice stop
$ZDM_HOME/bin/zdmservice start
--> Re-run the eval command

  • No luck: every time the service restarted, it crashed again before I had time to run another eval.

Upgrade/reinstall ZDM  

  I also tried an upgrade to the latest build, then a full reinstall, but ZDM still crashed.

$ ./zdminstall.sh update oraclehome=$ZDM_HOME ziploc=./NewBuild/zdm_home.zip

Because the previous job was still in the queue when the ZDM service restarted, I didn’t need to run anything to crash ZDM.

Logs to check in ZDM  

Any time you open an SR for a ZDM issue, the usual place to collect logs is ZDM_BASE, using the command below (run it from $ZDM_BASE):

$ find . -iregex '.*\.\(log.*\|err\|out\|trc\)$' -exec tar -rvf out.tar {} \;


Root cause

This was hard to uncover, considering the issue had never been encountered before, but the ZDM server log gives a clue:

$ view $ZDM_BASE/crsdata/`hostname`/rhp/zdmserver.log.0
…
[DEBUG] [HASContext.<init>:129] moduleInit = 7
[DEBUG] [SRVMContext.init:224] Performing SRVM Context init. Init Counter=1
[DEBUG] [Version.isPre:804] version to be checked 21.0.0.0.0 major version to check against 10
[DEBUG] [Version.isPre:815]  isPre.java: Returning FALSE
[DEBUG] [OCR.loadLibrary:339] 17999  Inside constructor of OCR
[DEBUG] [SRVMContext.init:224] Performing SRVM Context init. Init Counter=2
[DEBUG] [OCR.isCluster:1061]  Calling OCRNative for isCluster()
[CRITICAL] [OCRNative.Native]  JNI: clsugetconf retValue = 5
[CRITICAL] [OCRNative.Native]  JNI: clsugetconf failed with error code = 5
[DEBUG] [OCR.isCluster:1065]  OCR Result status = false
[DEBUG] [Cluster.isCluster:xx] Failed to detect cluster: JNI: clsugetconf failed
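The server log is mostly DEBUG noise, so filtering on the [CRITICAL] tag is a quick way to land on lines like the clsugetconf failures above. A small sketch; `critical_lines` is a hypothetical helper name, and it assumes the log keeps one entry per line with the level in square brackets, as in this excerpt.

```shell
# critical_lines FILE: print every [CRITICAL] entry of a ZDM server log,
# prefixed with its line number so you can jump to the surrounding context.
critical_lines() {
  grep -n -F '[CRITICAL]' "$1"
}

# Example (log path as used in this post):
# critical_lines "$ZDM_BASE/crsdata/$(hostname)/rhp/zdmserver.log.0"
```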

We can see that some OCR checks were failing, and a mismatch seems to have caused the failure. But why?

What really happened?

The ZDM software (without delving into details) has bits of the Grid Infrastructure core embedded within.

  • There is a reason why the documentation asks you to make sure “Oracle Grid Infrastructure isn’t running on the ZDM service host” before the installation.

  • Here we have a failing check of the GI software version (crsctl query has releaseversion), where the expected value is 21c but the result is different. This made ZDM crash when the eval was executed.
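The documented prerequisite (no Grid Infrastructure running on the ZDM host) can be checked with a quick process scan. This is a sketch under assumptions: `gi_daemons` is a made-up helper, and the daemon names it looks for (ohasd, crsd, ocssd, evmd, with or without the .bin suffix) are the usual GI process names, not something taken from the ZDM documentation. Note that it only detects a running stack, so on its own it would not have caught the problem described in this post.

```shell
# gi_daemons: read process names on stdin (one per line) and print those that
# look like Grid Infrastructure daemons. Exit status is non-zero if none match.
gi_daemons() {
  grep -E '^(ohasd|crsd|ocssd|evmd)(\.bin)?$'
}

# Example:
# ps -eo comm= | gi_daemons && echo "Grid Infrastructure appears to be running"
```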

Why?


The ZDM VM had an oratab and a bunch of other OCR files (such as ocr.loc) under /etc/oracle that were used to perform the CRS version check. This in turn broke the ZDM service, as CRS couldn’t be detected.


The VM had leftovers from an old DB and Grid environment that hadn’t been cleaned up.

    $ cat /etc/oracle/oratab
    +ASM:/u01/app/19.0.0/grid:N
    CDB1:/u01/app/oracle/product/19.0.0/dbhome_1:W

    Solution: drop the files


    We first moved the files out of /etc/oracle, and the eval command then ran without crashing the ZDM service.
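The move itself can be sketched as below. The function names and the backup directory are made up for illustration. Of the file names checked, oratab and ocr.loc come from this post, while olr.loc is my assumption about what else a Grid install may leave behind. Moving the files aside rather than deleting them keeps a way back if something else on the host turns out to need them.

```shell
# gi_leftovers [DIR]: print the Grid/DB configuration files found in DIR
# (default /etc/oracle). oratab and ocr.loc are from this post; olr.loc is assumed.
gi_leftovers() {
  local dir=${1:-/etc/oracle} f
  for f in oratab ocr.loc olr.loc; do
    [ -e "$dir/$f" ] && echo "$dir/$f"
  done
  return 0
}

# stash_leftovers DIR BACKUP_DIR: move those files aside instead of deleting them.
stash_leftovers() {
  local dir=$1 backup=$2 f
  mkdir -p "$backup"
  gi_leftovers "$dir" | while IFS= read -r f; do
    mv -- "$f" "$backup/"
  done
}

# Example (as root; the backup path is hypothetical):
# gi_leftovers /etc/oracle
# stash_leftovers /etc/oracle /root/etc_oracle_backup
```

After moving the files, restart the ZDM service and re-run the eval command to confirm the service stays up.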

    Conclusion

    • This took days to resolve, mainly because the provisioned VM was supposed to be a fresh image of Oracle Linux 8, so it never crossed my mind to check whether a Grid configuration existed on it.

    • It all goes to show why Oracle strongly recommends installing ZDM on a dedicated host with no previous Grid installation.

    • Hope this helps anyone who runs into the same error, and reminds users to double-check their environment.

            Thank you for reading
