Solving MongoDB Upgrade Error: “trap invalid opcode ip”


Introduction

While working with one of our clients, we ran into an issue while upgrading from MongoDB 4.4 to MongoDB 5.0. After installing the new MongoDB 5.0 packages, the MongoDB services would not start at all. The mongod binary crashed immediately on launch, as if it were incompatible with the host’s x86_64 CPU even though we had installed the correct packages for that architecture, and every start attempt logged an “invalid opcode” trap in dmesg:

[root@my-stconfd03 PoC : ~]$ dmesg | grep mongo
[83968.937874] traps: mongod[7958] trap invalid opcode ip:5581dcce7e8a sp:7fff89a5bd60 error:0 in mongod[5581d8ad1000+5502000]
[84585.920334] traps: mongod[8466] trap invalid opcode ip:564306a6ce8a sp:7ffe11ee1790 error:0 in mongod[564302856000+5502000]
[84889.251127] traps: mongod[8513] trap invalid opcode ip:55ac0e60be8a sp:7ffc39496f80 error:0 in mongod[55ac0a3f5000+5502000]
[85035.943714] traps: mongod[9394] trap invalid opcode ip:56087e690e8a sp:7ffece80d000 error:0 in mongod[56087a47a000+5502000]
[85117.375184] traps: mongod[9673] trap invalid opcode ip:55f56fdb8e8a sp:7ffc0034ca20 error:0 in mongod[55f56bba2000+5502000]
[85189.783938] traps: mongod[10027] trap invalid opcode ip:560c1ed45e8a sp:7ffe951d7d70 error:0 in mongod[560c1ab2f000+5502000]
[85246.442257] traps: mongod[10712] trap invalid opcode ip:555a12cd0e8a sp:7ffe1b221330 error:0 in mongod[555a0eaba000+5502000]
[85284.273771] traps: mongod[12079] trap invalid opcode ip:55d394688e8a sp:7fff09d7bbb0 error:0 in mongod[55d390472000+5502000]
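
One detail worth pulling out of these lines: the absolute addresses change on every start, but the faulting instruction pointer (`ip`) minus the load address of the binary (the first number inside `mongod[...]`) is always the same. A minimal sketch of that arithmetic follows; the function name `trap_offset` is our own for illustration and is not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the offset of the faulting instruction
# inside the binary, computed as ip minus the mapping base address.
trap_offset() {
    local line="$1" ip base
    # Instruction pointer: the hex value after "ip:"
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip:\([0-9a-f]*\) .*/\1/p')
    # Mapping base: the hex value between "[" and "+" after " in "
    base=$(printf '%s\n' "$line" | sed -n 's/.* in [^[]*\[\([0-9a-f]*\)+.*/\1/p')
    # Bail out if the line does not look like a trap line
    [ -n "$ip" ] && [ -n "$base" ] || return 1
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

# Sample line copied from the dmesg output above
trap_offset "[83968.937874] traps: mongod[7958] trap invalid opcode ip:5581dcce7e8a sp:7fff89a5bd60 error:0 in mongod[5581d8ad1000+5502000]"
# prints 0x4216e8a
```

Running it on the other lines gives the same offset, which suggests that mongod hits the same unsupported instruction on every start rather than crashing at random.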

Troubleshooting Steps

Our initial findings led us to the following threads:

Based on those threads, we compared the host’s processor against Sandy Bridge, the oldest Intel microarchitecture that MongoDB 5.0 supports because it was the first to provide AVX, and found the following:

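  • Current processor: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz (Cascade Lake)
  • Sandy Bridge launched on January 9, 2011.
  • Cascade Lake launched on April 2, 2019.
  • Cascade Lake is therefore about 8 years newer than Sandy Bridge, so the physical CPU should support AVX.
  • The mongod service failed with code=exited, status=132. By shell convention 132 is 128 + 4, meaning the process was terminated by signal 4 (SIGILL, illegal instruction), which matches the “invalid opcode” traps in dmesg.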
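
The status code in the last bullet can be decoded with a few lines of shell. This is a minimal sketch that assumes the usual convention where an exit status above 128 means "terminated by signal (status minus 128)"; `describe_exit_status` is a hypothetical helper name of our own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate an exit status into a readable form.
# Statuses above 128 are treated as "killed by signal (status - 128)".
describe_exit_status() {
    local status="$1"
    if [ "$status" -gt 128 ]; then
        local signum=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exit code $status"
    fi
}

describe_exit_status 132
# prints: signal 4 (SIGILL)
```

SIGILL is what the kernel sends to a process that tries to execute an instruction the CPU does not recognise, which points at a CPU feature mismatch rather than a configuration or data problem.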

During our research, we also found these threads useful for understanding the issue further:

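The “Flags” line of the “lscpu” output on this host does not list avx, even though a Cascade Lake CPU supports AVX. Note that the host is a VMware guest, and a hypervisor can hide CPU features from the virtual machines it runs: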

[root@my-stconfd03 PoC : ~]$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       2
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           26
Model name:                      Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
Stepping:                        4
CPU MHz:                         2893.202
BogoMIPS:                        5786.40
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        44 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm pti ssbd ibrs ibpb stibp tsc_adjust arat flush_l1d arch_capabilities
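
Scanning that long flags line by eye is error-prone, so the same check can be scripted. The sketch below assumes a Linux host where `/proc/cpuinfo` is readable; the function names `has_cpu_flag` and `cpu_flags` are our own for illustration. Matching whole words matters here, because a plain substring search for `avx` would also match `avx2` or `avx512f`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if FLAG appears as a whole word in the
# space-separated FLAGS string.
has_cpu_flag() {
    local flag="$1" flags="$2"
    case " $flags " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Flags of the first CPU as reported by the kernel (Linux only).
cpu_flags() {
    grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2
}

if has_cpu_flag avx "$(cpu_flags)"; then
    echo "AVX is visible to this host"
else
    echo "AVX is NOT visible to this host; mongod 5.0 is expected to fail with an invalid opcode trap"
fi
```

Running a check like this on every node before installing the 5.0 packages would have flagged the problem ahead of the upgrade.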

Solution

The following articles provide ways to solve the issue:

We solved this issue by enabling the AVX CPU flag at the hypervisor/BIOS level so that the guest could see it. After that, the upgrade to MongoDB 5.0 completed successfully.
