Some nodes failed to boot after a batch kernel upgrade


Today my friend upgraded some nodes, and a small number of them failed to boot after the upgrade.

He checked the affected systems, reinstalled the newly installed kernel, and rebooted them. After that we could boot into the systems without any issue.

I recalled that I had resolved similar issues during the last round of maintenance. At that time I did not figure out what caused them, but this time I would not let it go.

So, is there any difference between the upgrade done by our script and the manual upgrade?

The messages log files are always our best friend in such situations, and I extracted the relevant parts from both cases here:

Upgrade by our script:

Upgrade manually:

So it is clear that the first one is missing the line below:

*** Creating initramfs image file '/boot/initramfs-3.10.0-1062.12.1.el7.x86_64.img' done ***
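This comparison could also be automated. Below is a minimal Python sketch that scans log lines for the "done" marker shown above and returns the image paths that were reported as finished. The function name `initramfs_done_paths` is my own for illustration, and the sketch assumes the marker text looks exactly like the line above; the rest of each log line (timestamp, host name, process name) is ignored.

```python
import re

# Matches the completion marker exactly as it appears in the log line above.
DONE_MARKER = re.compile(
    r"\*\*\* Creating initramfs image file '(?P<path>[^']+)' done \*\*\*"
)


def initramfs_done_paths(log_lines):
    """Return the set of initramfs image paths that the log reports as done."""
    found = set()
    for line in log_lines:
        match = DONE_MARKER.search(line)
        if match:
            found.add(match.group("path"))
    return found
```

On a node, something like `initramfs_done_paths(open("/var/log/messages", errors="replace"))` would give the set to compare against the expected image path; the log location is an assumption and may differ on other systems.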

The reason is that we rebooted the nodes so quickly that the initramfs images had not finished being generated.

We upgrade the managed nodes using scripts so that we can upgrade hundreds of nodes at the same time, but we must confirm that the script has finished and the new kernel's initramfs images have been generated before we reboot them.
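A check like this can be small. Here is a rough Python sketch of a pre-reboot gate, not our actual upgrade script: it waits until the initramfs image for the new kernel exists, is non-empty, and has the same size on two consecutive polls. The function names, the timeout, and the polling interval are all assumptions for illustration, and the file name pattern `initramfs-<version>.img` follows the path seen in the log above. A stable size is only a heuristic; confirming that the upgrade script itself has exited is still the stronger check.

```python
import os
import time


def initramfs_path(kernel_version, boot_dir="/boot"):
    """Build the expected image path, e.g. /boot/initramfs-<version>.img."""
    return os.path.join(boot_dir, f"initramfs-{kernel_version}.img")


def image_size(path):
    """Return the file size in bytes, or 0 if the file does not exist."""
    try:
        return os.path.getsize(path)
    except OSError:
        return 0


def wait_for_initramfs(kernel_version, boot_dir="/boot",
                       timeout=600.0, interval=5.0):
    """Return True once the image is non-empty and its size is unchanged
    between two consecutive polls; return False if the timeout expires."""
    path = initramfs_path(kernel_version, boot_dir)
    deadline = time.monotonic() + timeout
    last_size = None
    while True:
        size = image_size(path)
        if size > 0 and size == last_size:
            return True
        last_size = size
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The reboot step would then run only when `wait_for_initramfs("3.10.0-1062.12.1.el7.x86_64")` returns True, and the node would be flagged for manual inspection otherwise.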

The case itself is really simple. What I want to share is that we should not just run something and assume it will work well; we have to do some checks to confirm the results. This simple habit will improve the stability of our scripts dramatically with acceptable effort.

