Thursday, August 23, 2018

Research Note #16 - WRF Errors Showcase

I've been working with the WRF model for about 4 years, which is actually not that long for someone who has been studying atmospheric science for almost 20 years. During that time, I've moved from WRF-EMS (currently known as STRC-UEMS) to WRF-ARW and finally to WRF-Chem. Although that's quite a short period, I've run into trouble with WRF so many times that I think it's time to recap those issues, not just to help my own future research, but also for others working in the same field.

FYI, I run the WRF model on UTokyo's supercomputer, Oakforest-PACS, which has 8208 compute nodes in total and a peak performance of 25 PFLOPS. Anyway, I usually use only 16 nodes with 32 MPI processes for each simulation batch job, because I think that's more than enough.

Without further ado, here are the most common errors I've experienced with the WRF model. Oh yeah, I will not list errors caused by typos in namelist.input (e.g. forgetting to match the grid resolution between namelist.input and namelist.wps, or leaving out a comma, period, etc.).

1. CFL Error


Symptoms: 
  • WRF generates messages such as: "x points exceeded cfl=x in domain d0x at time ..."
  • Simulation speed degrades or simulation completely stops.
Causes:
  • The model becomes unstable, mostly because the time step is too large for a stable solution, especially on high-resolution simulation grids.
  • Conflicts among the model physics/dynamics/domain configuration.
Solutions:
  • Decrease the time step (namelist.input > &domains > time_step). The most commonly used convention is 6*DX, with DX in kilometers. That means, if the grid resolution is 10 km, use a time step of at most 60 seconds. If the messages still appear, decrease the time step to 30 or 10 seconds.
  • Check the parameterizations/configuration in namelist.input for anything that could cause conflicts or a model crash. I usually turn off some parameterization schemes and test them individually to see whether they are the cause of the crash.
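
The 6*DX rule of thumb above is easy to put into a small helper. This is only a sketch: the function names `suggested_time_step` and `fallback_time_steps` are made up for illustration, and the 30 s and 10 s fallbacks are simply the values mentioned in the solution above, not anything WRF prescribes.

```python
def suggested_time_step(dx_km: float, factor: float = 6.0) -> int:
    """Upper bound for time_step (seconds) from the 6*DX rule of thumb.

    dx_km is the grid resolution in kilometers (hypothetical helper,
    not part of WRF).
    """
    if dx_km <= 0:
        raise ValueError("dx_km must be positive")
    return max(1, int(factor * dx_km))


def fallback_time_steps(dx_km: float, candidates=(30, 10)) -> list:
    """Starting value followed by the smaller fallbacks to try next."""
    start = suggested_time_step(dx_km)
    return [start] + [c for c in candidates if c < start]


if __name__ == "__main__":
    # 10 km grid: start at 60 s, then try 30 s and 10 s if CFL messages persist.
    print(fallback_time_steps(10))
```

The values returned are only starting points for time_step in namelist.input; the model may still need something smaller over steep terrain.
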
2. Flerchinger Messages

Symptoms:
  • WRF generates messages such as: "Flerchinger USEd in NEW version. Iterations= x"
  • Simulation still runs but the speed degrades.
  • WRF restart files are not generated.
Causes:
  • It's basically not an error, but a message generated by the Noah LSM (namelist.input > &physics > sf_surface_physics) when the model produces very low soil temperatures or negative soil moisture. I experienced this while running a simulation over Russia, with lon and lat > 55 degrees.
Solutions:
  • Change the surface physics to another option.
  • Change the input data. I changed from GFS FNL ds083.2 to ECMWF ERA-Interim, and I've never seen these messages again, even though I still use the Noah LSM.
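
To get a feeling for how often these messages (and the CFL messages from section 1) show up, the log files can be scanned with a few lines of Python. This is a rough sketch: the search patterns are taken from the message texts quoted in this note, and the function names are invented for the example.

```python
import re
from collections import Counter

# Patterns based on the message texts quoted in this note.
CFL_RE = re.compile(r"points exceeded cfl", re.IGNORECASE)
FLERCHINGER_RE = re.compile(r"Flerchinger USEd in NEW version")


def count_warnings(lines) -> Counter:
    """Count CFL and Flerchinger messages in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if CFL_RE.search(line):
            counts["cfl"] += 1
        if FLERCHINGER_RE.search(line):
            counts["flerchinger"] += 1
    return counts


def scan_file(path: str) -> Counter:
    """Count the messages in one log file, e.g. rsl.error.0000."""
    with open(path, errors="replace") as fh:
        return count_warnings(fh)
```

For example, `scan_file("rsl.error.0000")` returns a Counter with the number of lines matching each message type.
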


3. No WRF Simulation Log (rsl.out or rsl.error) Generated

Symptoms:
  • While running real.exe or wrf.exe, no simulation logs are written. Instead of going to a log file, the steps are shown on the screen (stdout).
Causes:
  • Well ... it's quite strange and a bit stupid (I've been there), but this is absolutely not an error. It happens when you compiled the WRF binaries for serial computation instead of parallel computation. The rsl.* files are written by the distributed-memory (dmpar) builds.
Solutions:
  • Recompile the model using dmpar (or dm+sm), then you'll get your logs back.

4. Real.exe Error : Interpolation Order Error


Symptoms:
  • While running real.exe, the process stops at a certain point, reporting that there are too few data for the interpolation order.
Causes:
  • I'm not so sure about this, but I think it happens when real.exe finds that the data available for vertical interpolation are inadequate for the model run. This happened when I used GFS FNL ds083.2 for a simulation over Russia, and I got many warnings about missing level data before this error occurred.
Solutions:
  • Change the data. I used ECMWF ERA-Interim and the errors were gone.

5. WRF Simulation Sudden Death

Symptoms:
  • The model crashes. real.exe and wrf.exe stop abruptly, without any error messages in the log files. They just stop.
Causes:
  • I hate this error because it might be caused by many factors, but mostly by conflicts within the model configuration. For example, I was using the WSM-3 MP parameterization scheme with RRTM schemes for LW and SW, over a high-latitude domain with complex terrain, at 10 km resolution, on several compute nodes, and many strange things happened: the model crashed many times, was only stable when running on a single node, etc.
  • In several cases, it's also caused by a time step that is too large, similar to the CFL error.
  • Sometimes it also occurs if the domain is too large, in particular when the grid size is < 10 km with complex terrain.
Solutions:
  • Change the model configuration. In my case, I switched to the Lin MP scheme with the new RRTM schemes, and the error was gone.
  • Reduce the time step by factors of 2 (first try half of the time step; if that still doesn't work, try 0.25 of the original time step, and so on).
  • Reduce the domain size.
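
The "halve it until it works" approach can be written down as a small schedule of time steps to try. A minimal sketch, assuming integer time steps in seconds; `halving_schedule` and the `min_step` cut-off are names made up here, not WRF settings.

```python
def halving_schedule(time_step: int, min_step: int = 1) -> list:
    """Return the original time step followed by successive halvings.

    Values are kept as whole seconds (integer division) and the list
    stops once the next value would fall below min_step.
    """
    if time_step <= 0 or min_step <= 0:
        raise ValueError("time_step and min_step must be positive")
    steps = []
    current = int(time_step)
    while current >= min_step:
        steps.append(current)
        current //= 2
    return steps


if __name__ == "__main__":
    # Original 60 s, then 30 s (half), then 15 s (a quarter).
    print(halving_schedule(60, min_step=10))
```

Each value in the list is a candidate for time_step in namelist.input, tried in order until the run stays alive.
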

6. Mismatch Landmask ivgtyp
Symptoms:
  • real.exe crashes with the error message:
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: LINE: 2963
mismatch_landmask_ivgtyp
-------------------------------------------

Causes:
  • So far, this has only occurred while using the ECMWF ERA-Interim dataset. The reason is unknown. It's documented on the WRF-ARW website.
Solutions:
  • Change the value of 'surface_input_source' in the &physics section of namelist.input from '3' to '1'.
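
If you switch this setting often, the edit can be scripted. The sketch below works on the text of namelist.input with a regular expression; `set_surface_input_source` is a hypothetical helper written for this note, and it assumes the variable is already present on its own line in the file.

```python
import re

# Matches "surface_input_source = <value>" at the start of a line,
# keeping the left-hand side (group 1) so the original spacing survives.
_PATTERN = re.compile(
    r"(^[ \t]*surface_input_source[ \t]*=[ \t]*)[^,\n]+",
    re.MULTILINE | re.IGNORECASE,
)


def set_surface_input_source(namelist_text: str, value: int = 1) -> str:
    """Return namelist text with surface_input_source set to value."""
    new_text, n = _PATTERN.subn(lambda m: m.group(1) + str(value), namelist_text)
    if n == 0:
        # Assumption of this sketch: fail loudly instead of adding the line.
        raise ValueError("surface_input_source not found in namelist text")
    return new_text
```

Read namelist.input into a string, pass it through the function, and write the result back (after keeping a backup copy of the original).
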

7. Ungrib.exe Segmentation Fault (End Date)

Symptoms:
  • ungrib.exe crashes while ungribbing the end date, giving the error message: "Segmentation fault ..."
Causes:
  • It probably has something to do with memory (the stack size limit), because when I set ulimit to unlimited, the problem was solved.
Solution:
  • ulimit -s unlimited 
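
When ungrib.exe is launched from a Python driver script instead of an interactive shell, something similar can be done with the standard `resource` module. This is a sketch under assumptions: it raises the soft stack limit to the hard limit, which matches `ulimit -s unlimited` only if the hard limit itself is unlimited, and the `./ungrib.exe` path and both function names are placeholders.

```python
import resource
import subprocess


def raise_stack_limit():
    """Raise the soft stack limit to the hard limit; return (soft, hard)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)


def run_ungrib(exe: str = "./ungrib.exe") -> int:
    """Run ungrib with the raised stack limit applied in the child process."""
    # preexec_fn runs in the child just before exec (POSIX only).
    result = subprocess.run([exe], preexec_fn=raise_stack_limit, check=False)
    return result.returncode
```

On a cluster, putting `ulimit -s unlimited` in the batch job script before the ungrib.exe call does the same job with less machinery.
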

8. Metgrid.exe error in ext_pkg_write_field

Symptoms:
  • metgrid.exe crashes at the beginning of the process with the message: 'ERROR: Error in ext_pkg_write_field'.
Causes:
  • This happens when new NCEP GFS data (Version 15.1 or higher) are processed using an old version of ungrib.exe (< Ver. 4).
Solution:
  • Install ungrib from WPS Ver. 4 (the old geogrid/metgrid can still be used).

2 comments:

  1. May I ask: if no errors show up when checking the second real.exe run (using WRF-Chem), but the wrfout file is not produced, what might be the reason? Thank you.

    Replies
    1. That looks like an error. Try checking the contents of rsl.out.0000 or rsl.error.0000. They will show where the error is.