VM job unmanageable, restarting later...
Author | Message |
---|---|
doug Joined: 30 Oct 17 Posts: 5 Credit: 1,009,590 RAC: 0 |
Hi, Now that I'm getting some C@H jobs again, just in the last few days I've suddenly started getting this error: Task camb_boinc2docker_3006907_1638361207.322645_0 postponed for 86400 seconds: VM job unmanageable, restarting later. Anyone have any idea what this means? I've tried waiting, which did nothing, and I've tried aborting such jobs, but then I get the same error on at least some subsequent jobs. Thanks. Doug |
Jonathan Joined: 27 Sep 17 Posts: 190 Credit: 8,338,009 RAC: 5 |
The Vboxwrapper is losing communication with the VM or not getting a timely response. The stuck tasks will start back up after 24 hours. Vboxwrapper runs at a lower priority than the other processes and sometimes gets starved. I looked at a few of your completed jobs; you have it set to use 1 CPU per task. How many concurrent jobs are you running? You can try cutting down on that. Are you running other VBox projects or other BOINC projects? I find that I can't run VBox-related tasks with the computer at 100% processor usage. It also chokes when too many VBox tasks are changing state at once, i.e. starting and stopping. |
doug Joined: 30 Oct 17 Posts: 5 Credit: 1,009,590 RAC: 0 |
Hi Jonathan, I am running tasks from multiple projects, though a couple of them are GPU-only, so presumably they are not part of the problem here. In addition to C@H, I run Rosetta@Home (which uses VBox for some tasks), LHC@Home (which does use VBox, though I have that project suspended for now), and World Community Grid OpenPandemics - COVID-19 (which I believe does NOT use VBox). I have the BOINC computing preference "use no more than x GB" for disk space set to 40 GB, but I have plenty, so I can increase that to anything reasonable. I have the BOINC memory preference "when computer is not in use, use at most x%" set to 90% - I'm not sure if that is what you are asking about. My machine has 16 GB of RAM. Back before the pandemic (when I switched to only Rosetta and WCG), I ran C@H for a couple of years with no problems that I remember, so I'm not sure what would have changed to cause this issue. I'd really appreciate any other suggestions you can make. Thanks. Doug |
Jonathan Joined: 27 Sep 17 Posts: 190 Credit: 8,338,009 RAC: 5 |
VirtualBox version 5 seemed to work better, and it was rare to see this issue; with version 6 and above we started to see more of these errors. I use project preferences and/or app_config.xml to control the VirtualBox-related tasks. It takes a bit of work to find the right balance of how many tasks, and of what kind, you can run from each project. Try setting the Cosmology@Home preferences for Max # jobs and Max # cpus to control this project; Max # cpus sets the number of cores assigned to each VM. I have 8 true cores on my Ryzen 1700, and I am running two two-core Cosmology tasks along with one four-core LHC/ATLAS task. SMT/hyperthreading is on, so I am only seeing about 50% processor usage. Try Max # jobs set to 1 and Max # cpus set to 1 or 2, and see if it is stable running that way. You can always increase Max # jobs later to run more tasks concurrently. I don't think you will be able to run two two-core VM tasks without errors even if you suspend other projects temporarily. |
doug Joined: 30 Oct 17 Posts: 5 Credit: 1,009,590 RAC: 0 |
I'm not sure how your recommendations translate to the app_config.xml. This is mine for C@H:

```xml
<app_config>
  <app>
    <name>camb_boinc2docker</name>
    <max_concurrent>3</max_concurrent>
  </app>
  <app_version>
    <app_name>camb_boinc2docker</app_name>
    <plan_class>vbox64_mt</plan_class>
    <avg_ncpus>1</avg_ncpus>
  </app_version>
</app_config>
```

Where does "Max # jobs" fit into this? And is "Max # cpus" the same as <max_concurrent>? Thanks. Doug |
Jonathan Joined: 27 Sep 17 Posts: 190 Credit: 8,338,009 RAC: 5 |
Max # cpus = avg_ncpus. Max # jobs = max_concurrent. If you are already using app_config.xml, just stick with that. From what you posted, your app_config should be running three jobs, each using a single core in its VM. It may also make a difference where your BOINC data directory is: hard drive vs. SSD vs. NVMe. I haven't really played around with that too much, though, as I don't have much space on my NVMe; I have my data directory on a hard drive.
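To make the mapping concrete, here is a sketch of what my earlier suggestion (one job, two cores per VM) would look like in your file; it reuses the app name and plan class from the file you posted:

```xml
<app_config>
  <app>
    <name>camb_boinc2docker</name>
    <!-- "Max # jobs": run at most one Cosmology VM task at a time -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>camb_boinc2docker</app_name>
    <plan_class>vbox64_mt</plan_class>
    <!-- "Max # cpus": assign two cores to each VM -->
    <avg_ncpus>2</avg_ncpus>
  </app_version>
</app_config>
```

After editing, restart the BOINC client or use Options > Read config files in BOINC Manager so the change takes effect. |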