
Forums : General Topics : VM job unmanageable, restarting later...
doug

Joined: 30 Oct 17
Posts: 5
Credit: 1,009,590
RAC: 0
Message 22913 - Posted: 1 Dec 2021, 22:08:48 UTC

Hi,

Now that I'm getting some C@H jobs again, I've suddenly started getting this error in the last few days:

Task camb_boinc2docker_3006907_1638361207.322645_0 postponed for 86400 seconds: VM job unmanageable, restarting later.

Anyone have any idea what this means? I've tried waiting, which did nothing, and I've tried aborting such jobs, but then I get the same error on at least some succeeding jobs.

Thanks.

Doug
Jonathan

Joined: 27 Sep 17
Posts: 190
Credit: 8,338,009
RAC: 5
Message 22915 - Posted: 2 Dec 2021, 2:55:29 UTC - in response to Message 22913.  

The vboxwrapper is losing communication with the VM, or not getting a timely response from it. The stuck tasks will start back up after 24 hours. Vboxwrapper runs at a lower priority than the other processes and sometimes gets starved.
I looked at a few of your completed jobs. You have it set to use 1 CPU per task. How many concurrent jobs are you running? You can try cutting down on that. Are you running other VBox projects, or other BOINC projects?
I find that I can't run VBox-related tasks with the computer using 100% of its processing. It also chokes when too many VBox tasks are starting and stopping at once.
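If you want to leave some processing headroom, the "Use at most X% of the CPUs" computing preference is the usual knob. In global_prefs_override.xml (in the BOINC data directory) that looks roughly like the sketch below - the 75% figure is just an example, not something tuned for your machine:

<global_preferences>
    <!-- leave a core or two free so vboxwrapper and the VMs don't get starved -->
    <max_ncpus_pct>75.0</max_ncpus_pct>
</global_preferences>

Anything set in that file overrides the matching website preference; the manager's "Read local prefs file" option (or a client restart) makes it take effect.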
doug

Joined: 30 Oct 17
Posts: 5
Credit: 1,009,590
RAC: 0
Message 22916 - Posted: 2 Dec 2021, 3:31:34 UTC - in response to Message 22915.  

Hi Jonathan,

I am running tasks from multiple projects, though a couple of them are GPU only, so presumably are not part of the problem here.

In addition to C@H, I run Rosetta@Home (which uses VBox for some tasks), LHC@Home (which does use VBox, though I have that project suspended for now), and World Community Grid OpenPandemics - COVID-19 (which I believe does NOT use VBox).

I have the BOINC computing preferences "use no more than x GB" for disk space set to 40GB, but I have plenty so can increase that to anything reasonable.

I have the BOINC computing preferences for Memory "when compute is not in use, use at most x%" set to 90% - I'm not sure if that is what you are asking about. My machine has 16G of RAM.

Back before the pandemic (when I switched to running only Rosetta and WCG), I ran C@H for a couple of years with no problems that I remember. I'm not sure what would have changed to cause this issue.

I'd really appreciate any other suggestions you can make.

Thanks.

Doug
Jonathan

Joined: 27 Sep 17
Posts: 190
Credit: 8,338,009
RAC: 5
Message 22917 - Posted: 2 Dec 2021, 4:22:28 UTC - in response to Message 22916.  

VirtualBox version 5 seemed to work better, and it was rare to see this issue. With version 6 and above, we started to see more of these types of errors.
I use project preferences and/or app_config.xml to control the VirtualBox-related tasks. It takes a bit of work to find the right balance of how many tasks, and what kind, you can run from each project.

Try setting the Cosmology@Home preferences for Max # jobs and Max # cpus to control this project.
Max # cpus sets the number of cores assigned to each VM.

I have 8 physical cores on my Ryzen 1700, and I am running two two-core Cosmology tasks along with one four-core LHC/ATLAS task. SMT/hyperthreading is on, so I am only seeing about 50% processor usage.

Try Max # jobs set to 1 and Max # cpus set to 1 or 2, and see if it is stable running that way. You can always increase Max # jobs later to run more tasks concurrently. I don't think you will be able to run two two-core VM tasks without errors, even if you suspend other projects temporarily.
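For what it's worth, my own Cosmology setup works out to roughly the following in app_config.xml terms. This is only a sketch - I set mine through the website preferences, and I'm assuming the camb_boinc2docker app name from your task names and the vbox64_mt multithreaded plan class:

<app_config>
    <app>
        <name>camb_boinc2docker</name>
        <!-- run two Cosmology VMs at once -->
        <max_concurrent>2</max_concurrent>
    </app>
    <app_version>
        <app_name>camb_boinc2docker</app_name>
        <plan_class>vbox64_mt</plan_class>
        <!-- give each VM two cores -->
        <avg_ncpus>2</avg_ncpus>
    </app_version>
</app_config>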
doug

Joined: 30 Oct 17
Posts: 5
Credit: 1,009,590
RAC: 0
Message 22918 - Posted: 2 Dec 2021, 6:14:51 UTC - in response to Message 22917.  

I'm not sure how your recommendations translate to app_config.xml. This is mine for C@H:

<app_config>
    <app>
        <name>camb_boinc2docker</name>
        <max_concurrent>3</max_concurrent>
    </app>
    <app_version>
        <app_name>camb_boinc2docker</app_name>
        <plan_class>vbox64_mt</plan_class>
        <avg_ncpus>1</avg_ncpus>
    </app_version>
</app_config>

Where does "Max # jobs" fit into this? And is "Max # cpus" the same as <max_concurrent>?

Thanks.

Doug
Jonathan

Joined: 27 Sep 17
Posts: 190
Credit: 8,338,009
RAC: 5
Message 22919 - Posted: 2 Dec 2021, 6:41:14 UTC - in response to Message 22918.  

Max # cpus = avg_ncpus

Max # jobs = max_concurrent

If you are already using app_config.xml, just stick with that.

From what you posted, your app_config should be running three jobs, each using a single core in its VM.
It may also make a difference where your BOINC data directory lives: hard drive vs. SSD vs. NVMe. I haven't really played around with that too much, though, as I don't have much space on my NVMe drive; my data directory is on a hard drive.
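If you want to try the settings I suggested through your existing app_config.xml, something like this should do it - a sketch only, reusing the app name and plan class from your file; bump avg_ncpus to 2 if you want to try two-core VMs later:

<app_config>
    <app>
        <name>camb_boinc2docker</name>
        <!-- start with one Cosmology VM at a time -->
        <max_concurrent>1</max_concurrent>
    </app>
    <app_version>
        <app_name>camb_boinc2docker</app_name>
        <plan_class>vbox64_mt</plan_class>
        <!-- one core per VM; raise this once it runs stably -->
        <avg_ncpus>1</avg_ncpus>
    </app_version>
</app_config>

After saving it, use the manager's Options > Read config files (or restart the client) so BOINC picks up the change.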