Message boards : CMS Application : CMS computation error
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,258,609
RAC: 0
Message 28929 - Posted: 20 Feb 2017, 14:32:56 UTC

I'm experiencing CMS failures/dropouts on three of my PCs after about 11 minutes run-time.

Theory Simulation is doing fine (so far - 4:30 hours of crunching).

QUESTION:
Which Version of VBox should be used?

The LHC download page states that release V5.0.18 should be used!
I also read (I don't remember where) NOT to use a newer version.
Yet the message boards say to use at least V5.1!

ANSWER: ??
ID: 28929
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,473,035
RAC: 9,898
Message 28930 - Posted: 20 Feb 2017, 14:51:19 UTC - in response to Message 28929.  

The CMS job queue was allowed to drain for a server upgrade, so your tasks weren't finding any jobs to run. See the News item from last week. The upgrade is over now, and I've just submitted a new batch of jobs. They should be available soon.

As far as I'm aware, the latest version of VirtualBox is now OK to use. There were some problems a while back but I believe that's all been sorted now.
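
For anyone unsure which VirtualBox release a host is actually running, here is a minimal Python sketch. It assumes that `VBoxManage` is on the PATH and that `VBoxManage --version` prints something like `5.1.14r112924`. The helper names and the sample version strings are illustrative only and are not taken from the project's documentation.

```python
import re
import subprocess


def parse_vbox_version(text):
    """Turn a string such as '5.1.14r112924' into a tuple like (5, 1, 14)."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def installed_vbox_version():
    """Ask VBoxManage for its version (assumes VirtualBox is on the PATH)."""
    result = subprocess.run(
        ["VBoxManage", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vbox_version(result.stdout)


def is_at_least(version, minimum):
    """Compare two version tuples, e.g. (5, 1, 14) against (5, 1, 0)."""
    return version >= minimum
```

Calling `is_at_least(installed_vbox_version(), (5, 1, 0))` would then show whether the host is on the 5.1 series or later, which is the threshold mentioned earlier in this thread.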
ID: 28930
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,860,999
RAC: 2,330
Message 28931 - Posted: 20 Feb 2017, 14:52:55 UTC

Your CMS errors occur because the project drained its pool of CMS jobs (which are different from BOINC CMS tasks) for maintenance.

Read from the News section -> https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4124

On CERN's Join Us! webpage, the VirtualBox install link points to the VirtualBox Downloads page,
where you'll find the newest version.

The BOINC package from Berkeley includes an older VirtualBox.
ID: 28931
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,473,035
RAC: 9,898
Message 28933 - Posted: 20 Feb 2017, 16:05:00 UTC - in response to Message 28930.  

The CMS job queue was allowed to drain for a server upgrade, so your tasks weren't finding any jobs to run. See the News item from last week. The upgrade is over now, and I've just submitted a new batch of jobs. They should be available soon.

Hmm, there are jobs in the queue but my VMs aren't downloading any. I've pinged Laurence.
ID: 28933
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 28934 - Posted: 20 Feb 2017, 17:18:09 UTC - in response to Message 28933.  

I just picked up one a couple of hours ago, and it is sitting in my buffer.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=57985030
ID: 28934
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,860,999
RAC: 2,330
Message 28935 - Posted: 20 Feb 2017, 17:39:37 UTC - in response to Message 28934.  

I just picked up one a couple of hours ago, and it is sitting in my buffer.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=57985030

Getting a BOINC CMS-task is not the same as getting a job into your running CMS-VM.
ID: 28935
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 882
Credit: 747,391,919
RAC: 323,695
Message 28937 - Posted: 20 Feb 2017, 18:08:17 UTC

My feedback would be to drain the BOINC queues too, if possible. It's less confusing for users.
ID: 28937
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 28938 - Posted: 20 Feb 2017, 18:12:14 UTC - in response to Message 28935.  

Getting a BOINC CMS-task is not the same as getting a job into your running CMS-VM.

Yes, I see. I normally don't bother to install the extension pack to check under Linux, but they are ending after 13 minutes, so no go.
ID: 28938
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,473,035
RAC: 9,898
Message 28941 - Posted: 20 Feb 2017, 21:55:41 UTC - in response to Message 28938.  

Getting a BOINC CMS-task is not the same as getting a job into your running CMS-VM.

Yes, I see. I normally don't bother to install the extension pack to check under Linux, but they are ending after 13 minutes, so no go.

Yes, sorry, we have 700 jobs in the queue and another 300 created, but they are not being sent out. I don't think there's anything else I can do from here but wait for some response from the CERN crew.
ID: 28941
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,860,999
RAC: 2,330
Message 28945 - Posted: 21 Feb 2017, 11:14:09 UTC - in response to Message 28941.  

... we have 700 jobs in the queue and another 300 created, but they are not being sent out. I don't think there's anything else I can do from here but wait for some response from the CERN crew.


Finally got one running: wmagent_ireid_MonteCarlo_eff_IDR_CMS_Home_170220_154632_5171/b8e42212-f825-11e6-b3b7-02163e018309-512_0

However, I see that after the first 10 tries were aborted, the other 690 were cancelled.
Do you have a green light for the 300?
ID: 28945
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,473,035
RAC: 9,898
Message 28946 - Posted: 21 Feb 2017, 13:11:13 UTC - in response to Message 28945.  

... we have 700 jobs in the queue and another 300 created, but they are not being sent out. I don't think there's anything else I can do from here but wait for some response from the CERN crew.


Finally got one running: wmagent_ireid_MonteCarlo_eff_IDR_CMS_Home_170220_154632_5171/b8e42212-f825-11e6-b3b7-02163e018309-512_0

However, I see that after the first 10 tries were aborted, the other 690 were cancelled.
Do you have a green light for the 300?

We had two problems caused by changes to the server. I aborted one batch as it was "doomed to fail" anyway and we've started on another batch that I'd also submitted yesterday. Some of our monitors are not showing jobs properly yet (Dashboard, of course...) but we have a queue maintained at 700 jobs and are up to over 90 running jobs as far as I can see. I'm running a total of 14 jobs on my machines -- should be more but the scheduler on the 12-core machine isn't asking for more than 6 jobs; "Not needed".
ID: 28946
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 468
Credit: 214,945,248
RAC: 46,614
Message 28948 - Posted: 21 Feb 2017, 13:26:10 UTC - in response to Message 28946.  

I'm running a total of 14 jobs on my machines -- should be more but the scheduler on the 12-core machine isn't asking for more than 6 jobs; "Not needed".

Try increasing the BOINC work buffer.

My preferences show up in German; translated, the two settings are:

Store at least (days)
Store up to an additional (days)


Supporting BOINC, a great concept !
ID: 28948
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,473,035
RAC: 9,898
Message 28949 - Posted: 21 Feb 2017, 14:02:18 UTC - in response to Message 28948.  

Thanks, that did it. I hadn't realised they were set to 0.50 and 0.01 days. Changed the minimum to 2.5 days and four more tasks immediately downloaded and started (I have the machine set to run 10 CMS tasks; single-core as it's LHC@Home, not -dev).
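
The same two work-buffer settings can, as far as I know, also be set locally through a `global_prefs_override.xml` file in the BOINC data directory. The Python sketch below only builds the XML text. The tag names `work_buf_min_days` and `work_buf_additional_days` are my understanding of what the BOINC client reads, so treat them as an assumption and check them against your client's documentation. The function name is hypothetical, and the values 2.5 and 0.01 are simply the ones mentioned in this post.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(min_days, additional_days):
    """Return override XML text for the two work-buffer preferences.

    Tag names are assumed to match the BOINC client's preference names.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


# Example: a minimum buffer of 2.5 days plus 0.01 additional days.
override_xml = build_prefs_override(2.5, 0.01)
```

After saving the text to the override file, the client needs to be told to reread its preferences (for example through the Manager or `boinccmd`) before the new buffer size takes effect.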
ID: 28949
maeax

Joined: 2 May 07
Posts: 2278
Credit: 178,709,076
RAC: 90,167
Message 29150 - Posted: 9 Mar 2017, 21:27:55 UTC
Last modified: 9 Mar 2017, 21:28:35 UTC

This CMS task ended with an error after 12 hours:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=124403989

I will test it again with VirtualBox 5.1.16.
ID: 29150
maeax

Joined: 2 May 07
Posts: 2278
Credit: 178,709,076
RAC: 90,167
Message 34389 - Posted: 16 Feb 2018, 9:07:21 UTC

This CMS task shows the message "WU not found", yet it has not been deleted from the server:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=176726472
Thank you for your help.
ID: 34389
maeax

Joined: 2 May 07
Posts: 2278
Credit: 178,709,076
RAC: 90,167
Message 34425 - Posted: 20 Feb 2018, 8:36:46 UTC

Nils told us in the News forum that these tasks are finished now.
Thank you!
ID: 34425
Erich56

Joined: 18 Dec 15
Posts: 1909
Credit: 145,004,303
RAC: 77,000
Message 34436 - Posted: 21 Feb 2018, 6:08:43 UTC

For the first time, a CMS task errored out, after 6+ hours, with

196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED

What does this mean?

Here is the Stderr: https://lhcathome.cern.ch/lhcathome/result.php?resultid=179062024
ID: 34436
Harri Liljeroos
Joined: 28 Sep 04
Posts: 781
Credit: 60,045,758
RAC: 45,483
Message 34439 - Posted: 21 Feb 2018, 8:34:59 UTC - in response to Message 34436.  

For the first time, a CMS task errored out, after 6+ hours, with

196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED

What does this mean?

Here is the Stderr: https://lhcathome.cern.ch/lhcathome/result.php?resultid=179062024

Usually this means that the project has set the disk limit parameter too low for this task. It also can be that the task actually has a fault.
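
To see how close a task gets to its limit, one rough approach is to measure the size of its working directory and compare it with the bound. This is only a Python sketch under stated assumptions: `disk_bound_bytes` is a hypothetical stand-in for whatever per-task limit the project has set, and the directory passed in would be the task's slot directory, whose location depends on your installation. The constant simply restates the exit code from the error line above.

```python
import os

# 196 decimal == 0xC4 hexadecimal, matching the error line quoted above.
EXIT_DISK_LIMIT_EXCEEDED = 0xC4


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def exceeds_disk_bound(path, disk_bound_bytes):
    """Return True if the directory currently uses more than the given bound."""
    return directory_size_bytes(path) > disk_bound_bytes
```

If the directory keeps growing towards the bound while the task runs normally, the limit is probably set too low; if it jumps suddenly, the task itself is the more likely culprit.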
ID: 34439
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,473,035
RAC: 9,898
Message 34440 - Posted: 21 Feb 2018, 10:01:23 UTC - in response to Message 34439.  

For the first time, a CMS task errored out, after 6+ hours, with

196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED

What does this mean?

Here is the Stderr: https://lhcathome.cern.ch/lhcathome/result.php?resultid=179062024

Usually this means that the project has set the disk limit parameter too low for this task. It also can be that the task actually has a fault.

If you haven't done so already, check your disk usage in boincmgr, and adjust parameters in Options -> Computing Preferences... -> Disk and memory if necessary.
ID: 34440
Erich56

Joined: 18 Dec 15
Posts: 1909
Credit: 145,004,303
RAC: 77,000
Message 34444 - Posted: 21 Feb 2018, 11:19:20 UTC - in response to Message 34440.  

If you haven't done so already, check your disk usage in boincmgr, and adjust parameters in Options -> Computing Preferences... -> Disk and memory if necessary.

This was the first thing I checked, although I would have been surprised if that were the reason. Disk and memory usage are set to almost the maximum available.
So I'll wait and see whether this problem comes up again.
ID: 34444