Message boards : Number crunching : Errors while computing on CONDOR Cluster

Athmos

Joined: 29 Jan 13
Posts: 3
Credit: 33,434
RAC: 0
Message 25349 - Posted: 7 Feb 2013, 11:35:28 UTC

Hi everyone!

I am running 100 instances of BOINC on a CONDOR computer cluster, but my BOINC instances seem to crash for no apparent reason, and my credit for jobs is minimal (at most 12 credits).
Does anyone have any idea how to solve this problem? There would be huge computing potential on this cluster: it has about 200 free slots most of the time, potentially more.


Info about the setup:

Installed is BOINC 6.12.34 (because 7.* needs a newer glibc), and I start my instances with the following command:

./boinc --no_gui_rpc --allow_multiple_clients --exit_after_app_start 864000

The CONDOR cluster assigns one slot for every core on a node, so more than one client should be able to run on a multicore node. The --exit_after_app_start parameter is only there as a safety net (864,000 seconds = 10 days).

Error logs:
http://pastebin.com/tHjFtbC0

Typical hardware of a node (there are a few other types):

GenuineIntel
Intel(R) Core(TM) i5-2400S CPU @ 2.50GHz [Family 6 Model 42 Stepping 7]
(4 processors)
ID: 25349
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 805
Credit: 650,649,682
RAC: 269,030
Message 25362 - Posted: 7 Feb 2013, 23:04:58 UTC

It's possible that the task really did run for only 12 seconds, e.g.:

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=6460914

It also seems like BOINC can't access the files properly?

I can't help much with the UNIX side of things.
ID: 25362
S. Dagorath

Joined: 7 Feb 13
Posts: 19
Credit: 1,478
RAC: 0
Message 25363 - Posted: 7 Feb 2013, 23:31:43 UTC - in response to Message 25349.  

I think each BOINC instance needs to have its own data directory. Each instance maintains a state file named client_state.xml, so if you have more than one instance pointed at the same data directory, they all try to read/write the same state file, which of course does not work. You could try something like:

# start BOINC instance #1
./boinc --dir /boinc/instance_1/ --gui_rpc_port <port> --no_gui_rpc --allow_multiple_clients --exit_after_app_start 864000

# start BOINC instance #2
./boinc --dir /boinc/instance_2/ --gui_rpc_port <port> --no_gui_rpc --allow_multiple_clients --exit_after_app_start 864000

# start BOINC instance #3
./boinc --dir /boinc/instance_3/ --gui_rpc_port <port> --no_gui_rpc --allow_multiple_clients --exit_after_app_start 864000


If you know bash scripting, you would want to use a loop to increment the instance number and the port number (more on the port number below).
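A minimal sketch of such a loop (the /boinc base path, the instance count, and the base port are placeholders; it also drops --no_gui_rpc so that the per-instance port is actually usable, see point 3 below):

#!/bin/bash
# Start 100 BOINC clients, each with its own data directory and its
# own GUI RPC port. 31416 is the default RPC port, so start above it.
BASE_PORT=31416
for i in $(seq 1 100); do
    mkdir -p /boinc/instance_${i}
    ./boinc --dir /boinc/instance_${i}/ \
            --gui_rpc_port $((BASE_PORT + i)) \
            --allow_multiple_clients \
            --exit_after_app_start 864000 &
done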

Three other things to consider:

1) The --exit_after_app_start parameter might not be necessary. Each task received by the BOINC client is tagged with a maximum duration, and if a task reaches that maximum, BOINC automatically aborts it; it aborts the task, that is, not itself.

2) This project is frequently "dry", which means it frequently has no tasks to send. The dry spells can last days or even weeks. You should consider attaching to at least one other project that has a steady supply of tasks. Since you would have about 200 instances running, I would suggest a very stable, mature, trouble-free project such as Numberfields@home to reduce the amount of babysitting. ABC@home is also very stable, but they have many dry spells too. The T4T project is this project's sister project; it is sponsored by CERN and assists the work at the LHC. But be careful with T4T, because there have been some problems with the VM, more so on Windows than on Linux. They have recently solved the worst problems, and it is far more stable than it was. Be sure to research it first and try it on a few instances of BOINC before deploying across 200 instances. If it works for you, it will work very well and be very stable. Like I said, there is less trouble with it on Linux than on Windows. T4T has a constant supply of work too.

3) The --no_gui_rpc parameter... if you don't allow GUI RPC, then you have no way to control the clients. If you use it for security reasons, then look at the GUI RPC password in the gui_rpc_auth.cfg file, and also implement the remote_hosts.cfg file, in which you list the IP addresses of the remote hosts that are allowed to connect to the BOINC client. Any address not in the list is blocked. You can also put hostnames in remote_hosts.cfg, but of course they must resolve somehow to an address, usually by being included in /etc/hosts. Each client instance would need to be started with its own port number, as illustrated in the example script above. The least recommended method of monitoring/controlling the client instances is BOINC Manager. A recommended way is the boinccmd tool, as it is a CLI and therefore scriptable. There is also a highly recommended ncurses-based app named boinctui, which you might find very useful for monitoring but perhaps not so much for controlling (scripted boinccmd for that). Also, there is a Windows GUI app named BOINCtasks that is like BOINC Manager but can monitor and control multiple clients simultaneously. I have never used it, but they say it runs very well on Wine.
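A minimal sketch of the two files (the password, addresses, and port are placeholders): gui_rpc_auth.cfg holds the RPC password as a single line, e.g.

mySecretPassword

and remote_hosts.cfg lists the allowed hosts, one per line:

192.168.1.10
monitor.example.edu

A host on that list could then query an instance with boinccmd, something like:

./boinccmd --host node01:31417 --passwd mySecretPassword --get_tasks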

Are you aware of the official BOINC wiki?

ID: 25363
Athmos

Joined: 29 Jan 13
Posts: 3
Credit: 33,434
RAC: 0
Message 25394 - Posted: 12 Feb 2013, 0:44:12 UTC

Thanks for your advice; I will now start every instance in its own directory.

The --no_gui_rpc flag is in because the client wouldn't start without it, and it doesn't matter anyway: once a job is submitted to the cluster, I have no control over it (I could ssh to the job if it's running, but there are at least 100 jobs). I can kill it, but nothing else. This is also why I have the --exit_after_app_start parameter set to 10 days, since I'm still testing.

I think I found another important problem:
I read some more about CONDOR and BOINC (yes, I found the wiki), and it seems like BOINC detects the SSE level of the CPU, and lhcathome (and other projects) have their WUs optimised for different instruction sets.
But CONDOR reassigns an instance of a job (a BOINC instance) to another cluster node if one is free. I think this leads to cases where a WU wants to compute with an instruction set that is not available.
I googled and fought through wikis to see whether I could restrict CONDOR to specific instruction sets, but this doesn't seem to work (the documentation is a bit lousy). The oldest CPU is an "Intel Core 2 6600"; all the others are "Intel Core i5-2400S". I guess the Core 2 6600 has the smaller instruction set, so could I just limit BOINC somehow to only use WUs with an instruction set that is available on both CPUs? Or any other ideas?

As soon as it runs stably, I will add another project. Suggestions?
ID: 25394
Christoph

Joined: 25 Aug 05
Posts: 69
Credit: 306,627
RAC: 0
Message 25418 - Posted: 13 Feb 2013, 21:14:07 UTC - in response to Message 25394.  

I guess the Core 2 6600 has the smaller instruction set, so could I just limit BOINC somehow to only use WUs with an instruction set that is available on both CPUs? Or any other ideas?

Maybe with the 'Anonymous Platform' mechanism; it is meant for running third-party optimised programs.
You will need to take care of the program version yourself, since this blocks the automatic upgrade when the project offers a new program.

As soon as it runs stably, I will add another project. Suggestions?

That depends on your interests. Have a look here.

If there is another way, somebody will point it out.
Christoph
ID: 25418
S. Dagorath

Joined: 7 Feb 13
Posts: 19
Credit: 1,478
RAC: 0
Message 25421 - Posted: 13 Feb 2013, 23:25:31 UTC - in response to Message 25418.  

I'm not a CONDOR expert. Never used one, never even seen one. I saw a giant condor bird in a zoo once, and that's the closest I've been to a CONDOR ;-)

In spite of that limitation on my part, I think we can arrive at a few answers strictly from first principles without being experts; for example, the basic need for each BOINC instance to run in its own directory.

Now this thing with instruction sets... hmmmm... Christoph might be right; I'm not sure. I only want to correct one small misunderstanding, and that is about the tasks. I think the tasks are probably all the same; in other words, no matter which instruction set the CPU has, it will get the same tasks. The difference is in the various applications. Those are listed here.

At that point my understanding is deficient, so now I am asking more than I am suggesting. I don't think we can use an app_info.xml to tell the server which app version to send. I might be completely mistaken, and someone please correct me if I am wrong, but I think what Athmos must do is manually download the standard application (the one that is not optimised for any instruction set), place it in the proper directory where BOINC will find it (probably .../BOINC/projects/lhcathomeclassic.cern.ch_sixtrack), then use app_info.xml to tell BOINC to use that app instead of the one the project server sends. Otherwise, I think the BOINC client reports the CPU instruction set extensions to the project server, and the server then uses that info to determine which app to send.
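If the app_info.xml route works, it is roughly shaped like this; note that the app name, file name, and version number below are only guesses for illustration and would have to match what the project actually ships:

<app_info>
    <app>
        <name>sixtrack</name>
    </app>
    <file_info>
        <!-- the manually downloaded standard executable; name is a guess -->
        <name>sixtrack_standard_linux64</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>sixtrack</app_name>
        <version_num>451</version_num>
        <file_ref>
            <file_name>sixtrack_standard_linux64</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>

The file goes in the project directory mentioned above, and the client has to be restarted before it takes effect.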
ID: 25421
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 25422 - Posted: 13 Feb 2013, 23:52:11 UTC - in response to Message 25421.  

OK, on the strict understanding that we're talking facts, not opinions (like in the adjacent thread), here we go on instruction sets.

It's clear from the applications page previously linked that the apps supplied here fall into four groups:

* SSE2
* SSE3
* PNI
* None of the above

Now look at the Wikipedia page for SSE3, Streaming SIMD Extensions 3. Quoting:

"Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU" and "SSE3 ... also known by its Intel code name Prescott New Instructions (PNI)"

In other words, the different apps here reflect technology developments (and they were very important) that took place almost a decade ago.

The Q6600 - I've been running mine since soon after their launch in 2007 - has all of the above, and then some. The i5 has all that the Q6600 has, and then some more. [Anybody who is having difficulty getting to sleep can Google SSSE3, SSE4.1, AVX and any number of other TLAs.]
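If anyone wants to verify what a given Linux node actually supports, the CPU flags are listed in /proc/cpuinfo; a throwaway one-liner such as this prints the relevant extensions:

grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|pni|ssse3|avx)' | sort -u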

In short, whatever was causing Athmos's tasks to fail, it wasn't an instruction set deficiency. Every application here will run on every CPU he's listed, with headroom to spare.
ID: 25422
Athmos

Joined: 29 Jan 13
Posts: 3
Credit: 33,434
RAC: 0
Message 25423 - Posted: 14 Feb 2013, 1:39:39 UTC - in response to Message 25422.  

Thanks all, that really helped!

I should have known this, and now I feel stupid ;).

I created separate directories for every instance and added SETI@home as a second project; for now it seems to run stably.

Actually the cluster looks quite boring; a Condor cluster is usually not a dedicated cluster. This means that every computer on my campus is a node and works on a Condor job as long as nobody is logged on. But the workstations run 24/7, and most of the time nobody is logged on.

The official description of HTCondor:
"HTCondor is a specialized workload management system for compute-intensive jobs"


ID: 25423
S. Dagorath

Joined: 7 Feb 13
Posts: 19
Credit: 1,478
RAC: 0
Message 25425 - Posted: 14 Feb 2013, 2:14:36 UTC - in response to Message 25394.  
Last modified: 14 Feb 2013, 2:16:48 UTC

The oldest CPU is an "Intel Core 2 6600"; all the others are "Intel Core i5-2400S". I guess the Core 2 6600 has the smaller instruction set


Ack!!! And I swallowed it hook, line, and sinker. Sorry, I should have known better.

Glad it works, but I am really sorry to hear you've attached it to SETI, a project whose chances of success are so close to zero that we may as well just call it zero, when there are so many projects whose chances are much higher and which will give us something useful if they succeed. The most SETI can ever do is confirm what we already intuitively "know": that there are, or have been, other intelligent life forms. This is a fact, not an opinion: SETI's methodology is flawed to the point of being asinine. If there were nothing else to spend unused CPU cycles on, then do SETI, but that is not the case.

Thanks for the info on CONDOR clusters.
ID: 25425
