Message boards : Theory Application : New version 263.90

ritterm
Joined: 30 May 08
Posts: 93
Credit: 5,160,246
RAC: 0
Message 37503 - Posted: 3 Dec 2018, 16:59:12 UTC - in response to Message 37498.  

However, even several hours after you wrote your posting, I got the "no subtasks" error many times...

Maybe I'm just lucky or am missing the point, but I have a Theory task that's been running on my 8-core AMD for over 8 hours. If it wasn't getting any jobs, wouldn't it crash, eventually?
ID: 37503
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37504 - Posted: 3 Dec 2018, 17:08:11 UTC

Why is the task bucket being filled up again if there are still (or again) no jobs?

A few minutes ago, I had another task that failed with the "no subtasks" error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=211171810

And again, I don't understand why Theory is not being stopped until all these permanently recurring problems are solved.
ID: 37504
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37505 - Posted: 3 Dec 2018, 17:26:22 UTC - in response to Message 37504.  

A few minutes ago, I had another task that failed with the "no subtasks" error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=211171810
And here is the next one, which failed a minute ago:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=211172940
ID: 37505
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,502
RAC: 136,849
Message 37507 - Posted: 3 Dec 2018, 17:37:25 UTC

At the moment it seems to be sheer luck whether a fresh VM gets a subtask.
ID: 37507
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37513 - Posted: 3 Dec 2018, 21:08:08 UTC - in response to Message 37507.  

At the moment it seems to be sheer luck whether a fresh VM gets a subtask.
And this has become standard procedure now? Seems like it :-(
ID: 37513
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,504,188
RAC: 3,842
Message 37515 - Posted: 4 Dec 2018, 0:46:31 UTC

https://lhcathome.cern.ch/lhcathome/results.php?userid=5472

I have well over 100 tasks with "Compute error - EXIT_INIT_FAILURE - Condor exited after 728s without running a job".

And I would have had many more if I hadn't come home and suspended all of them.
ID: 37515
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37516 - Posted: 4 Dec 2018, 5:58:47 UTC

Here, all tasks on all my machines failed with the "no subtasks" error throughout the night.

For example:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10452404
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10542973
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10544654

This is frustrating and annoying.
As I stated before (and was then accused by some co-crunchers of not being respectful enough): either something is going awfully wrong at LHC@home from a technical point of view, or they simply don't have the experts needed to fix problems that have been occurring for weeks now.

We crunchers dedicate our equipment, our time, and our electricity - for nothing :-(
ID: 37516
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,192,791
RAC: 103,819
Message 37519 - Posted: 4 Dec 2018, 7:03:23 UTC - in response to Message 37516.  
Last modified: 4 Dec 2018, 7:12:59 UTC

We crunchers dedicate our equipment, our time, and our electricity - for nothing :-(

You can do other BOINC work if this problem persists for some time.

Edit:
In the past Laurence had a thread titled:
Respect my limit!
We all hope that they find a solution, but
this needs TIME.
ID: 37519
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37520 - Posted: 4 Dec 2018, 7:58:00 UTC - in response to Message 37519.  

We all hope that they find a solution, but this needs TIME.
And until the solution is found, it would make sense to shut down the Theory subproject. What sense does it make to send out thousands of tasks that error out all the time?
ID: 37520
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 37521 - Posted: 4 Dec 2018, 9:45:55 UTC - in response to Message 37520.  
Last modified: 4 Dec 2018, 10:52:43 UTC

In the past the Theory app has been quite stable and, according to MCPlots, was returning ~1K CPU hours. The number of jobs in progress was ~2K and our queue was ~3K, leaving a 1K job buffer. This morning we had 6976 jobs in progress and so hit our 7K queue limit. This has now been increased to 8K. The issue is that MCPlots is still reporting only ~1K CPU hours returned. Looking at the jobs in progress per host, there do not seem to be any hosts acting as a black hole and sucking up all the jobs.
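
The queue figures quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `queue_headroom` is made up here, and it assumes that "buffer" simply means the queue limit minus the jobs in progress.

```python
def queue_headroom(limit: int, in_progress: int) -> int:
    """Jobs that can still be handed out before the queue limit is hit."""
    return max(limit - in_progress, 0)


# Figures quoted in the post (the first pair is approximate: ~3K and ~2K)
print(queue_headroom(3000, 2000))  # earlier steady state
print(queue_headroom(7000, 6976))  # this morning, at the old 7K limit
print(queue_headroom(8000, 6976))  # after raising the limit to 8K
```

Under that assumption the headroom had shrunk from about 1K jobs to a couple of dozen before the limit was raised.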
ID: 37521
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,502
RAC: 136,849
Message 37523 - Posted: 4 Dec 2018, 11:05:51 UTC - in response to Message 37521.  

Some comments, mainly to see whether I understand the process.

This morning we had 6976 jobs in progress and so hit our 7K queue limit. This has now been increased to 8K.

7 k was the #tasks limit that can be seen here:
https://lhcathome.cern.ch/lhcathome/server_status.php
It is now 8 k.

As long as this limit is not reached, a BOINC client that requests a task will get one (or more).
These tasks will start a VM or increase the client's local buffer.

If the limit is reached the client will get a "No tasks available ..." message.


The issue is MCPlots is still reporting only ~1K CPU hours returned.

VMs that don't process a subtask also don't add CPU hours.
Instead they shut down and the client starts a fresh VM.


There do not seem to be any hosts acting as a black hole and sucking up all the jobs.

Not one single host, but all active hosts together.
They fight against each other to get the few available subtasks.


According to:
http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots
the # of available subtasks seems to be high enough but I'm curious why the distribution ratio seems to be much too low.

I guess that when this ratio rises, the #tasks (from above) will also stabilize at a lower level, as the mean runtime per VM will increase.
ID: 37523
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 37524 - Posted: 4 Dec 2018, 12:09:45 UTC - in response to Message 37523.  


According to:
http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots
the # of available subtasks seems to be high enough but I'm curious why the distribution ratio seems to be much too low.

I guess that when this ratio rises, the #tasks (from above) will also stabilize at a lower level, as the mean runtime per VM will increase.

Could somebody please explain how to read the MCPlots graphs above? For example, what is the "lost ratio"?
ID: 37524
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37525 - Posted: 4 Dec 2018, 12:13:15 UTC - in response to Message 37523.  


There do not seem to be any hosts acting as a black hole and sucking up all the jobs.

Not one single host, but all active hosts together.
They fight against each other to get the few available subtasks.
Which is bad enough, and rather frustrating ... :-(
ID: 37525
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 37526 - Posted: 4 Dec 2018, 12:32:59 UTC - in response to Message 37523.  
Last modified: 4 Dec 2018, 12:33:16 UTC


7 k was the #tasks limit that can be seen here:
https://lhcathome.cern.ch/lhcathome/server_status.php
It is now 8 k.

I was looking at our other server that is delivering the subtasks. The numbers are roughly in agreement.


As long as this limit is not reached, a BOINC client that requests a task will get one (or more).
These tasks will start a VM or increase the client's local buffer.
If the limit is reached the client will get a "No tasks available ..." message.

There is a small buffer of tasks but there is no limit.



VMs that don't process a subtask also don't add CPU hours.
Instead they shut down and the client starts a fresh VM.

VMs that don't process a subtask also don't get a subtask and hence there would be a difference between the number of tasks and subtasks.


Not one single host, but all active hosts together.
They fight against each other to get the few available subtasks.

No, there are currently 7.5K subtasks 'running'.
ID: 37526
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,502
RAC: 136,849
Message 37527 - Posted: 4 Dec 2018, 13:05:46 UTC - in response to Message 37526.  

No limit?
So, what is meant by the "limit" here?
Laurence wrote:
This morning we had 6976 jobs in progress and so hit our 7K queue limit. This has now been increased to 8K.

now:
Laurence wrote:
There is a small buffer of tasks but there is no limit.




Laurence wrote:
I was looking at our other server that is delivering the subtasks. The numbers are roughly in agreement.

Sorry to ask again.
Are these the numbers shown as "#tasks in progress" on the server status page?


Laurence wrote:
hence there would be a difference between the number of tasks and subtasks.
...
No, there are currently 7.5K subtasks 'running'


Shouldn't there be a significant difference?
What about tasks that are sent out but remain unstarted in the client work buffers?
I would expect the #subtasks to be smaller than the #tasks.
Or at least different as it is a multicore app.
ID: 37527
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,502
RAC: 136,849
Message 37528 - Posted: 4 Dec 2018, 13:14:33 UTC - in response to Message 37525.  

There do not seem to be any hosts acting as a black hole and sucking up all the jobs.

Not one single host, but all active hosts together.
They fight against each other to get the few available subtasks.

Which is bad enough, and rather frustrating ... :-(

My comments were meant as questions rather than statements.
Please read them that way, not as "truth".
ID: 37528
Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 37529 - Posted: 4 Dec 2018, 13:19:44 UTC - in response to Message 37527.  

No limit?
So, what is meant by the "limit" here?


There is a limit on the number of subtasks but not the number of tasks.


Sorry to ask again.
Are these the numbers shown as "#tasks in progress" on the server status page?

The number of tasks is on the status page, the number of subtasks is only visible internally.



Shouldn't there be a significant difference?

Yes, if there were lots of VMs trying to get subtasks but that is not what we see.


What about tasks that are sent out but remain unstarted in the client work buffers?
I would expect the #subtasks to be smaller than the #tasks.
Or at least different as it is a multicore app.


There were 4K running subtasks older than 48 hours. I suspect most have been suspended or disconnected. These 4K are being counted as part of the queue. I am looking at ways to handle this situation.
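
One possible way to picture the stale-subtask problem is sketched below. This is a hypothetical illustration, not the project's actual server code: the function `split_stale`, its inputs, and the sample timestamps are all made up, and it assumes the server knows when each running subtask was started. Only the 48-hour threshold comes from the post.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)  # threshold mentioned in the post


def split_stale(start_times, now, stale_after=STALE_AFTER):
    """Return (live, stale) counts for subtasks marked as running."""
    stale = sum(1 for started in start_times if now - started > stale_after)
    return len(start_times) - stale, stale


# Made-up sample: five subtasks started 1, 5, 47, 49 and 72 hours ago.
now = datetime(2018, 12, 4, 13, 0, tzinfo=timezone.utc)
starts = [now - timedelta(hours=h) for h in (1, 5, 47, 49, 72)]
live, stale = split_stale(starts, now)
print(live, stale)  # 3 2
```

Counting only the live subtasks against the queue limit would keep suspended or disconnected VMs from using up slots, at the cost of possibly handing out work that a slow host is still processing.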
ID: 37529
peterfilla
Joined: 2 Jan 11
Posts: 23
Credit: 5,986,899
RAC: 0
Message 37531 - Posted: 4 Dec 2018, 17:22:44 UTC

I upgraded VBox to version 5.2.22, and now my jobs are running... is this the solution?!
ID: 37531
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,483,743
RAC: 104,419
Message 37532 - Posted: 4 Dec 2018, 17:30:49 UTC - in response to Message 37531.  

I upgraded VBox to version 5.2.22, and now my jobs are running... is this the solution?!
I think this was more of a coincidence :-)
ID: 37532
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,040,502
RAC: 136,849
Message 37533 - Posted: 4 Dec 2018, 17:38:13 UTC - in response to Message 37531.  

I upgraded VBox to version 5.2.22, and now my jobs are running... is this the solution?!

No.
It's not a client side issue.
The project server simply can't satisfy the demand for subtasks.

That sounds easy, but in detail it seems to be rather complex to find the right settings.
ID: 37533


©2024 CERN