Questions and Answers : Windows : All vBox WU in error
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43949 - Posted: 19 Dec 2020, 10:46:39 UTC - in response to Message 43948.  
Last modified: 19 Dec 2020, 10:56:06 UTC

For example, one of the VMs is stuck on cloud-init, from what I can see on the screen.
The other one apparently got through and is running libvirtd.

EDIT: the one that was stuck on cloud-init has now stopped. BOINC still shows the task as computing, but according to VirtualBox the VM is stopped; the other one is still running.
I've now changed the preference from 2 concurrent jobs to 1.

The one that was stuck on cloud-init finally ended in an error in BOINC. Is cloud-init only related to network connectivity, or to something else as well?
ID: 43949
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43950 - Posted: 19 Dec 2020, 10:59:33 UTC

Also a small remark: 1,850% doesn't seem to be the same point in every VM, because here it stopped at the same number but apparently not at the same stage inside the VM.
ID: 43950
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2491
Credit: 247,718,161
RAC: 123,919
Message 43951 - Posted: 19 Dec 2020, 12:03:38 UTC - in response to Message 43950.  

My guess is that your VM had just started.
It allocates 2 GB of RAM for the VM, as can be seen here:
2020-12-19 11:33:47 (68752): Setting Memory Size for VM. (2048MB)

After a while it gets a signal to pause:
2020-12-19 11:40:52 (68752): VM state change detected. (old = 'Running', new = 'Paused')

This causes a snapshot of the VM (2 GB) to be written to the disk for later use.
Just a few seconds later it gets a resume signal:
2020-12-19 11:41:02 (68752): VM state change detected. (old = 'Paused', new = 'Running')

This causes the snapshot to be loaded from the disk.


The same happens over and over again, and eventually the processes inside the VM don't get enough time to do what they should, e.g. to update the heartbeat file.
That file signals to the vboxwrapper (outside the VM) that the processes inside the VM are healthy.
If the heartbeat file is not updated regularly, vboxwrapper considers the VM lost and ends the task:
2020-12-19 11:55:15 (68752): VM Heartbeat file specified, but missing.
2020-12-19 11:55:15 (68752): VM Heartbeat file specified, but missing file system status. (errno = '2')
2020-12-19 11:55:15 (68752): Powering off VM.
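
Just to illustrate the principle (this is only a sketch, not vboxwrapper's real code, and the timeout value is only an example), the heartbeat check boils down to something like this:

    # Sketch only: the VM side is expected to update the file regularly,
    # the wrapper on the host merely looks at how old the file is.
    import os
    import time

    HEARTBEAT_FILE = "shared/heartbeat"   # inside the task's slot directory
    TIMEOUT = 1200                        # example value; the task defines the real interval

    def vm_looks_healthy() -> bool:
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
            return age < TIMEOUT
        except OSError:                   # file missing -> errno 2, as in the log above
            return False

    # if this keeps returning False, the wrapper gives up and powers off the VM

If the VM is paused (or starved by the throttle) for too long, the file simply never gets refreshed and the check fails.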



Not the main reason, but at least part of the problem may be the CPU throttle of only 65%, since it causes all processes that are children of your BOINC client to run at limited speed, in this case the VM as well as vboxwrapper.

Your job would be to find out why BOINC sends so many pause/resume commands.
This could be caused by a low RAM limit allowed for BOINC, but nobody except you knows what else is running on that computer.




1,850% doesn't seem to be the same point in every VM.

This is an estimate made by BOINC, not real progress.
It has been explained many times here in the forum.
Looking at this number will not help you solve any problem.
ID: 43951
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43952 - Posted: 19 Dec 2020, 20:41:50 UTC - in response to Message 43951.  

1,850% doesn't seem to be the same point in every VM.


This is an estimate made by BOINC, not real progress.
It has been explained many times here in the forum.
Looking at this number will not help you solve any problem.


Hmm, okay, sorry, I didn't know that.

Well, that doesn't seem very accurate, does it? Even with a CPU limitation, unless there are tens of thousands of very small files to load, restoring from an SSD should happen more or less instantaneously. I will try to tweak things so that it gets more time, but I definitely think there is something wrong with the VMs and how they are built.
As I said before, it's not the first time we are discussing the same problem on this forum.
Like the other guy said, people are trying to help by allocating resources and giving their own time to make it work, so everything should be optimized to facilitate that, not complicated by saying "yes, but to make it work you need to fill in a checklist a hundred items long". That is a side note, meant to reinforce the other guy's message, and I would like contributors and devs to keep it in mind at all times, including in their code, not only when helping.

What is the CPU time in the task stats on the website, and what is the other stat? They are very close. So unless this is also an estimate, this is happening at night, when only Windows services are active and the other VMs are not doing much either. So, as a matter of common sense, I think we can both agree this is not normal, correct?
ID: 43952
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43953 - Posted: 20 Dec 2020, 1:31:14 UTC
Last modified: 20 Dec 2020, 1:41:08 UTC

2020-12-19 20:38:02 (28448): Setting CPU throttle for VM. (65%)


VMs are fragile and don't give the host much room to make its own optimizations. Reducing CPU or RAM brings in other limits and makes the VM run in a state it is not intended to be in; a bad combination leads to the pause/resume cycles in these tasks.
An SSD would not help with these tasks; they would still suffer interruptions, and that leaves unusable instances in VirtualBox, clogging it up. Those need to be cleared out by hand, followed by a system restart.

Set BOINC back to its defaults, clear those instances out, then restart the system.
ID: 43953
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43954 - Posted: 20 Dec 2020, 8:58:48 UTC - in response to Message 43953.  

Well, maybe you have a Xeon CPU with 64 cores, or maybe you only do word-processing things on your hosts, but I don't.
Plus, strangely, I don't have those problems with VMs that are far more demanding in terms of horsepower, which also get paused and resumed, and also run with a CPU execution cap, and there is no problem with them. So maybe, as you said, those VMs are fragile, but if they are, then there is a serious problem with the guest OS, which is not correctly optimized.

But anyway, your arguments don't have any grip on reality. Why? Because for the last 6 hours there haven't been enough pause and resume events to matter, the CERN VMs could use 100% of the CPU time, and still all the tasks resulted in errors, and more than a few of them didn't have any pauses at all.

So instead, again, of pointing at the habits or the systems of the people trying to help the community, please just use some common sense and see that this happens far too consistently for it to be the CPU cap.
ID: 43954
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43955 - Posted: 20 Dec 2020, 10:18:09 UTC

Well, maybe you have a Xeon CPU with 64 cores, or maybe you only do word-processing things on your hosts, but I don't.

This is not about my hardware or what it is processing. It has no effect on solving the issue here.

But anyway, your arguments don't have any grip on reality. Why? Because for the last 6 hours there haven't been enough pause and resume events to matter, the CERN VMs could use 100% of the CPU time, and still all the tasks resulted in errors, and more than a few of them didn't have any pauses at all.


2020-12-20 09:40:29 (48180): VM state change detected. (old = 'Running', new = 'Paused')
2020-12-20 09:40:39 (48180): VM state change detected. (old = 'Paused', new = 'Running')
2020-12-20 09:51:31 (48180): VM state change detected. (old = 'Running', new = 'Paused')
2020-12-20 09:51:41 (48180): VM state change detected. (old = 'Paused', new = 'Running')
2020-12-20 09:53:14 (48180): Guest Log: [INFO] Mounting the shared directory


2020-12-20 09:54:02 (48180): VM Heartbeat file specified, but missing.
2020-12-20 09:54:02 (48180): VM Heartbeat file specified, but missing file system status. (errno = '2') 


And the VM is still throttled on the last task.

2020-12-20 09:33:39 (48180): Preference change detected
2020-12-20 09:33:39 (48180): Setting CPU throttle for VM. (65%)
2020-12-20 09:33:39 (48180): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds))


So instead, again, of pointing at the habits or the systems of the people trying to help the community, please just use some common sense and see that this happens far too consistently for it to be the CPU cap.


You have set a CPU throttle, and the VMs have a hard time handling it. You could set it back to 100%.
It is up to you; I can't help you if you're not open to going back to the default settings.
ID: 43955
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1364
Credit: 9,108,524
RAC: 2,671
Message 43956 - Posted: 20 Dec 2020, 10:56:07 UTC

vigilian,

To me it looks like you are running a lot of CMS tasks at once.

The error "VM Heartbeat file specified, but missing" is caused by the lack of sufficiently fast interaction between VBoxService and BOINC's vboxwrapper.
Because of that missing heartbeat, BOINC stops the task: it assumes the VM is not responding to VBoxManage commands and doesn't want to waste time.

You could first try a single Theory task and then a single CMS task.
CMS is trickier when it comes to network issues.
ID: 43956
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43968 - Posted: 21 Dec 2020, 9:52:40 UTC - in response to Message 43955.  

Well, maybe you have a Xeon CPU with 64 cores, or maybe you only do word-processing things on your hosts, but I don't.

This is not about my hardware or what it is processing. It has no effect on solving the issue here.

But anyway, your arguments don't have any grip on reality. Why? Because for the last 6 hours there haven't been enough pause and resume events to matter, the CERN VMs could use 100% of the CPU time, and still all the tasks resulted in errors, and more than a few of them didn't have any pauses at all.


2020-12-20 09:40:29 (48180): VM state change detected. (old = 'Running', new = 'Paused')
2020-12-20 09:40:39 (48180): VM state change detected. (old = 'Paused', new = 'Running')
2020-12-20 09:51:31 (48180): VM state change detected. (old = 'Running', new = 'Paused')
2020-12-20 09:51:41 (48180): VM state change detected. (old = 'Paused', new = 'Running')
2020-12-20 09:53:14 (48180): Guest Log: [INFO] Mounting the shared directory


2020-12-20 09:54:02 (48180): VM Heartbeat file specified, but missing.
2020-12-20 09:54:02 (48180): VM Heartbeat file specified, but missing file system status. (errno = '2') 


And the VM is still throttled on the last task.

2020-12-20 09:33:39 (48180): Preference change detected
2020-12-20 09:33:39 (48180): Setting CPU throttle for VM. (65%)
2020-12-20 09:33:39 (48180): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds))


So instead, again, of pointing at the habits or the systems of the people trying to help the community, please just use some common sense and see that this happens far too consistently for it to be the CPU cap.


You have set a CPU throttle, and the VMs have a hard time handling it. You could set it back to 100%.
It is up to you; I can't help you if you're not open to going back to the default settings.


Yes, and you are not listening. Again, I don't need this project and I don't care whether it works or not. It's actually me helping the project, not the other way around. That's the common sense you should have.
As I said, I can't allocate 100% of the CPU to BOINC; that would be nonsense. And a small remark I can give you, by the way: putting 100% CPU on a project like that clogs the system anyway, because you are not leaving enough room for other tasks, like Windows background tasks, to breathe. When you add 25% + 65% you get 90%, and suppose I put no restrictions, what happens then? You get an unresponsive system. It has been this way since I first tried to help this project, back when the i7-2600K launched. That's a fact, whether you like it or not.

Crystal Pellet is the only one whose story isn't on a loop.

Well, the messages from the tasks are not accurate, and do you know why?

20/12/2020 08:54:46 | LHC@home | Requesting new tasks for CPU
20/12/2020 08:54:48 | LHC@home | Scheduler request completed: got 1 new tasks
20/12/2020 08:54:48 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:03:01 |  | Suspending computation - CPU is busy
20/12/2020 09:03:11 |  | Resuming computation
20/12/2020 09:07:24 | LHC@home | Computation for task CMS_3501742_1608447092.058199_0 finished
20/12/2020 09:07:24 | LHC@home | Starting task CMS_3496799_1608446491.320197_0
20/12/2020 09:08:35 | LHC@home | Sending scheduler request: To report completed tasks.
20/12/2020 09:08:35 | LHC@home | Reporting 1 completed tasks
20/12/2020 09:08:35 | LHC@home | Requesting new tasks for CPU
20/12/2020 09:08:37 | LHC@home | Scheduler request completed: got 1 new tasks
20/12/2020 09:08:37 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:14:40 | LHC@home | Computation for task CMS_3511089_1608448592.981055_0 finished
20/12/2020 09:14:40 | LHC@home | Starting task CMS_3484353_1608444690.176501_0
20/12/2020 09:16:03 | LHC@home | Sending scheduler request: To report completed tasks.
20/12/2020 09:16:03 | LHC@home | Reporting 1 completed tasks
20/12/2020 09:16:03 | LHC@home | Requesting new tasks for CPU
20/12/2020 09:16:05 | LHC@home | Scheduler request completed: got 1 new tasks
20/12/2020 09:16:05 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:18:50 | World Community Grid | Computation for task MIP1_00327156_1304_0 finished
20/12/2020 09:18:50 | LHC@home | Starting task CMS_3526052_1608450695.312753_0
20/12/2020 09:18:52 | World Community Grid | Started upload of MIP1_00327156_1304_0_r1188955442_0
20/12/2020 09:18:54 | World Community Grid | Finished upload of MIP1_00327156_1304_0_r1188955442_0
20/12/2020 09:19:21 | LHC@home | Computation for task CMS_3511101_1608448593.060255_0 finished
20/12/2020 09:19:21 | World Community Grid | Starting task MCM1_0169779_2766_1
20/12/2020 09:21:01 | LHC@home | Sending scheduler request: To report completed tasks.
20/12/2020 09:21:01 | LHC@home | Reporting 1 completed tasks
20/12/2020 09:21:01 | LHC@home | Requesting new tasks for CPU
20/12/2020 09:21:03 | LHC@home | Scheduler request completed: got 1 new tasks
20/12/2020 09:21:03 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:25:44 | GPUGRID | Sending scheduler request: Requested by project.
20/12/2020 09:25:44 | GPUGRID | Requesting new tasks for NVIDIA GPU
20/12/2020 09:25:46 | GPUGRID | Scheduler request completed: got 0 new tasks
20/12/2020 09:25:46 | GPUGRID | Project is temporarily shut down for maintenance
20/12/2020 09:25:46 | GPUGRID | Project requested delay of 3600 seconds
20/12/2020 09:33:09 | LHC@home | Computation for task CMS_3496799_1608446491.320197_0 finished
20/12/2020 09:33:09 | LHC@home | Starting task CMS_3513309_1608448893.228973_0
20/12/2020 09:35:03 | LHC@home | Sending scheduler request: To report completed tasks.
20/12/2020 09:35:03 | LHC@home | Reporting 1 completed tasks
20/12/2020 09:35:03 | LHC@home | Requesting new tasks for CPU
20/12/2020 09:35:05 | LHC@home | Scheduler request completed: got 1 new tasks
20/12/2020 09:35:05 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:40:26 | LHC@home | Computation for task CMS_3484353_1608444690.176501_0 finished
20/12/2020 09:40:26 | LHC@home | Starting task CMS_3523673_1608450395.107557_0
20/12/2020 09:40:28 |  | Suspending computation - CPU is busy
20/12/2020 09:40:38 |  | Resuming computation
20/12/2020 09:41:32 |  | Suspending GPU computation - computer is in use
20/12/2020 09:42:16 | LHC@home | Sending scheduler request: To report completed tasks.
20/12/2020 09:42:16 | LHC@home | Reporting 1 completed tasks
20/12/2020 09:42:16 | LHC@home | Requesting new tasks for CPU
20/12/2020 09:42:18 | LHC@home | Scheduler request completed: got 1 new tasks
20/12/2020 09:42:18 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:44:34 | LHC@home | Computation for task CMS_3526052_1608450695.312753_0 finished
20/12/2020 09:45:33 | LHC@home | update requested by user
20/12/2020 09:45:34 | LHC@home | Sending scheduler request: Requested by user.
20/12/2020 09:45:34 | LHC@home | Reporting 1 completed tasks
20/12/2020 09:45:34 | LHC@home | Requesting new tasks for CPU
20/12/2020 09:45:35 | LHC@home | Scheduler request completed: got 0 new tasks
20/12/2020 09:45:35 | LHC@home | No tasks sent
20/12/2020 09:45:35 | LHC@home | This computer has reached a limit on tasks in progress
20/12/2020 09:45:35 | LHC@home | Project requested delay of 6 seconds
20/12/2020 09:51:29 |  | Suspending computation - CPU is busy
20/12/2020 09:51:39 |  | Resuming computation
20/12/2020 09:59:17 | LHC@home | Computation for task CMS_3513309_1608448893.228973_0 finished
20/12/2020 09:59:17 | LHC@home | Starting task CMS_3515639_1608449193.816402_0
20/12/2020 10:00:42 | LHC@home | Sending scheduler request: To report completed tasks.
20/12/2020 10:00:42 | LHC@home | Reporting 1 completed tasks
20/12/2020 10:00:42 | LHC@home | Requesting new tasks for CPU
20/12/2020 10:00:45 | LHC@home | Scheduler request completed: got 0 new tasks

Because those are the accurate messages.
Which means the VMs take 11 minutes to be restored.... from an SSD.... And you don't find that odd? Not in the slightest? So a VM takes, between 09:40:39 and 09:51:31, 10 minutes 52 seconds to restore itself, which is longer than any VM in the world needs to boot or resume from a paused state, and you don't see the problem? Really?
What is it doing for those 10 minutes? Because it certainly is running, according to BOINC itself and its logs.

And you don't find it odd either that each time the VM needs to start, computation gets suspended even when nothing else is running? So you're actually telling me that an AMD 3700X with 16 threads has to use more than 65% of its overall resources to launch this small VM, which again is not the heaviest VM in the world: it has no GUI, it's a headless server with practically no drivers to load, nothing.... And for you that's normal? Business as usual? Seriously?

Plus, I've asked two precise questions here:


What is the CPU time in the task stats on the website, and what is the other stat? They are very close. So unless this is also an estimate, this is happening at night, when only Windows services are active and the other VMs are not doing much either. So, as a matter of common sense, I think we can both agree this is not normal, correct?


So instead of repeating yourselves, just answer those questions precisely.
ID: 43968
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43969 - Posted: 21 Dec 2020, 10:10:50 UTC - in response to Message 43968.  

And you don't find it odd either that each time it goes into a computation error at the same percentage in BOINC, which is exactly the same CPU time in the task stats, and also at exactly the same line, "Mounting the shared directory"?
It doesn't strike you as a problem? At any moment?

There is a problem with those VMs and their interaction with Windows. That's the real problem, whether it's an interaction with a third-party program or something else, and that's probably why it can't find the file in the shared directory, since I guess that's where the heartbeat file is (correct?), and so that results in a computation error.

Just to continue on a common-sense note, don't forget that I'm putting time into writing this too. I could simply consider this project buggy and just delete it. So instead, again, of giving the same nonsense argument that has no grip on reality (like "yeah, an SSD doesn't improve disk access in any way", right ^^), maybe, just maybe, look at the stats of the VM and the actual surroundings of the tasks, and not only at the messages that the guest OS provides (which are by nature incomplete).

What would actually be a smart way to give me advice is to point me to the ticket system for those VMs, so that I can file a ticket and we can look into this more deeply, maybe with another way of collecting accurate logs about what's happening here.
ID: 43969
maeax

Joined: 2 May 07
Posts: 2184
Credit: 172,752,337
RAC: 34,730
Message 43970 - Posted: 21 Dec 2020, 10:20:33 UTC - in response to Message 43969.  

I have understanding for your arguments.
Could you test Theory, for example, instead of CMS? (C.P. asked you as well.)
If you change your CPU throttle from 65% to 95% and run a Theory task, is it successful?
ID: 43970
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1364
Credit: 9,108,524
RAC: 2,671
Message 43971 - Posted: 21 Dec 2020, 10:24:18 UTC - in response to Message 43968.  
Last modified: 21 Dec 2020, 17:30:35 UTC

So instead of repeating yourselves, just answer those questions precisely.
Could you please act a bit more friendly when someone is trying to help you?


From your log:
Suspending computation - CPU is busy

This is because of a BOINC computing preference setting.

Try a few BOINC settings:
- Suspend when non-BOINC CPU usage is above --> set this to 0 (zero)
- Set BOINC to use 100% of CPU time
- Reduce 'Use at most ...% of the CPUs' if your system gets sluggish.
- Tick 'Leave non-GPU tasks in memory while suspended'

... and don't start several VMs at once.
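
For reference, roughly the same settings can be written into a global_prefs_override.xml in the BOINC data directory. The values below are only an illustration of the list above; adjust them to your machine:

    <global_preferences>
        <suspend_cpu_usage>0</suspend_cpu_usage>         <!-- don't suspend when non-BOINC CPU usage is high -->
        <cpu_usage_limit>100</cpu_usage_limit>           <!-- use 100% of CPU time, i.e. no throttling -->
        <max_ncpus_pct>50</max_ncpus_pct>                <!-- lower this instead if the system gets sluggish -->
        <leave_apps_in_memory>1</leave_apps_in_memory>   <!-- leave non-GPU tasks in memory while suspended -->
    </global_preferences>

After saving it, restart the client or tell it to re-read the preferences (e.g. boinccmd --read_global_prefs_override) so the changes take effect.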
ID: 43971
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43979 - Posted: 22 Dec 2020, 10:28:52 UTC

I didn't get the time to set BOINC to an exclusive 100% of the CPUs, but 100% CPU time has been set for two days now, and it didn't change a thing.

What I did have time for this morning was giving 25 minutes of my time to watch another task fail without me doing anything. 25 minutes is the exact amount of time this project needs to fail, EACH TIME. Not from time to time, not sometimes, EACH TIME.

So let's review it together, shall we:

22/12/2020 09:56:06 | GPUGRID | No tasks sent
22/12/2020 09:56:06 | GPUGRID | This computer has reached a limit on tasks in progress
22/12/2020 09:56:06 | GPUGRID | Project requested delay of 31 seconds
22/12/2020 10:04:42 |  | Suspending GPU computation - computer is in use
22/12/2020 10:10:19 | LHC@home | Computation for task CMS_472945_1608624847.991143_0 finished
22/12/2020 10:11:44 | LHC@home | Sending scheduler request: To report completed tasks.
22/12/2020 10:11:44 | LHC@home | Reporting 1 completed tasks
22/12/2020 10:11:44 | LHC@home | Requesting new tasks for CPU
22/12/2020 10:11:46 | LHC@home | Scheduler request completed: got 1 new tasks
22/12/2020 10:11:46 | LHC@home | Project requested delay of 6 seconds
22/12/2020 10:11:48 | LHC@home | Starting task CMS_484666_1608626650.023215_0
22/12/2020 10:13:02 | World Community Grid | Computation for task MIP1_00327339_0391_0 finished
22/12/2020 10:13:04 | World Community Grid | Started upload of MIP1_00327339_0391_0_r652393296_0
22/12/2020 10:13:08 | World Community Grid | Finished upload of MIP1_00327339_0391_0_r652393296_0
22/12/2020 10:14:04 | World Community Grid | project suspended by user
22/12/2020 10:14:06 | GPUGRID | project suspended by user
22/12/2020 10:14:07 | LHC@home | Sending scheduler request: To fetch work.
22/12/2020 10:14:07 | LHC@home | Requesting new tasks for CPU
22/12/2020 10:14:09 | LHC@home | Scheduler request completed: got 0 new tasks
22/12/2020 10:14:09 | LHC@home | No tasks sent
22/12/2020 10:14:09 | LHC@home | This computer has reached a limit on tasks in progress
22/12/2020 10:14:09 | LHC@home | Project requested delay of 6 seconds
22/12/2020 10:28:21 | LHC@home | Sending scheduler request: To fetch work.
22/12/2020 10:28:21 | LHC@home | Requesting new tasks for CPU
22/12/2020 10:28:22 | LHC@home | Scheduler request completed: got 0 new tasks
22/12/2020 10:28:22 | LHC@home | No tasks sent
22/12/2020 10:28:22 | LHC@home | This computer has reached a limit on tasks in progress
22/12/2020 10:28:22 | LHC@home | Project requested delay of 6 seconds
22/12/2020 10:29:04 |  | Resuming GPU computation
22/12/2020 10:29:12 |  | Suspending GPU computation - computer is in use
22/12/2020 10:32:19 |  | Resuming GPU computation
22/12/2020 10:32:24 |  | Suspending GPU computation - computer is in use
22/12/2020 10:37:30 | LHC@home | Computation for task CMS_484666_1608626650.023215_0 finished
22/12/2020 10:38:39 | LHC@home | Sending scheduler request: To report completed tasks.
22/12/2020 10:38:39 | LHC@home | Reporting 1 completed tasks
22/12/2020 10:38:39 | LHC@home | Requesting new tasks for CPU
22/12/2020 10:38:41 | LHC@home | Scheduler request completed: got 1 new tasks
22/12/2020 10:38:41 | LHC@home | Project requested delay of 6 seconds
22/12/2020 10:38:43 | LHC@home | Starting task CMS_482381_1608626349.863761_0
22/12/2020 10:38:51 | LHC@home | Sending scheduler request: To fetch work.
22/12/2020 10:38:51 | LHC@home | Requesting new tasks for CPU
22/12/2020 10:38:52 | LHC@home | Scheduler request completed: got 0 new tasks
22/12/2020 10:38:52 | LHC@home | No tasks sent
22/12/2020 10:38:52 | LHC@home | This computer has reached a limit on tasks in progress
22/12/2020 10:38:52 | LHC@home | Project requested delay of 6 seconds


As you can see, I've been patient enough not to disturb the process.
There was no computation suspension whatsoever; I even suspended the other projects.
This is, I guess, the task in question:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=292553098


And I'm sorry, guys, but when a job is badly done I just get mad, and I've been working in this field since I was 18 years old, which is a long time for someone of my age.


And THIS is ABNORMAL:

Task ID / Work unit / Computer / Sent / Reported / Status / Run time (s) / CPU time (s) / Credit / Application
292553098 	150175186 	10670359 	22 Dec 2020, 9:11:45 UTC 	22 Dec 2020, 9:38:40 UTC 	Error while computing 	1,528.14 	20.84 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292551833 	150174199 	10670359 	22 Dec 2020, 8:42:43 UTC 	22 Dec 2020, 9:11:45 UTC 	Error while computing 	1,528.12 	22.95 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292550046 	150173024 	10670359 	22 Dec 2020, 8:11:59 UTC 	22 Dec 2020, 8:42:42 UTC 	Error while computing 	1,528.18 	22.27 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292550038 	150173016 	10670359 	22 Dec 2020, 7:45:07 UTC 	22 Dec 2020, 8:11:59 UTC 	Error while computing 	1,508.77 	20.88 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292549266 	150172620 	10670359 	22 Dec 2020, 7:11:39 UTC 	22 Dec 2020, 7:45:07 UTC 	Error while computing 	1,528.54 	22.45 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292548664 	150172248 	10670359 	22 Dec 2020, 6:44:30 UTC 	22 Dec 2020, 7:11:39 UTC 	Error while computing 	1,528.53 	21.52 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292541108 	150167394 	10670359 	22 Dec 2020, 6:17:34 UTC 	22 Dec 2020, 6:44:30 UTC 	Error while computing 	1,529.67 	24.64 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64
292546876 	150171154 	10670359 	22 Dec 2020, 5:49:51 UTC 	22 Dec 2020, 6:17:34 UTC 	Error while computing 	1,527.65 	23.44 	--- 	CMS Simulation v50.00 (vbox64) windows_x86_64


And I have pages and pages of that.
So I don't want to hear any more of: "yeah, but you know, you've set an execution cap, but you don't have enough memory, but blah blah blah".

That many occurrences, each failing at such a precise moment, is clearly not random, and since no one wants to answer me on whether that figure is an estimate rather than accurate, I will take it as a "no, it's not an estimate, it's the accurate CPU processing time".

There is something wrong with those CMS VMs. And I frankly don't care whether it works for the few hundred, maybe thousand, regular participants,
because let's be honest, we are a very select club here: people in IT and development who actually know something about the SETI program and who came all the way from there to BOINC and then to the CERN projects.... and I even know Belgian physicists (more than one, of various ages) who actually DON'T KNOW about the CERN projects on BOINC even though they participate in LHC simulations.
And working only for a small minority is not good enough. If it doesn't work everywhere, then there is a problem in the programming, that's it. And it has always been like that, whether it's a lack of conditions in the code to prevent something from happening or a lack of error handling.


And yes, Crystal Pellet, I'm a bit harsh, even though I acknowledged earlier that you were the only one not stuck on a loop (which was a compliment, by the way). But it's like telling a customer that he's crazy, that what he's telling you is happening doesn't actually happen. That's the same kind of idiotic behaviour or nonsense. You start by believing the customer about what is happening, by asking yourself whether the product can actually handle that situation, and by revising your code, not the other way around.

So let's just assume for once that I'm right.

The few possibilities here are:
- the CERN VM can't handle some of the VBoxManage parameters that it is supposed to support according to the BOINC documentation.
- a third-party program is interfering with the mounting of that specific VM's shared folders, for whatever reason.
- there is a bug in VirtualBox 6.1.16.
- there is a bug in the OS version the CERN VM uses, or in something else in Windows 10 interacting with the CERN guest OS.

Either way, the logging in those VMs is insufficient and incomplete. And it can't be a corruption of the VM, because if it were, it wouldn't start, and I have definitely reset the project several times already. Nor is it a VirtualBox installation problem, because then I would be overwhelmed by problems on my numerous other VMs or by other errors; and anyway, I have already reinstalled VirtualBox more than once.

Every project is ticked in the school profile in the LHC preferences, so I should receive every project, but I only receive CMS and SixTrack (from time to time). I will untick CMS to see whether I receive Theory tasks, but I highly doubt it; I would have received some by now.

And I literally don't have time to play with BOINC parameters all day. The devs and moderators here should already be grateful that someone is willing to put this much energy into actually testing the project, writing back here, doing extensive testing and trying to explain to them where they are wrong and going in circles.


A few remarks, though.
While I was waiting for that task to fail,
I was doing some maintenance on an Ubuntu 20.10 VM, and I mention this for a very precise reason: I restarted that VM several times, and as you can see there wasn't a single computation suspension => which proves that starting a VM does not need 65% of an AMD 3700X. Which in itself means that Gunde was wrong in saying "yeah, it's certainly the CPU cap, blah blah blah". No, it is not. And the computation suspensions at the start of CMS tasks (some of them, not all of them) are not normal either, especially since those VMs are not as power-hungry as an Ubuntu 20.10 with a GUI enabled.

And before Crystal Pellet says anything like "yeah, but you ran several VMs at once": there was only one CMS task running at the time. What you saw yesterday was the result of me switching profiles from home to school, and I didn't notice at the time that the school profile was set to 2 tasks at once; I changed it to 1 during that day. But yes, sorry, I'm working, guys, and my own VMs need to be up at all times, which shouldn't be a problem whatsoever. Since when do we doubt that VirtualBox, which works for millions of customers, can handle several VMs at once? It can only mean one thing, again: that there is a problem INSIDE the CERN VM, or a problem related to that specific VM, not something else.
ID: 43979
maeax

Joined: 2 May 07
Posts: 2184
Credit: 172,752,337
RAC: 34,730
Message 43980 - Posted: 22 Dec 2020, 10:41:33 UTC

ID: 43980
vigilian

Joined: 23 Jul 05
Posts: 53
Credit: 2,707,793
RAC: 0
Message 43981 - Posted: 22 Dec 2020, 11:10:46 UTC - in response to Message 43980.  

Well, yeah, sure, I could use 6.1.11, but I'm not in the habit of downgrading, especially when each release fixes so many CVEs and bugs.

22/12/2020 12:02:23 | LHC@home | Scheduler request completed: got 0 new tasks
22/12/2020 12:02:23 | LHC@home | No tasks sent
22/12/2020 12:02:23 | LHC@home | No tasks are available for SixTrack
22/12/2020 12:02:23 | LHC@home | No tasks are available for sixtracktest
22/12/2020 12:02:23 | LHC@home | No tasks are available for Theory Simulation
22/12/2020 12:02:23 | LHC@home | No tasks are available for ATLAS Simulation
22/12/2020 12:02:23 | LHC@home | Project requested delay of 6 seconds


These seem important to me, or at least I'm using all of these features:
Serial: Fixed blocking a re-connect when TCP mode is used (bug #19878) 
HPET: Fixed inability of guests to use the last timer 
Linux host and guest: Support kernel version 5.9 (bug #19845) 
Linux guest: Fixed Guest additions build for RHEL 8.3 beta (bug #19863) 
Linux guest: Fixed VBoxService crashing in the CPU hot-plug service under certain circumstances during a CPU hot-unplug event (bugs #19902 and #19903) 
GUI: Fixes file name changes in the File location field when creating Virtual Hard Disk (bug #19286) 
Linux host and guest: Linux kernel version 5.8 support 

Guest Additions: Improved resize coverage for VMSVGA graphics controller
Guest Additions: Fixed issues detecting guest additions ISO at runtime 
VBoxManage: Fixed command option parsing for the "snapshot edit" sub-command
VBoxManage: Fixed crash of 'VBoxManage internalcommands repairhd' when processing invalid input (bug #19579)
Guest Additions, 3D: New experimental GLX graphics output
Guest Additions, 3D: Fixed releasing texture objects, which could cause guest crashes 


If I have the time, sure, I will do it. But again, I'm working...
ID: 43981
maeax

Joined: 2 May 07
Posts: 2184
Credit: 172,752,337
RAC: 34,730
Message 43982 - Posted: 22 Dec 2020, 11:36:16 UTC - in response to Message 43981.  

If I have the time, sure, I will do it. But again, I'm working...

OK, we will wait then, and in the meantime please run only ONE Theory task!!
ID: 43982
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2491
Credit: 247,718,161
RAC: 123,919
Message 43983 - Posted: 22 Dec 2020, 13:01:11 UTC - in response to Message 43979.  

@vigilian

You introduced yourself as "the perfect pro" who knows everything about IT.
Not for the first time, you implicitly call everyone else an "incompetent fool": the developers at CERN providing the vbox image, the volunteers simply attaching their computers to the project and returning valid results, and the people trying to help you.


Looking at some numbers tells everybody that those "fools" deliver lots of good results:
CMS: 163 users per day
Theory: 773 users per day
ATLAS: 271 users per day

In the case of CMS, the Grafana monitoring tells us that a valid subtask result arrives at CERN every 40 s.


My common sense tells me none of THOSE volunteers is a fool!



What you have been told so far are possible reasons that help in most cases, but since you did not try the tips for weeks, they have been repeated a couple of times.
From experience, other problems are nearly always home-made, and as a real pro you would be able to do systematic tests to identify the causes.


Your own logfiles tell you that the VMs start fine but that vboxwrapper shuts them down when a watchdog timer expires, hence all tasks have (nearly) the same runtime.
The watchdog checks for an updated heartbeat file in the folder .../slots/x/shared.
That folder is used in both directions: to transfer the job details to the VM and to transfer files from the VM to the host environment.
Compared to other users' logs, your logfiles show some differences at the point where that folder should be mounted.
This points to a permission problem.
It's your job to ensure that the user account running the BOINC client and the VM has read/write access to the shared folder.
It's also your job to find out whether an init_data.xml is written to that folder when a task starts.

At least, your recent logs look as if the VM doesn't get the job data from init_data.xml and as a result doesn't move forward in the setup process.
This ends in a missing heartbeat file.
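
If you want to check that quickly yourself, a small script along these lines is enough (just a sketch; the slot path below is only an example and will differ on your system):

    # Sketch: check the files mentioned above in the shared folder of a running vbox task.
    import os
    import time

    SLOT_SHARED = r"C:\ProgramData\BOINC\slots\0\shared"   # example path, adjust to your setup

    for name in ("init_data.xml", "heartbeat"):
        path = os.path.join(SLOT_SHARED, name)
        if os.path.exists(path):
            age = time.time() - os.path.getmtime(path)
            print(f"{name}: present, last modified {age:.0f} s ago")
        else:
            print(f"{name}: MISSING")

    # A quick write probe shows whether the account running the script can write there.
    try:
        probe = os.path.join(SLOT_SHARED, "write_probe.tmp")
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
        print("write access: OK")
    except OSError as err:
        print(f"write access: FAILED ({err})")

Run it with the same user account that runs the BOINC client while a task is active; a missing or never-updated heartbeat, a missing init_data.xml, or a failing write probe each point to the permission problem described above.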
ID: 43983
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1364
Credit: 9,108,524
RAC: 2,671
Message 43984 - Posted: 22 Dec 2020, 13:08:55 UTC

The one and only reason we have seen so far for 'VM Heartbeat file specified, but missing' was a system being used so heavily that there was too little time for communication between vboxwrapper and VBoxService.

Since all your tasks fail at the same moment, I come to the conclusion that the heartbeat file is not there at all, or is not accessible. Is it created in the shared folder?
Your run times are about 25 minutes: 5 minutes setting up the VM and then 20 minutes (= the 1200-second heartbeat interval).

2020-12-22 11:33:20 (56104): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2020-12-22 11:53:34 (56104): VM Heartbeat file specified, but missing.
ID: 43984
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43985 - Posted: 22 Dec 2020, 17:00:29 UTC
Last modified: 22 Dec 2020, 17:52:57 UTC

Thanks, you have changed back to the defaults, so this issue is solved regarding:
2020-12-20 09:40:29 (48180): VM state change detected. (old = 'Running', new = 'Paused')
2020-12-20 09:40:39 (48180): VM state change detected. (old = 'Paused', new = 'Running')
2020-12-20 09:51:31 (48180): VM state change detected. (old = 'Running', new = 'Paused')
2020-12-20 09:51:41 (48180): VM state change detected. (old = 'Paused', new = 'Running')


I restarted that VM several times, and as you can see there wasn't a single computation suspension => which proves that starting a VM does not need 65% of an AMD 3700X. Which in itself means that Gunde was wrong in saying "yeah, it's certainly the CPU cap, blah blah blah". No, it is not. And the computation suspensions at the start of CMS tasks (some of them, not all of them) are not normal either, especially since those VMs are not as power-hungry as an Ubuntu 20.10 with a GUI enabled.


I disagree. I never said it was not the main cause; computezrmle mentioned it in an earlier post, so I did not repeat it in mine. To start troubleshooting, I asked you to set the settings back to default; with defaults it runs without the VM being interrupted. That is all about throttling.
That the VM manages to start/resume is not the same as throttling a process. When the client signals the task, it puts the VM into the paused state and creates a snapshot of its current state.

Putting 100% CPU on a project like that clogs the system

This is why we recommend reducing the number of cores/threads [Use at most X% of the CPUs] instead of using [Use at most X% of CPU time].

I restarted that VM several times, and as you can see there wasn't a single computation suspension


We can focus on the other issue from the log:
VM Heartbeat file specified, but missing


I would leave this to Crystal Pellet and computezrmle, but regarding this issue:

2020-12-22 11:33:20 (56104): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2020-12-22 11:53:34 (56104): VM Heartbeat file specified, but missing.


From my experience, this could be related to the network losing its connection, or to the server it connects to.

BOINC needs a quick and easy way to know if the project's app is still running, so the app periodically touches a disk file in .../../slots/shared. The period is ~60 seconds and the file is named heartbeat. Touching a file either creates the file or, if the file already exists, updates the 'last accessed' datetime.

heartbeat is zero-length (i.e. it's empty). You should be able to see heartbeat in a file manager. If not, then either your username doesn't have the required permissions or the file doesn't exist. Watch its last-accessed datetime and notice that it increases by 60 secs every 60 secs.

So the VM wrapper (or possibly the VM itself?) touches heartbeat every 60 secs. BOINC periodically looks at heartbeat. At that point the possible scenarios go something like this:

1. If BOINC cannot see heartbeat then it can reasonably assume either the app/VM never started or the app/VM deleted heartbeat then died.

2. If BOINC can see heartbeat and it's last accessed datetime has incremented from the previous time it looked at heartbeat then BOINC can be reasonably sure the VM still lives.

3. If heartbeat exists but last access datetime has not incremented then BOINC could assume the VM lives but it's hung or it could assume it's dead but it didn't delete heartbeat before it died.

In your case it sounds like BOINC is terminating the task because it has no heartbeat and appears to be dead.

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4700&postid=35222
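
In other words, the whole mechanism is just a periodic "touch" of an empty file. A bare-bones sketch of that side (illustrative only, not the project's actual code) looks like this:

    # Sketch of the 'touch' behaviour described in the quote above.
    import time
    from pathlib import Path

    HEARTBEAT = Path("shared") / "heartbeat"       # zero-length marker file
    PERIOD = 60                                    # seconds, as quoted above

    HEARTBEAT.parent.mkdir(parents=True, exist_ok=True)
    while True:
        HEARTBEAT.touch(exist_ok=True)             # create it, or just bump its timestamp
        time.sleep(PERIOD)

If anything prevents those regular touches (the VM being paused, the folder not being mounted, or missing write permission), BOINC eventually sees a stale or missing heartbeat and kills the task, exactly as in your logs.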
ID: 43985
Jonathan

Joined: 25 Sep 17
Posts: 99
Credit: 3,425,566
RAC: 10
Message 43986 - Posted: 22 Dec 2020, 23:52:37 UTC - in response to Message 43981.  

I am just chiming in to say that I am successfully running CMS tasks.
Four concurrent CMS tasks at a time. No other projects running. No app_config.xml for this project.

I have local preferences set to use at most 50% of the CPUs and 100% of CPU time.

The network is unrestricted.

Disk is 100 GB for BOINC.
Memory is 85% both when the computer is in use and when not in use. The page/swap file limit is 75%.

BOINC is 7.16.11 and VirtualBox is 6.1.16.

My computer https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10585495

My tasks https://lhcathome.cern.ch/lhcathome/results.php?userid=550738
ID: 43986